Type systems for data-centric programming

Static type systems are a popular and effective tool for writing robust and correct programs. They are used to check how programs access memory, call sub-components and communicate over the network. Unfortunately, conventional static type systems are inadequate for checking properties of programs that transform, explore and visualise data. Such programs use a terse, flexible and highly dynamic programming style whose subtle logic are not readily tracked using standard notions of types, making data-centric programming difficult and error-prone.

This project will develop new foundational type system concepts for data-centric programming in languages such as R and Python. We will initially focus on two specific application areas (data frames and the grammar of graphics) where a small change in parameters often has wide-ranging implications on the program logic. We will develop a family of novel type systems, describe them formally and prove their safety. The design will leverage recent techniques (some developed by the PI) including coeffects, type providers, graded types and refinement types and it will be informed by an empirical analysis of large open-source code repositories.

Research goals

We will develop a family of type systems for robust and safe data-centric programming. We aim to identify useful correctness properties of the terse dynamic code written by data scientists and develop techniques for automatically checking them. Consider the following R code, which uses a data frame library (dplyr) and a grammar of graphics library (ggplot) to plot the number of survivors of the Titanic disaster by their age in decades:

tdec = summarize(group_by(titanic, decade=round(age/10)), count=sum(survived))
ggplot(data = tdec, aes(x=decade, y=count)) + geom_point()

Conventional static type systems cannot readily check that “age” and “survived” are numerical columns, that “tdec” will contain numerical columns “decade” and “count” needed later and that “aes” requires both “x” and “y”. We will develop type systems that can check those properties by understanding (i) the shape of the “titanic” data frame, (ii) that “group_by” evaluates its arguments with “titanic” as the scope, (iii) that “summarize” turns the provided named parameters into new columns of “tdec”, and (iv) that the required parameters of “aes” are determined by “geom_point”. Such insights can be used to explain and prevent programming errors in typical code written by data analysts.

WP 1: Type systems for data frames and grammar of graphics

The objective is to design type systems for two specific challenging domains of data-centric programming. We will contribute:

A type system for data visualization libraries based on the grammar of graphics (ggplot2, ggvis, Vega) that checks the correctness of chart construction.
A type system for data frame manipulation libraries (tibble, dplyr, pandas) that checks how individual data transformation alter the shape of data frames.
Soundness proofs and decidable type inference algorithms for the two type systems.

WP 2: General-purpose data-centric type system

The objective is to design a general-purpose type system that allows users to specify semantically rich properties for a variety of data-centric libraries. We will contribute:

A type system that allows checking of user-defined properties that involve dynamic variable scoping, data shape transformations and contextual constraints.
A checker for the type system that leverages a SMT-based verification tool.
Type annotations for existing widely-used data science libraries.

WP 3: Empirical type system design and evaluation

The objective is to inform the design of our type systems empirically and evaluate their applicability to real-world problems. We will contribute:

A tool for analysis of data-centric code from large open-source code repositories, e.g., GitHub, possibly extending existing code analysis systems.
A set of analyses that provide insights into frequently used terse and dynamic programming constructs in data-centric code.
Empirical evaluation that finds and reports errors in public open-source R or Python code using our type systems

Research methodology

The project supplements core theoretical programming language research methods with formal verification and empirical code analysis, building on the PI’s work on coeffects [3], type providers [2] and data visualization [1,4]. Work packages WP1 and WP3 can be undertaken in parallel by two independent PhD students; WP2 extends WP1 and will benefit from verification expertise in D3S.

WP 1: Type systems for data frames and grammar of graphics

We will model the semantics of the layered grammar of graphics [5] and data frame libraries using operational semantics. The type system design will use coeffects [3] to model contextual requirements (e.g., scales in chart construction); row types and type providers [2] provide basis for checking of data transformations (e.g., reading of data, typing of “group_by”). We will combine a carefully restricted general type system concepts (a design approach used in TypeScript), tailored to suit data-centric programming idioms that we identify in WP3.

WP 2: General-purpose data-centric type system

We will develop a system based on refinement types [7] and grading [3,8] to allow analysts and library authors to express domain-specific constraints on the shape, format and statistical distribution of data. Checking such logical predicates will be done using a SMT-based theorem prover such as Z3 or Viper [9]. Using those methods for data-centric programming requires developing a vocabulary of high-level domain-specific constraint (informed by WP1) and extensions to the underlying SMT-based theorem provers, such as an SMT theory of data frames.

WP 3: Empirical type system design and evaluation

Programming languages (e.g., TypeScript, Hack) and formal methods (e.g., type system for core R [6]) are increasingly designed to reflect patterns in existing codebases. Such work is best informed by empirical analysis of open-source code repositories such as GitHub. We will analyse existing codebases to identify patterns in data science code to inform the design of our type systems, as well as to evaluate their effectiveness. We aim to extend an existing framework such as CodeDJ [10].

Open positions

We are looking for two PhD students and one post-doctoral researcher to join the project.

References

[1] Perera et al. (2022). Linked visualisations via Galois dependencies. POPL
[2] Petricek (2017). Data Exploration through Dot-driven Development. ECOOP
[3] Petricek et al. (2014). Coeffects: a calculus of context-dependent computation. ICFP
[4] Petricek (2021). Composable data visualizations. JFP 31, e13
[5] Wickham (2010). A layered grammar of graphics. J. Comput. Graph. Stat., 19(1), 3-28
[6] Turcotte et al. (2020). Designing types for R, empirically. PACMPL 4 (OOPSLA), 1-25.
[7] Vazou et al. (2015). Bounded refinement types. ICFP, 48-61
[8] Orchard et al. (2019). Quantitative program reasoning with graded modal types. PACMPL 3 (ICFP), 110
[9] Müller et al. (2016). Viper: A verification infrastructure for permission-based reasoning. VMCAI
[10] Maj et al. (2021). CodeDJ: Reproducible Queries over Large-Scale Software Repositories. ECOOP

Project data

Duration: 2024-2027
Funding body: PRIMUS, Charles University
Contact: Tomáš Petříček