Bc. thesis, Master thesis or Research project

Code written by data scientists to explore data (in Jupyter Notebooks and similar systems) is interesting for a number of reasons. First, it often works with concrete data sources and so you can more closely integrate data into programs. Second, it often interleaves execution and programming (you write a bit of code, run it and then continue editing). The following ideas for projects around programming with data use those two ideas in one way or another.

  • Analysing data science code - First of all, it would be interesting to know how exactly data analytical code looks like. The idea of this project is to scrape code from GitHub and capture the differences between code written to analyze data (e.g. Python in Jupyter Notebooks) and normal source code (e.g. Python libraries and applications). In what ways are the two different?

  • Data integration via Type Providers - One interesting way of integrating data into programming languages is using type providers (see F# Data library and TheGamma project). Type providers generate types, based on some schema or logic, that a programmer can then use to access data. Related project may include implementing type providers for interesting data sources (e.g. Semantic Web and Linked Data) and also add support for type providers to other languages (e.g. TypeScript).

  • AI-based tools for data exploration - Tools like Copilot help programmers write code - but how can they most effectively be integrated with tools for data exploration? One possible idea is to see how to use the underlying machine learning models to semi-automatically construct code for accessing data (possibly using the above type providers) that the user is interested in, based on natural language queries.

  • Data visualizations to encourage critical thinking - How can we visualize data so that the result makes viewers think more critically about what they see? A nice example of this is the You Draw It visualization by New York Times. How can we built other visualizations like this? And could we also encourage readers to critically think about model behind the data (e.g. for Agent-based economic models)?