Benchmarking and Performance Evaluation

We like to measure how fast software is, and to find out why it is getting slower compared to the previous release.

We work on automating the detection of performance regressions, we develop the Renaissance benchmarking suite, and we have 10 years of performance data collected while measuring the GraalVM compiler.

Our work mostly revolves around regression benchmarking and around the Renaissance benchmarking suite. For each of these topics we describe several concrete projects to give a better idea about the things we like to work on.

Regression Benchmarking

Regression benchmarking refers to the practice of evaluating performance of individual software releases with performance tests, much as it is done with functional tests in functional regression testing. The following topics are related to our ongoing research activities in regression benchmarking:

Evaluating regression detection accuracy

In the past years, we have collected a large database of measurements, partially annotated with information about performance regressions. This task is about coming up with data analysis algorithms, including possible applications of deep learning approaches, to identify performance regressions in the measurements, and evaluating the detection accuracy and efficiency.

Medium size project for people interested in data analysis and real-life software performance, suitable as individual research project, bachelor or master thesis. Some knowledge of statistics and general experimental methodology is useful.
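To give a rough idea of what evaluating detection accuracy can mean, the following minimal sketch flags a release as a regression when its execution times are significantly higher than those of the previous release, and then compares the flagged releases with manual annotations. Everything in it is an assumption made for illustration: the data layout (one list of execution times per release), the Welch t statistic with a fixed threshold, and the function names. It does not describe the detection method or the data format we actually use.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(old, new):
    """Welch t statistic for the increase of mean execution time from old to new."""
    diff = mean(new) - mean(old)
    std_err = sqrt(variance(old) / len(old) + variance(new) / len(new))
    if std_err == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / std_err


def detect_regressions(releases, t_threshold=4.0):
    """Return indices of releases that are significantly slower than their predecessor."""
    return {
        k for k in range(1, len(releases))
        if welch_t(releases[k - 1], releases[k]) > t_threshold
    }


def evaluate(detected, annotated):
    """Return (precision, recall) of detected regressions against annotations."""
    true_positives = len(detected & annotated)
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(annotated) if annotated else 1.0
    return precision, recall


# Hypothetical execution times (milliseconds) for six consecutive releases.
releases = [
    [100.0, 101.0, 99.0, 100.0],
    [100.5, 99.5, 100.0, 101.0],
    [120.0, 121.0, 119.0, 120.0],   # large slowdown, easy to detect
    [119.5, 120.5, 120.0, 121.0],
    [103.0, 97.0, 104.0, 98.0],     # improvement, must not be flagged
    [106.0, 100.0, 107.0, 101.0],   # small slowdown hidden in noise
]
annotated = {2, 5}                  # hypothetical manual annotations

detected = detect_regressions(releases)
precision, recall = evaluate(detected, annotated)
print(detected, precision, recall)
```

With this toy data the small slowdown in the last release stays below the threshold, so the detector has perfect precision but misses one of the two annotated regressions. Trade-offs of this kind are what the evaluation is meant to quantify.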

Alternatively, the task can be about implementing innovative approaches to detection research, for example in the form of an API that would permit crowdsourcing the data analysis.

Small to medium size project for people interested in programming, suitable as individual research project, bachelor or master thesis. Knowledge of Python and Django, plus API technologies in general, is useful.

Summarizing regression test results

Our regression measurements dashboard often reports multiple regressions across a short span of several days. For developer convenience, it would be useful to generate brief textual summaries of weekly or monthly measurement results. The summaries could use the emerging LLM technologies to achieve fluency; however, factual accuracy is of utmost importance.

Small to medium size project for people with existing knowledge of LLMs, or willing to individually acquire additional expertise on LLM topics, suitable as individual research project, bachelor or master thesis.

Seeking regression root causes

Discovering the root cause (reason) of a particular performance regression is usually a manual task. This project is about coming up with methods and tools that make the search easier. Of particular interest is the ability to identify incidental performance changes that do not require developer attention.

Medium to large size project for people who enjoy researching difficult open problems, suitable as individual research project or master thesis. Knowledge of compilers and other programming tools, and computer architectures, is useful.

Assessing performance test coverage

Although the concept of test coverage is well defined for functional tests, an analogous concept for performance tests is still missing. This task is about coming up with ways to characterize performance test coverage.

A medium to large size project for people who enjoy open research topics, suitable as individual research project or master thesis. Knowledge of compilers and general working of computers is useful.

Renaissance: a Modern JVM Benchmark Suite

The development of modern compilers and programming language runtimes relies on empirical measurements to assess application performance. These measurements require benchmark suites that are reasonably compact but also represent practical applications. One such suite is Renaissance.

Renaissance is in constant development and provides many contribution opportunities:

Extending the suite with benchmarks in other languages

The Java Virtual Machine can execute a spectrum of languages supported by modern dynamic compilers (C, R, JS, Ruby and so on). Renaissance currently only supports Java and Scala. This task should extend the build system and include more workloads.

Medium size project for people who like coding in diverse environments. Knowledge of Java, the Scala build system, and build systems in general is useful.

Extending the suite with support for natively compiled serverless workloads

With the introduction of native compilation for JVM based languages, serverless workloads can reduce cold start delays. In this task, we want to extend the Renaissance harness to support metrics relevant to serverless workloads (especially cold start time), and introduce workloads that represent typical serverless applications.

Medium size project for people interested in serverless and performance engineering. Knowledge of Java, Scala, and serverless platforms is useful.
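As a rough illustration of the cold start metric, the sketch below measures the time from launching a process until it reports readiness on its standard output. It is not part of the Renaissance harness. The function name, the readiness marker, and the idea of detecting readiness from an output line are all assumptions made for this example, and a trivial Python child process stands in for a real workload.

```python
import subprocess
import sys
import time


def measure_cold_start(command, ready_marker):
    """Return seconds from process launch to the first output line with ready_marker."""
    start = time.perf_counter()
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        try:
            for line in process.stdout:
                if ready_marker in line:
                    return time.perf_counter() - start
        finally:
            # The measurement is done (or has failed), the process is no longer needed.
            process.kill()
    raise RuntimeError("process exited before reporting readiness")


# Stand-in for a real workload: a child process that reports readiness immediately.
command = [sys.executable, "-c", "print('ready', flush=True)"]
elapsed = measure_cold_start(command, "ready")
print(f"cold start: {elapsed * 1000:.1f} ms")
```

A real measurement would replace the stand-in command with the launch command of the workload, once on the JVM and once as a native executable, and repeat the measurement many times, because cold start times vary considerably between runs.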

Extending the existing workloads with an AI workload

The most important aspect of the Renaissance suite is that it provides a diverse set of workloads. In this task, we would like to extend the suite with a new workload from the world of artificial intelligence. The workload must be a reasonably isolated representative of a typical AI task.

Small to medium size project for people interested in both AI and performance engineering. Knowledge of Java, Scala, and AI frameworks is useful.

Augmenting the existing workloads with validation

In the context of compiler development, execution correctness is not necessarily guaranteed. To detect incorrect compiler or runtime behavior, some workloads in Renaissance validate their own results. This task should extend validation to all existing Renaissance workloads.

Small to medium size project for people who like coding in diverse environments. Knowledge of Java, Scala, and frameworks such as Spark is useful.
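The idea of a workload that validates its own result can be sketched as follows. The sketch is language neutral in spirit and is written in Python only for brevity. It does not use the Renaissance validation API, and both the toy workload and the function names are assumptions made for this example. The important property is that the expected value comes from an independent source, here a closed-form formula, so that a miscompiled loop does not silently confirm itself.

```python
def run_workload(n):
    """Toy workload: sum of squares of 0..n-1 computed by an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def expected_result(n):
    """Independent reference: closed-form sum of squares of 0..n-1."""
    return (n - 1) * n * (2 * n - 1) // 6


def validate(actual, expected):
    """Raise an error when the workload result differs from the reference."""
    if actual != expected:
        raise ValueError(f"validation failed: expected {expected}, got {actual}")


result = run_workload(1000)
validate(result, expected_result(1000))
print("validated result:", result)
```

Workloads that produce floating point or nondeterministic results need a looser check, for example a comparison within a tolerance or a check of an invariant of the result.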

Analyzing the workload coverage

It is essential that the benchmark workloads are diverse enough to approximate many applications, but also compact enough to execute in relatively short time. This task should come up with ways to analyze the coverage of the compiler or runtime by the benchmark.

A medium to large size project for people who enjoy open research topics. Knowledge of compilers and general working of computers is useful.