
Memory Models

This is the last of three self-study modules that look at synchronization issues arising when multiple threads are used for concurrent programming. The goal of this module is to hint at the complexities associated with memory operation ordering, which were set aside until now.

At the end of this module, you should be able to:

  • demonstrate and recognize opportunities for memory access optimizations performed by common compilers,
  • demonstrate the effect of caches on memory access speed on sample code,
  • demonstrate the cost of cache coherency on sample code,
  • recognize sequentially consistent concurrent execution,
  • recognize a data race and use synchronization to fix it,
  • draw the program order relation for simple program fragments,
  • draw the synchronization order relation for simple program fragments.

Memory Access Optimizations

Among the most expensive operations in a program are memory accesses. That is why our compilers and processors try to eliminate memory accesses whenever possible. For example, consider this code:

int find_non_zero (int *array) {
    int position = 0;
    int value;
    do {
        value = array [position];
        position ++;
    } while (value == 0);
    return (value);
}

We do not really expect the compiler to emit code that would fetch the values of array and position from memory every time array [position] is computed, or that would update the value of position in memory every time it is incremented (or indeed store position in memory at all). We are used to seeing code like this:

find_non_zero:
        movl    4(%esp), %eax
.L2:
        movl    (%eax), %edx
        addl    $4, %eax
        testl   %edx, %edx
        je      .L2
        movl    %edx, %eax
        ret

Without going into detail, we can see that the code fetches the value of array from memory into the %eax register as the very first thing it does. Similarly, the value variable is kept in the %edx register rather than in memory, and the position variable is eliminated entirely; instead, the address of array [position] is incrementally updated in %eax.

The compiler is not where things stop, though. To reduce latency, the processor fetches memory content in blocks of 64 bytes and stores them in a local cache (these blocks are called cache lines). Since our program iterates through the input array in steps of four bytes (the integer size here), only the first access to each such block requires a memory access; the following 64/4 - 1 = 15 accesses are satisfied from the cache and are therefore fast.

The processor does even more. In the code, moving the pointer in %eax to the next array element (the addl $4, %eax instruction) happens after the current array element is read (the movl (%eax), %edx instruction); however, the read can take a long time, and the processor can compute the next pointer value even before the read completes. Also, the processor can easily detect that the code accesses memory sequentially, and can therefore initiate the read of the next cache line even before the code actually asks for it. In the worst case, such a prefetch is useless and wastes some cache space and memory bandwidth, but ideally it reduces latency when that cache line is needed.
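To see these effects on real hardware, you can time a traversal that touches every element of a large array exactly once but varies the distance between consecutive accesses. The sketch below is not part of the original example and the exact numbers depend on the machine; the expected trend is that once the stride reaches the cache line size, almost every access needs a new cache line, so neither cache line reuse nor sequential prefetch helps and the traversal becomes noticeably slower.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// 16 Mi integers, 64 MB, much larger than any cache.
#define SIZE (16 * 1024 * 1024)

// Touch every element exactly once, visiting elements that are stride
// integers apart. The amount of work is the same for every stride,
// only the memory access pattern changes.
long traverse (int *array, int stride) {
    long sum = 0;
    for (int start = 0 ; start < stride ; start ++) {
        for (int index = start ; index < SIZE ; index += stride) {
            sum += array [index];
        }
    }
    return (sum);
}

int main (void) {
    int *array = malloc (SIZE * sizeof (int));
    for (int i = 0 ; i < SIZE ; i ++) {
        array [i] = 1;
    }

    for (int stride = 1 ; stride <= 32 ; stride *= 2) {
        struct timespec before, after;
        clock_gettime (CLOCK_MONOTONIC, &before);
        long sum = traverse (array, stride);
        clock_gettime (CLOCK_MONOTONIC, &after);
        double seconds = (after.tv_sec - before.tv_sec) + (after.tv_nsec - before.tv_nsec) / 1e9;
        printf ("stride %2d sum %ld time %.3f s\n", stride, sum, seconds);
    }

    free (array);
    return (0);
}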

Please take a look at the MESI protocol explanation and animation at https://www.scss.tcd.ie/Jeremy.Jones/VivioJS/caches/MESIHelp.htm to see an example of how data is moved between memory and caches in a multiprocessor system and how cache coherency is maintained. You may want to consult Wikipedia or other online resources for a quick explanation of the protocol. Do not go into detail; the important points you should take away are:

  • access to memory happens through cache lines,
  • cache lines have states (modified, exclusive, shared, invalid),
  • multiple readers can each hold a copy of the same cache line in the shared state,
  • a writer requires an exclusive cache line and invalidates the shared copies held by other caches.

[Q] Here is a slightly modified version of the example program that demonstrated a data race on a shared counter:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

int counters [256];

#define LOOPS 100000000

void *counter_thread_body (void *arguments) {
    int *counter = (int *) arguments;
    for (int i = 0 ; i < LOOPS ; i ++) {
        __atomic_fetch_add (counter, 1, __ATOMIC_SEQ_CST);
    }
    return (NULL);
}

int main (int argc, char *argv []) {

    // One counter index comes from command line.
    int index = atoi (argv [1]);

    pthread_t thread_one, thread_two;

    // Launch two threads that both execute the same body.
    // Each thread will increment a different counter.
    pthread_create (&thread_one, NULL, counter_thread_body, &counters [0]);
    pthread_create (&thread_two, NULL, counter_thread_body, &counters [index]);
    // Wait for the two threads to finish.
    pthread_join (thread_one, NULL);
    pthread_join (thread_two, NULL);

    return (0);
}

This program accepts one argument on the command line, a counter index. It then launches two threads that each increment a different counter: thread_one increments counters [0], while thread_two increments counters [index] specified on the command line.

(When launched with index 0, the example becomes the same as the examples you already saw.)

Run the example with index 1 and with index 42 and time each run. One of the two cases will most likely be much slower than the other. Which one and why ?

Hint ...

The expensive operation here is the memory access. We know it takes place in units of cache lines.

Memory Access Ordering

As a direct implication of the optimizations mentioned above, the actual manner in which memory is accessed by an executing program can radically differ from what can be seen in the source code or even in the machine code.

This poses an obvious problem for programming with threads and shared memory. If synchronization is implemented through a carefully orchestrated set of memory accesses performed by the individual threads, how can we be sure the program will still work after the optimizations ?

As one possibly troublesome example, consider the following code:

volatile int data;
volatile bool data_is_valid = false;

void thread_one_function (void) {
    data = some_data_computation ();
    data_is_valid = true;
}

void thread_two_function (void) {
    while (!data_is_valid) {
        // Busy wait for signal to indicate that data was set.
    }
    printf ("Data is %i\n", data);
}

The idea of the code is to have one thread produce data and then use the data_is_valid variable to tell the other thread that the data is ready. But if the compiler or the processor does not know that data and data_is_valid are related, one or both can change the memory access order so that data is not yet set by the time thread_two_function tries to read it, even though data_is_valid suggests otherwise.
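One way to repair the example is to tell both the compiler and the processor about the relation between the two variables. The sketch below is an assumed rework using C11 atomics rather than the volatile variables above: the flag is published with a release store and read with an acquire load, which forbids the problematic reordering of the accesses to data.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

int some_data_computation (void); // Assumed to be defined elsewhere, as in the original fragment.

int data;
atomic_bool data_is_valid = false;

void thread_one_function (void) {
    data = some_data_computation ();
    // Release store: the write to data above cannot be reordered past this store,
    // and becomes visible to any thread that reads the flag with acquire semantics.
    atomic_store_explicit (&data_is_valid, true, memory_order_release);
}

void thread_two_function (void) {
    // Acquire load: once the flag is observed as true, the write to data
    // made before the release store is guaranteed to be visible here.
    while (!atomic_load_explicit (&data_is_valid, memory_order_acquire)) {
        // Busy wait for signal to indicate that data was set.
    }
    printf ("Data is %i\n", data);
}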

The flag example may look somewhat contrived; however, consider this one:

volatile singleton_t *singleton = NULL;
lock_t singleton_creation_lock;

singleton_t *get_singleton () {
    if (singleton == NULL) {
        singleton_creation_lock.lock ();
        if (singleton == NULL) {
            singleton = new singleton_t ();
        }
        singleton_creation_lock.unlock ();
    }
    return (singleton);
}

The code sketches an attempt at Double-Checked Locking, a design pattern that provides delayed creation of the singleton object. The pattern tries to prevent a race when multiple threads query a nonexistent singleton at the same time, but it also tries to avoid locking on the fast path once the singleton is created. The code does not work as intended, though: for example, there is no guarantee that the singleton variable will be set only after the singleton_t instance is fully constructed. A similarly incorrect version of the pattern was published, for example, in the famous POSA book, and it took the developer community several years to realize the pattern is broken.
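For comparison, here is one way the pattern can be repaired, sketched in C with C11 atomics and a pthread mutex; the singleton structure and its initialization are placeholders rather than the lock_t and singleton_t API from the fragment above. The key point is that the pointer is published with a release store only after the object is fully constructed, and the fast path reads it with an acquire load, so a thread that sees a non-NULL pointer is guaranteed to also see the constructed object.

#include <stdatomic.h>
#include <stdlib.h>
#include <pthread.h>

// Placeholder singleton type for the sketch.
typedef struct singleton { int value; } singleton_t;

static _Atomic (singleton_t *) singleton = NULL;
static pthread_mutex_t singleton_creation_lock = PTHREAD_MUTEX_INITIALIZER;

singleton_t *get_singleton (void) {
    // Fast path: acquire load pairs with the release store below.
    singleton_t *instance = atomic_load_explicit (&singleton, memory_order_acquire);
    if (instance == NULL) {
        pthread_mutex_lock (&singleton_creation_lock);
        // Re-check under the lock; relaxed is enough here because the lock
        // already orders this load against a store made by another creator.
        instance = atomic_load_explicit (&singleton, memory_order_relaxed);
        if (instance == NULL) {
            instance = malloc (sizeof (singleton_t));
            instance->value = 42; // Construct the object completely first ...
            // ... and only then publish the pointer with a release store.
            atomic_store_explicit (&singleton, instance, memory_order_release);
        }
        pthread_mutex_unlock (&singleton_creation_lock);
    }
    return (instance);
}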

Memory Models

To enable robust programming with threads and shared memory, programming languages have introduced memory models. A memory model is a set of formal rules that describe guarantees on interaction through shared memory. The memory model must reconcile two opposing concerns:

  • the model must be simple enough and restrictive enough to permit comfortable use by software developers, and
  • the model must be flexible enough to permit reasonable optimizations in both the compiler and the processor.

Please take a look at the Java memory model. Probably the best document to use is JSR133; it should be enough to read the informal part, roughly up to and including Section 6. There is also a starting page with a lot of additional material. What you should take away (illustrated by the small sketch after this list) is:

  • what is a data race,
  • what is a program order,
  • what is a synchronization operation,
  • what is a sequentially consistent execution.
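To make these notions concrete in the C and pthreads setting used throughout this module (the terms themselves are defined for Java in JSR133), here is a small sketch that contains both a data race and properly synchronized accesses:

#include <pthread.h>

int racy = 0;    // Accessed by both threads without synchronization.
int guarded = 0; // Accessed only while holding the mutex.
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *thread_body (void *arguments) {
    racy ++;                        // Conflicting unsynchronized accesses: a data race.
    pthread_mutex_lock (&lock);     // Lock and unlock are synchronization operations ...
    guarded ++;                     // ... so these accesses are ordered between the threads.
    pthread_mutex_unlock (&lock);
    return (NULL);
}

int main (void) {
    pthread_t thread_one, thread_two;
    pthread_create (&thread_one, NULL, thread_body, NULL);
    pthread_create (&thread_two, NULL, thread_body, NULL);
    pthread_join (thread_one, NULL);
    pthread_join (thread_two, NULL);
    return (0);
}

Within each thread the statements are related by program order; the lock and unlock calls form the synchronization order that makes the accesses to guarded race free, while the accesses to racy remain unordered and therefore constitute a data race. The accesses to guarded are guaranteed to leave it holding 2, whereas racy may not reach 2.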

To implement a memory model, the programming language must rely on guarantees provided by the processor. Please take a look at the Intel Processor Manual, Volume 3A, Section 8.2, Memory Ordering, for examples of the guarantees on memory access order provided in practice. Also see the types of memory ordering fences (sometimes also called memory barriers) that the processor implements.
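As a small illustration of fences from C (a sketch using the GCC __atomic builtins already seen in the counter example, not an excerpt from the processor manual), the producer and consumer from the earlier flag example can also be ordered with standalone fences instead of release and acquire accesses:

#include <stdbool.h>
#include <stdio.h>

int some_data_computation (void); // Assumed to be defined elsewhere, as in the earlier fragment.

int data;
bool data_is_valid = false;

void thread_one_function (void) {
    data = some_data_computation ();
    // Release fence: the write to data cannot be reordered past the flag store below.
    __atomic_thread_fence (__ATOMIC_RELEASE);
    __atomic_store_n (&data_is_valid, true, __ATOMIC_RELAXED);
}

void thread_two_function (void) {
    while (!__atomic_load_n (&data_is_valid, __ATOMIC_RELAXED)) {
        // Busy wait for signal to indicate that data was set.
    }
    // Acquire fence: the read of data below cannot be reordered before the flag load above.
    __atomic_thread_fence (__ATOMIC_ACQUIRE);
    printf ("Data is %i\n", data);
}

On a processor with a strong memory ordering model, such as the one described in the Intel manual, these particular fences may compile to no instructions at all; on architectures with weaker ordering they emit the corresponding barrier instructions.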

Eager For More ?

Want to delve into the topic beyond the standard module content ? We’ve got you covered !

- More about memory models ?
    - Memory Models Series at https://research.swtch.com/mm (very nice, recommended !)
    - Go Memory Model at https://go.dev/ref/mem
    - C++ Memory Model material at https://hboehm.info/c++mm
    - C++ Memory Model atomic operations description at https://en.cppreference.com/w/cpp/atomic/memory_order

- More about caches ?
    - Stream prefetcher analysis at https://doi.org/10.1109/EuroSPW51379.2020.00098