2.4.3. Memory Models

The synchronization examples presented so far have assumed that operations happen in the same order as specified in code. While this assumption facilitates understanding, it is not necessarily true. The operations in code are translated by the compiler into processor instructions, and the processor instructions direct the processor to execute series of memory operations; both the compiler and the processor may decide to change the order of these operations, typically to improve execution efficiency.

Obviously, it is not possible to permit arbitrary changes to the order of memory operations. Compilers and processors therefore adhere to memory models, which are formal sets of rules that define how memory operations behave. A memory model is typically designed to approximate a reasonably intuitive behavior for most programs, and to provide exceptions only where such behavior would lead to a significant loss of efficiency. Examples of such reasonably intuitive behavior are sequential consistency or linearizability.

Following are brief examples of some memory models. As a disclaimer, these are all necessarily simplified explanations that condense many pages of formal memory model definitions into a few paragraphs, and are therefore not entirely accurate.

References. 

  1. Hans Boehm: Threads Basics.

  2. Hans Boehm: Threads Cannot Be Implemented As A Library.

2.4.3.1. Example: Memory Model On Intel 80x86 Processors

The Intel 80x86 processors guarantee that all read and write instructions operating on shared memory are atomic when using aligned addresses. Other instructions may or may not be atomic depending on the particular processor. In particular, read and write instructions operating within a single cache line are often atomic, while read and write instructions operating across cache lines are not. Read-modify-write instructions can be made atomic using a special LOCK prefix.
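
As an illustration, the following C++ sketch contrasts a plain counter increment with an atomic counter increment. On the 80x86 processors, compilers typically translate the atomic increment into a LOCK prefixed read-modify-write instruction such as LOCK XADD, while the plain increment is translated into separate read and write instructions that are not atomic together and can therefore lose updates. The exact instructions chosen depend on the compiler, and the variable and function names are illustrative only.

  #include <atomic>
  #include <iostream>
  #include <thread>

  int plain_counter = 0;                 // plain variable, concurrent increments constitute a data race
  std::atomic<int> atomic_counter {0};   // atomic variable, concurrent increments are safe

  void worker() {
      for (int i = 0; i < 1000000; i++) {
          plain_counter++;                                          // separate read and write, updates can be lost
          atomic_counter.fetch_add(1, std::memory_order_relaxed);   // atomic read-modify-write, typically LOCK XADD
      }
  }

  int main() {
      std::thread first(worker);
      std::thread second(worker);
      first.join();
      second.join();
      // The atomic counter always ends at 2000000, the plain counter can end lower.
      std::cout << plain_counter << " " << atomic_counter.load() << std::endl;
      return 0;
  }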

Starting with the Intel Pentium 4 processors, multiple memory ordering models were also introduced to enable optimizations based on the reordering of memory accesses. The basic memory ordering model works roughly as follows:

  • Reads by a single processor are carried out in the program order.

  • Most writes by a single processor are carried out in the program order.

  • Reads and writes by a single processor to the same address are carried out in the program order.

  • Younger reads and older writes by a single processor to different addresses are not necessarily carried out in program order.

  • Writes by a single processor are observed in the program order by the other processors.

  • Writes by multiple processors are not necessarily observed in the same order by these processors, but are observed in the same order by the other processors.

  • Reads and writes are causally ordered.
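
The reordering of younger reads with older writes can be illustrated with the following C++ sketch, where each of two threads first writes one variable and then reads the other. Besides the outcomes possible by simply interleaving the two threads, the basic memory ordering model also permits the outcome where both threads read zero, because the read in each thread may be carried out before the write. The sketch uses relaxed atomic operations so that the compiler itself does not introduce additional reordering or data races; the variable and function names are illustrative.

  #include <atomic>
  #include <iostream>
  #include <thread>

  std::atomic<int> x {0}, y {0};
  int r1, r2;

  void first_thread() {
      x.store(1, std::memory_order_relaxed);    // older write to x
      r1 = y.load(std::memory_order_relaxed);   // younger read of y, may be carried out before the write
  }

  void second_thread() {
      y.store(1, std::memory_order_relaxed);    // older write to y
      r2 = x.load(std::memory_order_relaxed);   // younger read of x, may be carried out before the write
  }

  int main() {
      std::thread first(first_thread);
      std::thread second(second_thread);
      first.join();
      second.join();
      // Besides the interleaving outcomes 0 1, 1 0 and 1 1, the outcome 0 0 is also possible.
      std::cout << r1 << " " << r2 << std::endl;
      return 0;
  }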

Other memory ordering models include the strong ordering model, where all reads and writes are carried out in the program order, and the write combining model, where writes to the same cache line can be combined. The memory ordering model can be set on a per page and a per region basis.

The LFENCE, SFENCE and MFENCE instructions can be used to force ordering. The LFENCE instruction forms an ordering barrier for all load instructions, the SFENCE instruction forms an ordering barrier for all store instructions, and the MFENCE instruction combines the effects of LFENCE and SFENCE.
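
Continuing the previous sketch, placing a full fence between the write and the read in each thread prevents the problematic reordering, so at least one of the threads is guaranteed to observe the write of the other. The sketch uses the portable std::atomic_thread_fence operation, which compilers typically translate into an MFENCE or a LOCK prefixed instruction on the 80x86 processors; the _mm_lfence, _mm_sfence and _mm_mfence intrinsics offered by common compilers can be used to emit the individual fence instructions directly.

  #include <atomic>
  #include <thread>

  std::atomic<int> x {0}, y {0};
  int r1, r2;

  void first_thread() {
      x.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst);   // full fence, typically MFENCE on 80x86
      r1 = y.load(std::memory_order_relaxed);                // can no longer be carried out before the write
  }

  void second_thread() {
      y.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst);   // full fence, typically MFENCE on 80x86
      r2 = x.load(std::memory_order_relaxed);                // can no longer be carried out before the write
  }

  int main() {
      std::thread first(first_thread);
      std::thread second(second_thread);
      first.join();
      second.join();
      // The outcome where both r1 and r2 are zero is no longer possible.
      return 0;
  }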

The MONITOR and MWAIT instruction pair was also introduced starting with the Intel Pentium 4 processors. The MONITOR instruction sets up an address to monitor for access. The MWAIT instruction waits until the address is accessed. The purpose of the instruction pair is to make multiprocessor synchronization more efficient, because the processor can be put into a power saving mode while waiting for the access.

The PAUSE instruction can be used inside an active waiting loop to reduce the potential for collisions in instruction execution and to avoid executing the active waiting loop to the detriment of other work on processors with hyperthreading.
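
A typical use of the PAUSE instruction is sketched below in C++, with the instruction emitted through the _mm_pause intrinsic offered by common compilers for the 80x86 processors; the names of the flag and the functions are illustrative.

  #include <atomic>
  #include <thread>
  #include <immintrin.h>   // _mm_pause intrinsic, specific to the 80x86 processors

  std::atomic<bool> ready {false};

  void wait_for_flag() {
      // Active waiting loop, the PAUSE instruction reduces the impact
      // of the loop on the other hardware threads of the same core.
      while (!ready.load(std::memory_order_acquire)) {
          _mm_pause();
      }
  }

  int main() {
      std::thread waiter(wait_for_flag);
      ready.store(true, std::memory_order_release);
      waiter.join();
      return 0;
  }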

References. 

  1. Intel: Intel 64 and IA-32 Architectures Software Developer's Manual.

2.4.3.2. Example: Memory Model On MIPS32 Processors

The MIPS32 processors may implement a cache coherency protocol. One of five memory coherency models can be set on a per page basis.

  • Uncached - the data are never cached, reads and writes operate directly on memory.

  • Noncoherent - the data are cached, reads and writes operate on cache, no coherency is guaranteed.

  • Sharable - the data are cached, reads and writes operate on cache, write by one processor invalidates caches of other processors.

  • Update - the data are cached, reads and writes operate on cache, write by one processor updates caches of other processors.

  • Exclusive - the data are cached, reads and writes by one processor invalidate caches of other processors.

The LL and SC instructions can be used to implement a variety of test-and-set and compare-and-swap operations. The LL instruction reads data from memory, and additionally stores the address that was read in the LLaddr register and sets the LLbit register to true. The processor will set the LLbit register to false whenever another processor performs a cache coherent write to the cache line containing the address stored in the LLaddr register. The SC instruction stores data to memory if the LLbit register is true and returns the value of the LLbit register.
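
The LL and SC instructions are not visible in high level languages, but they are what compilers typically use to implement atomic read-modify-write operations on the MIPS processors. The following C++ sketch implements an atomic maximum update using the compare_exchange_weak operation; the weak variant is allowed to fail spuriously, which matches the behavior of the SC instruction, and is therefore meant to be used in a loop. The variable and function names are illustrative.

  #include <atomic>

  std::atomic<int> maximum {0};

  // Atomically raise the maximum to the given value. On the MIPS processors,
  // the compare_exchange_weak operation is typically compiled into an LL and SC pair.
  void update_maximum(int value) {
      int current = maximum.load(std::memory_order_relaxed);
      while (current < value &&
             !maximum.compare_exchange_weak(current, value, std::memory_order_relaxed)) {
          // On failure, compare_exchange_weak stores the latest value of maximum
          // into current, so the loop condition is evaluated with a fresh value.
      }
  }

  int main() {
      update_maximum(42);
      return 0;
  }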

References. 

  1. MIPS Technologies: MIPS32 4K Processor Core Family Software User's Manual.

  2. Joe Heinrich: MIPS R4000 Microprocessor User's Manual.

2.4.3.3. Example: Memory Model In C++

The C++ memory model defines the order of operations within a single thread by a sequencing relation. Even a single thread may contain operations that are not sequenced with respect to each other; these are typically found in expressions with multiple side effects on the same data. Where the result depends on the order of operations that are not sequenced, the behavior is generally undefined.
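
As an example, the side effects of the two increments in the following C++ fragment are not sequenced with respect to each other, and the behavior of the program is therefore undefined.

  int main() {
      int i = 0;
      // The side effects of the two increments are not sequenced with respect
      // to each other, the behavior of the program is therefore undefined.
      int result = i++ + ++i;
      return result;
  }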

Threads can perform synchronization operations. Typically, these are either atomic variable operations or library synchronization operations. A synchronization operation is classified as release, acquire or consume. An atomic variable write can be a release operation; an atomic variable read can be an acquire or a consume operation. A lock operation is an acquire operation; an unlock operation is a release operation.

  • Release and later acquire of the same object gives rise to synchronization order.

  • Release and later consume of the same object gives rise to dependency order.

  • Together with sequencing, this creates the happens-before order.

Intuitively, the happens-before order captures operation ordering that is explicit in the program, that is, ordering that is due either to sequencing in a single thread or to synchronization operations between threads. Operations from different threads that access the same object, where at least one is a write and at least one is not atomic, must be ordered by the happens-before order; otherwise they constitute a data race and the behavior is undefined. In general, the last write operation in the happens-before order is also the visible one.
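
A minimal C++ sketch of the release and acquire operations used to pass data between threads follows. The release store and the later acquire load of the same atomic flag synchronize with each other, and together with sequencing this creates the happens-before order between the write and the read of the data variable; without the flag, the two accesses to the data variable would constitute a data race. The variable and function names are illustrative.

  #include <atomic>
  #include <cassert>
  #include <thread>

  int data = 0;                     // ordinary shared variable
  std::atomic<bool> flag {false};   // atomic variable used for synchronization

  void writer() {
      data = 42;                                      // sequenced before the release store
      flag.store(true, std::memory_order_release);    // release operation
  }

  void reader() {
      while (!flag.load(std::memory_order_acquire)) { }   // acquire operation
      // The write of data happens before this read, the assertion cannot fail.
      assert(data == 42);
  }

  int main() {
      std::thread first(writer);
      std::thread second(reader);
      first.join();
      second.join();
      return 0;
  }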

The memory model treats objects as occupying distinct memory locations, with the exception of bit fields, where multiple adjacent bit fields share the same memory location. Distinct memory locations can be updated by different threads without interference.
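
The effect of bit fields on memory locations is sketched below. The two adjacent bit fields in the first structure share a single memory location and therefore cannot be updated by two different threads without creating a data race, while the zero width bit field in the second structure separates the two bit fields into distinct memory locations that can be updated concurrently. The structure and member names are illustrative.

  struct shared_location {
      int first : 4;    // first and second share a single memory location,
      int second : 4;   // concurrent updates by two threads would be a data race
  };

  struct separate_locations {
      int first : 4;    // occupies its own memory location
      int : 0;          // zero width bit field separates the memory locations
      int second : 4;   // occupies its own memory location
  };

  int main() {
      shared_location one {};
      separate_locations two {};
      one.first = 1;    // within a single thread, the sharing does not matter
      two.second = 2;
      return 0;
  }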

2.4.3.4. Example: Memory Model In Java

The memory model of the Java programming language distinguishes correctly synchronized programs and programs with races. The formal definition of the two cases relies on programming language constructs that necessarily introduce ordering:

  • The execution order prescribed by the program code for a single thread defines the program order of actions. Simply put, statements that come earlier in the program code are ordered before statements that come later.

  • The execution order prescribed by synchronization between threads defines the synchronization order of actions. For example, unlocking a lock comes before locking the same lock later. Besides locks, synchronization order also considers thread lifecycle actions and accesses to volatile variables.

Both the program order and the synchronization order are evident and intuitive. The memory model further defines a happens-before order, which can be roughly described as a transitive closure of the program order and the synchronization order.

To define whether a program is correctly synchronized, the memory model asks whether potentially dangerous accesses to shared variables can occur in certain reasonable executions of the program:

  • Each execution of a program is defined as a total order over all program actions.

  • Reasonable executions are executions where actions observe the program order and the synchronization order and where all reads return values written by the latest writes. These executions are called sequentially consistent.

  • Potentially dangerous accesses to shared variables are accesses to the same variable that include writes, when the accesses are not ordered by the happens-before order. Such accesses constitute a data race.

A program is said to be correctly synchronized when its sequentially consistent executions contain no data race.

When read in all its formal glory, the definition of correctly synchronized programs is complex. Still, it has a very intuitive motivation. We know that it is potentially dangerous to access shared variables without synchronization. The memory model tells us that a program is correctly synchronized if all accesses to shared variables are ordered through synchronization in any possible interleaving of the individual program threads.

In reality, program execution is not limited to interleaving of the individual threads. Both the compiler and the processor may decide to change the order of certain actions. For programs that are correctly synchronized, the memory model guarantees that any execution will appear to be sequentially consistent - that is, whatever reordering happens under the hood, the observed effects will always be something that could happen if the individual threads were simply interleaved.

For programs with races, the memory model guarantees are weaker. Reads will not see writes in contradiction of the happens-before order. First, a read will not see a write that comes after the read in the happens-before order. Second, a read will not see a write that is hidden by a later write to the same variable, that is, a write that comes after the hidden write but still before the read in the happens-before order. The memory model also prohibits causality loops; the formal requirement is based on committing individual actions within executions in a manner that prevents causality loops from occurring.

Accesses to all basic types except long and double are atomic in Java. Accesses to long and double variables are guaranteed to be atomic only when the variables are declared volatile.

References. 

  1. James Gosling, Bill Joy, Guy Steele, Gilad Bracha: The Java Language Specification.