3.2.3. Heap

The process heap is used for dynamically allocated variables. The heap is stored in one or several contiguous blocks of memory within the virtual address space. These blocks include a data structure used to keep track of what parts of the blocks are used and what parts are free. This data structure is managed by the heap allocator of the process.

In a sense, the heap allocator duplicates the function of the virtual memory manager, for they are both responsible for keeping track of blocks of memory. Typically, however, the blocks managed by the heap allocator are many, small, short-lived and aligned on cache boundaries, while the blocks managed by the virtual memory manager are few, large, long-lived and aligned on page boundaries. This distinction makes it possible to design the heap allocator so that it is better suited for managing blocks containing dynamically allocated variables than the virtual memory manager. Usually, the heap allocator resides within a shared library used by the processes of the operating system. The kernel of the operating system has a separate heap allocator.

3.2.3.1. Heap Allocators

The usual requirements on an allocator are speed (the ability to allocate and free memory quickly), economy (small overhead of the allocator data structures and small fragmentation) and functionality (support for resizing, alignment, zero filling).

Allocators typically keep track of free and used memory using either linked lists or bitmaps. Bitmaps are efficient when allocating blocks whose size is close to their granularity; their disadvantages are internal fragmentation and the awkward search for a free block of the requested length. Linked lists suffer from the overhead of the list itself, external fragmentation and sequential search. Common refinements include separate lists of used and free blocks, special lists of blocks of frequent sizes, also called zones, and coalescing of adjacent free blocks.
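
As an illustration, the following is a minimal sketch of bitmap style bookkeeping, with hypothetical constants and names; one byte per granule is used instead of one bit to keep the search code short.

#include <stddef.h>

#define GRANULE 64                      // allocation granularity in bytes
#define GRANULES 1024                   // granules covered by the map

static unsigned char used [GRANULES];   // one byte per granule for brevity

// Search the map for a run of free granules long enough to hold
// size bytes, mark the run as used and return its granule index.
static int map_alloc (size_t size) {
  int run = (int) ((size + GRANULE - 1) / GRANULE);
  if (run < 1) run = 1;                 // allocate at least one granule
  for (int start = 0; start + run <= GRANULES; start ++) {
    int length = 0;
    while (length < run && !used [start + length]) length ++;
    if (length == run) {
      for (int i = 0; i < run; i ++) used [start + i] = 1;
      return (start);
    }
    start += length;                    // skip past the used granule found
  }
  return (-1);                          // no run long enough exists
}

Note how the search time grows with the length of the requested run, which is the awkwardness mentioned above.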

Several strategies can be used when allocating a new block. The simplest is first fit, or its next fit modification. Another is best fit, which unfortunately tends to leave behind very small free blocks; worst fit, which was tried as a remedy, did not perform particularly well either. Keeping special lists of blocks of frequent sizes is sometimes called quick fit. The buddy system also belongs to this family, with blocks of common sizes created by recursively splitting larger blocks into halves, a known problem being the overhead of blocks whose requested length is exactly a power of two.
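
A sketch of the first fit strategy over a singly linked list of free blocks follows; the names are illustrative and block splitting is omitted.

#include <stddef.h>

typedef struct block {
  size_t size;                          // size of this free block
  struct block *next;                   // next block in the free list
} block_t;

static block_t *free_list;              // head of the free list

// First fit traverses the list and unlinks the first block
// that is large enough to satisfy the request.
static block_t *alloc_first_fit (size_t size) {
  block_t **link = &free_list;
  for (block_t *block = free_list; block; link = &block->next, block = block->next) {
    if (block->size >= size) {
      *link = block->next;              // unlink the block from the list
      return (block);
    }
  }
  return (NULL);                        // no fit, more memory is needed
}

Best fit would instead remember the smallest sufficient block seen during the traversal, and next fit would resume the search where the previous one ended.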

Overhead statistics reported for specific applications cite 4% for best fit, 7% for first fit on a FIFO list of free blocks, 50% for first fit on a LIFO list of free blocks, and 60% for the buddy system.

[M. S. Johnstone & P. R. Wilson: The Memory Fragmentation Problem Solved, ACM SIGPLAN 34/3, 3/1999]

See also [P. R. Wilson & M. S. Johnstone & M. Neely & D. Boles: Dynamic Storage Allocation: A Survey and Critical Review, International Workshop on Memory Management, September 1995, ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps]

Buddy system. The advantage of the buddy system is mainly that when a block is freed, the candidate for merging into a larger free block is easy to find. The disadvantage is potentially high internal fragmentation, caused by the fixed set of block sizes.

An implementation of the buddy system needs to store information about the blocks and the lists of free blocks somewhere. This can be done for example in headers attached to the blocks themselves, which makes the usable blocks slightly smaller, but not much information needs to reside in a header. Alternatively, a bitmap with one bit for each block on each level of the buddy system can be placed alongside the allocated memory.
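
The easy search for the merge candidate follows from simple address arithmetic, as the sketch below shows; the heap base variable and the function name are hypothetical, and blocks on level k are assumed to have a size of 2^k bytes.

#include <stddef.h>

static char *heap_base;                 // start of the buddy system memory

// The buddy of a block is the neighbor of the same size whose offset
// differs from the block offset exactly in the bit given by the level.
static char *buddy_of (char *block, int level) {
  size_t offset = (size_t) (block - heap_base);
  return (heap_base + (offset ^ ((size_t) 1 << level)));
}

// When freeing a block, the allocator checks whether the buddy is also
// free and, if so, merges the two blocks into one block on level + 1.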

Incidentally, allocators on multiprocessor systems face problems similar to those of schedulers with a ready queue, namely that too much contention slows them down. This is why hierarchical allocators are used, with local free block pools that spill over into a global free block pool as needed.

3.2.3.1.1. Example: dlmalloc Heap Allocator

dlmalloc is a general purpose allocator written and maintained by Doug Lea since 1992. It was used in early versions of GNU LibC, and other allocators, such as ptmalloc or nedmalloc, are derived from it. The allocator is suitable for workloads with block sizes from tens of bytes and few concurrent allocations. The current description is for version 2.8.6 from August 2012.

Internally, the allocator distinguishes three ranges of block sizes. [1]

  • For blocks of less than 256 bytes, the allocator keeps an array of linked lists called bins. Each bin contains free blocks of one particular size, from the minimum block size of 16 bytes up to 256 bytes in steps of 8 bytes. Requests for blocks in this range are satisfied from the bins using exact fit (see the sketch after this list). If an exact fit is not available, the remainder left over from the last best fit split is used.

  • For blocks between 256 bytes and 256 kilobytes, the allocator keeps an array of bins for ranges of sizes in powers of two. Each bin is a binary trie whose elements are lists of blocks of equal size. In earlier versions, each bin was a sorted list of blocks. Best fit is used for allocation from bins.

  • Finally, mmap is used to allocate blocks larger than 256 kilobytes if a free block is not available.
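
The sketch mentioned in the small block item computes the bin index directly from the requested size, using the constants cited above; the names are simplified stand-ins, dlmalloc itself uses macros for this purpose.

#include <stddef.h>

#define MIN_CHUNK 16                    // minimum block size in bytes
#define SMALL_STEP 8                    // small bin spacing in bytes

// Round the size up to the bin granularity and derive the bin index,
// every small bin holds free blocks of exactly one size.
static int small_bin_index (size_t size) {
  if (size < MIN_CHUNK) size = MIN_CHUNK;
  size = (size + SMALL_STEP - 1) & ~ ((size_t) SMALL_STEP - 1);
  return ((int) (size / SMALL_STEP));
}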

Both free and allocated blocks, called chunks, have headers and footers. Free blocks also carry trie or list references in the space where allocated blocks store user data. This simplifies the management of the allocator structures as well as the block splitting and coalescing operations.

Figure 3.1. Allocated Chunk Structure

chunk +-----------------------------------------------+
      | Size of previous chunk (if P = 0)             |
      +---------------------------------------+---+---+
      | Size of this chunk                    | 1 | P |
data  +---------------------------------------+---+---+
      |                                               |
      : User data (size - sizeof (size_t) bytes)      :
      |                                               |
chunk +-----------------------------------------------+
      | Size of this chunk again                      |
      +---------------------------------------+---+---+
      | Size of next chunk                    | U | 1 |
data  +---------------------------------------+---+---+

// Adjusted from dlmalloc source code comments.

Figure 3.2. Free Chunk Structure

chunk +-----------------------------------------------+
      | Size of previous chunk (if P = 0)             |
      +---------------------------------------+---+---+
      | Size of this chunk                    | 0 | P |
      +---------------------------------------+---+---+
      | Pointer to next chunk in bin list             |
      +-----------------------------------------------+
      | Pointer to previous chunk in bin list         |
      +-----------------------------------------------+
      | Pointer to left child when in bin trie        |
      +-----------------------------------------------+
      | Pointer to right child when in bin trie       |
      +-----------------------------------------------+
      | Pointer to parent when in bin trie            |
      +-----------------------------------------------+
      | Bin index when in bin trie                    |
      +-----------------------------------------------+
      |                                               |
      : Free                                          :
      |                                               |
chunk +-----------------------------------------------+
      | Size of this chunk again                      |
      +---------------------------------------+---+---+
      | Size of next chunk                    | U | 0 |
data  +---------------------------------------+---+---+

// Adjusted from dlmalloc source code comments.

There is no support for efficient handling of concurrent allocations. The allocator uses a single big lock to protect its structures.

References. 

  1. Doug Lea: A Memory Allocator. http://gee.cs.oswego.edu/dl/html/malloc.html

3.2.3.1.2. Example: ptmalloc Heap Allocator

ptmalloc is a general purpose allocator written and maintained by Wolfram Gloger since 1999. It is derived from dlmalloc and used in late versions of GNU LibC. The current description is for version 2 from May 2006.

As far as the core allocation structures and algorithms go, ptmalloc is much the same as dlmalloc. Some minor differences are:

  • The allocator uses sorted lists instead of binary tries in bins for large blocks.

  • The allocator uses a special array of fast bins for blocks of up to 80 bytes. When freed, these blocks are kept marked as used and put into the appropriate fast bin. Allocation is satisfied using exact fit from fast bins when possible. Fast bins are emptied under multiple heuristic conditions including when an exact fit allocation from any fast bin fails, when a large block is allocated or freed, when fast bins contain too many blocks or when the allocator needs more memory.

  • Even blocks that do not fit into fast bins are not returned to standard bins immediately. Instead, they are kept marked as used and put into a special unsorted list. On allocation, this list is traversed and if an exact fit is found, the block is reused. All blocks that are not an exact fit are put into bins during the traversal. To avoid anomalies, at most 10000 blocks are examined in one allocation.

The big difference between ptmalloc and dlmalloc is in support for concurrent allocations. The ptmalloc allocator maintains multiple arenas. Each arena is an independent instance of the allocator protected by a single big lock. On allocation, the thread invoking the allocator attempts to lock an arena, starting with the one it used last. If an arena is locked successfully, it is used to satisfy the allocation request, otherwise a new arena is allocated.

To find an arena given a block pointer, simple pointer arithmetic is used. All arenas except one start at addresses that are multiples of the maximum arena size, one megabyte by default. Masking off the block pointer bits that address within the arena therefore yields the arena start pointer. One arena is designated as the main arena and is not subject to the size limit. Main arena blocks are identified by header bits.
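
The lookup can be sketched as follows; the constant and the function name are simplified stand-ins for what ptmalloc expresses with its HEAP_MAX_SIZE constant and heap_for_ptr macro.

#include <stdint.h>

#define ARENA_MAX_SIZE (1024 * 1024)    // maximum arena size, one megabyte

// Arenas other than the main arena start at addresses aligned to the
// maximum arena size, masking off the low bits of a pointer into the
// arena therefore yields the arena start.
static void *arena_for_ptr (void *ptr) {
  return ((void *) ((uintptr_t) ptr & ~ ((uintptr_t) ARENA_MAX_SIZE - 1)));
}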

References. 

  1. Wolfram Gloger: A Memory Allocator. http://www.malloc.de

3.2.3.1.3. Example: tcmalloc Heap Allocator

tcmalloc is a general purpose allocator written and maintained by Sanjay Ghemawat since 2004. The current description is for version 2.1 from July 2013.

The allocator fetches memory from the system in spans, which are contiguous sequences of pages. A span is either free, or assigned to a single large block, or assigned to small blocks of the same size class. To convert a block pointer to a span reference, the allocator maintains a tree indexed by the page number of the block, augmented with a mapping cache and stored in system allocated memory blocks. In this arrangement, small blocks do not need individual headers.
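
A minimal sketch of the pointer to span conversion, assuming 4 kilobyte pages and using a flat array in place of the tree and the cache; the names and the flat layout are illustrative simplifications.

#include <stdint.h>

#define PAGE_SHIFT 12                   // 4 kilobyte pages

typedef struct span span_t;             // span bookkeeping structure

// One entry per page, flat for brevity, a 32 bit address space is
// assumed so that the array covers all possible page numbers.
static span_t *page_to_span [1 << 20];

// The span of a block is found by indexing with the page number.
static span_t *span_of (void *ptr) {
  return (page_to_span [(uintptr_t) ptr >> PAGE_SHIFT]);
}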

Free memory in spans is referenced by free lists residing in either the thread local caches or the central heap. The free lists themselves are LIFO singly linked lists stored within the free blocks (a sketch of such a list follows this description). The allocator behavior differs based on block size:

  • Small blocks are blocks with size below 32 kilobytes. Each cache has free lists pointing to free blocks of a given size. There are about 60 size classes spaced from the minimum size to 32 kilobytes, with spacing starting at 8 bytes for the smallest classes and ending at 256 bytes for the largest classes. The central heap has free lists of the same structure.

    Allocation is served from the local free lists. Since there are no block headers, the size is always rounded up to the nearest class. If the local exact fit fails, an adaptive number of free blocks is pulled from the central heap. If even the central heap exact fit fails, a new span is allocated and split into free blocks of the same size.

  • Large blocks are blocks with size above 32 kilobytes. These are allocated only on the central heap, which has 256 free span lists for sizes starting at 32 kilobytes in steps of 4 kilobytes, with the last list serving all larger sizes.

    Allocation uses best fit on the free lists. If the best fit fails, a new span is allocated.

When freeing a block, the allocator first identifies the span. Small blocks are returned to the thread cache of the current thread. If the number of free blocks exceeds an adaptive threshold, the cache is shrunk. Large blocks are merged with free neighbors and returned to the span list of the central heap.
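
The following sketch shows the LIFO free list mentioned above, with the next pointer stored in the first word of each free block; the names are illustrative.

#include <stddef.h>

// A LIFO free list stored within the free blocks themselves,
// the first word of each free block holds the next pointer.
typedef struct free_list {
  void *head;                           // first free block or NULL
} free_list_t;

static void free_list_push (free_list_t *list, void *block) {
  * (void **) block = list->head;       // store the next pointer inside the block
  list->head = block;
}

static void *free_list_pop (free_list_t *list) {
  void *block = list->head;
  if (block) list->head = * (void **) block;
  return (block);
}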

References. 

  1. Sanjay Ghemawat: TCMalloc: Thread-Caching Malloc. http://github.com/gperftools/gperftools/blob/master/docs/tcmalloc.html

3.2.3.1.4. Example: Hoard Allocator

To be done.

3.2.3.1.5. Example: Posix Heap Allocator Interface

To be done.

3.2.3.1.6. Example: Linux Kernel Slab Allocator

To be done.

// Create a slab cache
kmem_cache_t * kmem_cache_create (
  const char *name,
  size_t size,
  size_t align,
  unsigned long flags,
  void (* ctor) (void *, kmem_cache_t *, unsigned long),
  void (* dtor) (void *, kmem_cache_t *, unsigned long));
// Allocate and free objects from the cache
void *kmem_cache_alloc (kmem_cache_t *cachep, int flags);
void kmem_cache_free (kmem_cache_t *cachep, void *objp);
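
A usage sketch for this interface follows; the cache name and the object type are made up for illustration.

// A hypothetical object type allocated from a dedicated cache.
struct example { int payload; };

void example_usage (void) {
  // Create a cache of example objects, default alignment,
  // no special flags, no constructor, no destructor.
  kmem_cache_t *cachep = kmem_cache_create ("example_cache",
    sizeof (struct example), 0, 0, NULL, NULL);

  // Allocate one object from the cache and free it again.
  struct example *object = kmem_cache_alloc (cachep, GFP_KERNEL);
  if (object) kmem_cache_free (cachep, object);
}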

Two implementations of the allocator exist in the kernel, called SLAB and SLUB. The allocators differ in the way they keep track of slabs and objects, SLAB being more complex, SLUB being more streamlined. Usage statistics are available with both allocators.

> cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
ext4_inode_cache         49392      49392      1016           16              4
ext4_free_data             128        128        64           64              1
ext4_xattr                  92         92        88           46              1
blkdev_requests            273        273       376           21              2
blkdev_queue                30         30      2080           15              8
vm_area_struct            3520       3520       184           22              1
task_struct                160        160      1952           16              8
inode_cache              11899      11928       568           14              2
dentry                  401373     401373       192           21              1
...

[ This information is current for kernel 2.6.23. ]

References. 

  1. Jeff Bonwick: The Slab Allocator: An Object-Caching Kernel Memory Allocator. USENIX Summer 1994 Technical Conference.

3.2.3.2. Garbage Collectors

A traditional interface of a heap allocator offers functions for explicitly allocating and freeing blocks on the heap. Explicit allocating and freeing, however, is prone to memory leaks when a process fails to free an allocated block even though it no longer uses it. A garbage collector replaces the explicit freeing of blocks with an automatic freeing of blocks that are no longer used.

A garbage collector needs to recognize when a block on the heap is no longer used. A garbage collector determines whether a block is no longer used by determining whether it is reachable, that is, whether a process can follow a chain of references from statically or locally allocated variables, called roots, to reach the block.

Note that there is a difference between blocks that are no longer used and blocks that can no longer be used. This difference means that a garbage collector will fail to free blocks that can be used but are no longer used. In other words, a garbage collector exchanges the burden of having to explicitly free dynamically allocated variables for the burden of having to discard references to unused dynamically allocated variables. Normally, this is a benefit, because while freeing variables is always explicit, discarding references is often implicit.

3.2.3.2.1. Reference Tracing

Reference tracing algorithms. Copying. Mark and sweep. Mark and compact.
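
As an illustration of the mark and sweep variant, a minimal sketch follows; the object layout, the fixed number of reference fields and the list of all allocated objects are hypothetical simplifications.

#include <stdlib.h>

#define REFERENCES 2                    // reference fields per object
#define ROOTS 16                        // statically known root references

typedef struct object {
  int marked;                           // set during the mark phase
  struct object *references [REFERENCES]; // outgoing references or NULL
  struct object *all_next;              // list of all allocated objects
} object_t;

static object_t *all_objects;           // every allocated object
static object_t *roots [ROOTS];         // the root references

// Mark recursively follows references from a reachable object.
static void mark (object_t *object) {
  if (!object || object->marked) return;
  object->marked = 1;
  for (int i = 0; i < REFERENCES; i ++) mark (object->references [i]);
}

// Sweep frees every object that was not reached from the roots.
static void collect (void) {
  for (int i = 0; i < ROOTS; i ++) mark (roots [i]);
  object_t **link = &all_objects;
  while (*link) {
    object_t *object = *link;
    if (object->marked) {
      object->marked = 0;               // reset the mark for the next collection
      link = &object->all_next;
    } else {
      *link = object->all_next;         // unlink and free the garbage object
      free (object);
    }
  }
}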

3.2.3.2.2. Reference Counting

Reference counting algorithms. Cycles. Distribution.
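
A minimal reference counting sketch with hypothetical names follows; note that objects forming a reference cycle keep their counts positive and are therefore never freed.

#include <stdlib.h>

typedef struct object {
  int count;                            // number of references to this object
} object_t;

// Acquiring a reference increments the count.
static object_t *object_acquire (object_t *object) {
  if (object) object->count ++;
  return (object);
}

// Releasing the last reference frees the object immediately.
static void object_release (object_t *object) {
  if (object && -- object->count == 0) free (object);
}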

3.2.3.2.3. Distinguishing Generations

It has been observed that objects differ in lifetime. In particular, many young objects die quickly, while some old objects never die. Separating objects into generations therefore makes it possible to collect one generation at a time, typically collecting the younger generation frequently with a copying collector and the older generation seldom with a mark and sweep collector. Collecting one generation at a time requires keeping remembered sets of references from other generations. Typically, all generations below a certain age are collected together, therefore only references from older to younger generations need to be kept in the remembered sets.

[ Dave Ungar: Generation Scavenging: A Non-Disruptive High Performance Storage Reclamation Algorithm ] [ Richard E. Jones, Rafael Lins: Garbage Collection: Algorithms for Automatic Dynamic Memory Management ]
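
Maintaining the remembered sets requires intercepting reference stores, typically with a write barrier. A simplified sketch with two generations and hypothetical names follows, bounds checking of the remembered set omitted.

typedef struct object {
  int generation;                       // 0 for young, 1 for old
} object_t;

static object_t *remembered_set [1024]; // old objects referencing young objects
static int remembered_count;            // bounds checking omitted for brevity

// Every store of a reference goes through the barrier, which records
// stores that create a reference from an older to a younger generation.
static void store_reference (object_t *source, object_t **field, object_t *target) {
  if (target && source->generation > target->generation)
    remembered_set [remembered_count ++] = source;
  *field = target;
}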

3.2.3.2.4. Additional Observations

Note that having garbage collection may simplify heap management. Copying and compacting tends to keep the heap in a single block, making it possible to always allocate new objects at the end of the heap, so that allocation is potentially as simple as a single pointer addition operation. Similarly, tracing does not concern dead objects, making deallocation potentially an empty operation. All of this gets a bit more complicated when destructors become involved though, for a call to a destructor is not an empty operation.
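
The allocation fast path of such a collector can be sketched as follows, with hypothetical names; a real implementation would trigger a collection and retry instead of simply failing.

#include <stddef.h>

static char *heap_top;                  // first free byte on the heap
static char *heap_end;                  // end of the heap memory

// Allocation merely advances a pointer within the single heap block.
static void *gc_alloc (size_t size) {
  if ((size_t) (heap_end - heap_top) < size) return (NULL);
  void *block = heap_top;
  heap_top += size;
  return (block);
}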

The asynchronous nature of calls to destructors makes them unsuitable for code that frees contended resources. A strict enforcement of referential integrity also requires garbage collection to handle situations where a call to a destructor associated with an unreachable block makes that block reachable again.



[1] The text uses specific constants. They should be taken as rough guidelines, particular allocator configurations may differ.