2.1.2. Starting A Process

Starting a process means loading the program and initializing the state. The program typically expects to begin executing from a specific instruction with only the static variables initialized. The program then initializes the local and heap variables as it executes. Starting a process therefore boils down to loading the program code and the static variables, together called the program image, and setting the position of the currently executing instruction within the program to the instruction where the program expects to begin executing.

2.1.2.1. Bootstrapping

The first catch to starting a process comes with the question of who loads the program image. The typical solution of having a process load the program image of another process gets us to the question of who loads the program image of the very first process to be started. This process is called the bootstrap process and the act of starting the bootstrap process is called bootstrapping.

The program image of the bootstrap process is typically stored in the fixed memory of the computer by the manufacturer. Any of the ROM, PROM, EEPROM or FLASH type memory chips, which keep their contents even with the power switched off, can be used for this purpose. The processor of the computer is hardwired to start executing instructions from a specific address when the power is switched on, the fixed memory with the program image of the bootstrap process is therefore hardwired to reside on the same address.

Computers designed for a specific application purpose can have that purpose implemented by the bootstrap process. Such approach, however, would be too limiting for computers designed for general use, which is why the bootstrap process typically only initializes the hardware and starts another process, whose program image is loaded from whatever source the bootstrap process supports.

2.1.2.1.1. Example: Booting IBM PC

The IBM PC line of computers uses the Intel 80x86 line of processors, which start executing from address FFF...FFF0h (exact address depending on the address bus width and hence on the processor model). A fixed memory with BIOS program image resides at that address. The BIOS process initializes and tests the hardware of the computer as necessary and looks for the next process to start.

In the early models of the IBM PC line of computers, the BIOS process expected the program image of the next process to start to reside in the first sector of the first disk connected to the computer, have exactly 512 bytes in size and end with a two byte signature of 55AAh. The BIOS process loaded the program image into memory at address 7C00h and if the two byte signature was present, the BIOS process then begun executing the next process to start from address 7C00h.

In many cases, the fixed size of 512 bytes is too small for the program image of the next process to start. The next process to start is therefore yet another bootstrap process, which loads the program image of the next process to start. This repeats until the operating system itself is loaded. The reason for having a sequence of bootstrap processes rather than a single bootstrap process that loads the operating system straight away is that loading the program image of the operating system requires knowing the structure of the program image both on disk and in memory. This structure depends on the operating system itself and hardcoding the knowledge of the structure in a single bootstrap process which resides in fixed memory would limit the ability of the computer to load an arbitrary operating system.

2.1.2.1.2. Example: Booting UEFI

To be done.

2.1.2.2. Relocating

The act of loading a program image is further complicated by the fact that the program image may have been constructed presuming that it will reside at a specific range of addresses, and may contain machine code instructions or static variables that refer to specific addresses from that range, using what is denoted as absolute addressing.

Figure 2.1. Absolute Addressing Example

Declaring and accessing a global variable in C.

static int i;           // declare a global variable
...
i = 0x12345678;         // access the global variable

The C code compiled into Intel 80x86 assembler.

.comm i,4,4             ;declare i as 4 bytes aligned at 4 bytes boundary
...
movl $0x12345678,i      ;write value 12345678h into target address i

The assembler code compiled into Intel 80x86 machine code.

C705                    ;movl
C0950408                ;target address 080495C0h
78563412                ;value 12345678h

When a program image uses absolute addressing, it must be loaded at the specific range of addresses it has been constructed for. Unfortunately, it is often necessary to load program images at arbitrary ranges of addresses, for example when multiple program images are to share one address space. This requires adjusting the program image by fixing all machine code instructions and static variables that refer to specific addresses using absolute addressing. This process is called relocation.

The need for relocation can be alleviated by replacing absolute addressing, which stores addresses as absolute locations in machine code instructions, with relative addressing, which stores addresses as relative distances in machine code instructions. The program image is said to contain position independent code when it does not need relocation. Constructing position independent code usually comes at a price, however, because in some cases, relative addressing might be more cumbersome than absolute addressing.

Figure 2.2. Relative Addressing Example

Declaring and accessing a global variable in C.

static int i;           // declare a global variable
...
i = 0;                  // access the global variable

The C code compiled into position independent Intel 80x86 assembler.

.comm i,4,4             ;declare i as 4 bytes aligned at 4 bytes boundary
...
call __get_thunk        ;get program starting address in ECX
addl $_GOT_,%ecx        ;calculate address of global table of addresses in ECX
movl $0,i@GOT(%ecx)     ;write value 0 into target address i relative from ECX

The assembler code compiled into position independent Intel 80x86 machine code.

E8                      ;call
1C000000                ;target address 0000001Ch distant from here
81C1                    ;addl target ECX
D9110000                ;value 000011D9h
C781                    ;movl target address relative from ECX
20000000                ;target address 00000020h distant from ECX
00000000                ;value 00000000h

2.1.2.2.1. Example: Program Image In CP/M

CP/M avoids the need for relocation by expecting a program to always start at the same address, namely 100h. The file with the program image consisted of the program code and the static variables, stored exactly in the same form as when the program is executing in memory.

2.1.2.2.2. Example: Program Image In Intel HEX

When the program image was to be stored on media that was originally designed for storing text, such as some types of paper tapes or punch cards, formats such as Intel HEX were used. A program image in Intel HEX consisted of lines starting with a colon and followed by a string of hexadecimal digits.

:LLAAAATTxxxxCC

In the string, LL is the length of the data, AAAA is the address of the data in memory, TT indicates the last line, the data itself is followed by checksum CC. The data consists of the program code and the static variables, stored exactly in the same form as when the program is executing in memory.

2.1.2.2.3. Example: Program Image In DOS

For small programs, DOS employs the segmentation support of the Intel 80x86 processors to avoid the need for relocation. A program is expected to fit into a single segment and to always start at the same address within the segment, namely 100h. The file with the program image consisted of the program code and the static variables, stored exactly in the same form as when the program is executing in memory.

For large programs, DOS introduced the EXE format. Besides the program code and the static variables, the file with the program image also contained a relocation table. The relocation table was a simple list of locations within the program image that need to be adjusted, the adjustment being a simple addition of the program base address to the location. Besides the relocation table, the header of the file also contained the required memory size and the relative starting address.

Figure 2.3. DOS EXE Format

Offset  Length  Contents
----------------------------------------------------------------------------
00h     2       Magic (0AA55h)
02h     2       Length of last block
04h     2       Length of file in 512B blocks (L)
06h     2       Number of relocation table entries (R)
08h     2       Length of header in 16B blocks (H)
0Ah     2       Minimum memory beyond program image in 16B blocks
0Ch     2       Maximum memory beyond program image in 16B blocks
0Eh     4       Initial stack pointer setting (SS:SP)
12h     2       File checksum
14h     4       Initial program counter setting (CS:IP)
18h     2       Offset of relocation table (1Ch)
1Ah     2       Overlay number
1Ch     R*4h    Relocation table entries
H*10h   L*200h  Program image

2.1.2.3. Linking

It is common for many programs to share the same libraries. The libraries can be linked to the program either statically, during compilation, or dynamically, during execution. Both approaches can have advantages, static linking creates independent program images robust to system upgrades, dynamic linking creates small program images efficient in memory usage. Both approaches require program image formats that support linking by listing exported symbols that the program image provides and external symbols that the program image requires.

2.1.2.3.1. Example: Executable And Linking Format

ELF is a binary executable file format originally developed for UNIX System V. Gradually, it became the binary executable file format of choice for most UNIX systems. It supports multiple system architectures and can store both modules for linking and programs for execution. It can also carry supplemental information such as debugging data.

An ELF file partitions data into sections and segments. Sections provide logical separation - there are separate sections for code, symbols, relocations, debugging, and so on. Segments provide structure for execution - there are segments to be loaded in memory, segments for dynamic linking, and so on. Sections and segments are defined in header tables that point to data elsewhere in the file. Sections and segments can also point to the same data and thus overlap.

2.1.2.3.1.1. Headers

The ELF header is the very first part of an ELF file and describes the structure of the file. Besides the usual magic number that identifies the ELF format, it tells the exact type of the file including address size and data encoding, and the processor architecture that the program is for. Other important fields include the starting address of the program.

> readelf --file-header /bin/bash

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x41d238
  Start of program headers:          64 (bytes into file)
  Start of section headers:          964960 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         10
  Size of section headers:           64 (bytes)
  Number of section headers:         32
  Section header string table index: 31
> readelf --file-header /lib/libc.so.6

ELF Header:
  Magic:   7f 45 4c 46 01 01 01 03 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - GNU
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x44564790
  Start of program headers:          52 (bytes into file)
  Start of section headers:          2009952 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         10
  Size of section headers:           40 (bytes)
  Number of section headers:         43
  Section header string table index: 42
2.1.2.3.1.2. Sections

The section header table lists all the sections of the file. Each section has a name, a type, a position and length within the file, and flags.

> readelf --sections /lib/libc.so.6

There are 43 section headers, starting at offset 0x1eab60:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
...
  [ 9] .rel.dyn          REL             4455f3e4 0143e4 002a18 08   A  4   0  4
  [10] .rel.plt          REL             44561dfc 016dfc 000058 08   A  4  11  4
  [11] .plt              PROGBITS        44561e60 016e60 0000c0 04  AX  0   0 16
  [12] .text             PROGBITS        44561f20 016f20 14010c 00  AX  0   0 16
...
  [32] .data             PROGBITS        446f9040 1ad040 000e7c 00  WA  0   0 32
  [33] .bss              NOBITS          446f9ec0 1adebc 002bfc 00  WA  0   0 32
  [34] .comment          PROGBITS        00000000 1adebc 00002c 01  MS  0   0  1
  [35] .note.stapsdt     NOTE            00000000 1adee8 0002c4 00      0   0  4
  [36] .symtab           SYMTAB          00000000 1ae1ac 021880 10     37 6229  4
  [37] .strtab           STRTAB          00000000 1cfa2c 01a786 00      0   0  1
...
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

Important sections typically present in a file include:

bss

A section that represents the uninitialized memory of the program.

data

A section that contains the static variables.

text

A section that contains the program code.

init and fini

Sections that contain the program code responsible for initialization and termination.

symtab and dynsym

Sections that contain the symbol tables for static and dynamic linking. Symbol tables list symbols defined by the program. A symbol is defined by its name, value, size and type.

strtab and dynstr

Sections that contain the string tables for static and dynamic linking. String tables list all strings used in the file. A string is referred to using its index in the table rather than quoted.

rel

Sections that contain the relocation information. Relocations are defined by their position, size and type.

> readelf --relocs /lib/libc.so.6

Relocation section '.rel.dyn' at offset 0x134e4 contains 1457 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
436f71b0  00000008 R_386_RELATIVE
436f8e74  0000000e R_386_TLS_TPOFF
436f8e90  00058206 R_386_GLOB_DAT    436f9d7c   stderr
436f9008  0004de07 R_386_JUMP_SLOT   435c3b70   malloc
...
2.1.2.3.1.3. Segments

The program header table lists all the segments of the file. Each segment has a type, a position and length in the file, an address and length in memory, and flags. The content of a segment is made up of sections. Examples of important types include:

loadable

A segment that will be loaded into memory. Data within a loadable segment must be aligned to page size so that the segment can be mapped rather than loaded.

dynamic

A segment that contains the dynamic linking information. The segment is used by the dynamic loader that is specified in the interpreter segment.

interpreter

A segment that identifies the program interpreter. When specified, the program interpreter is loaded into memory. The program interpreter is responsible for executing the program itself.

> readelf --segments /bin/bash

Elf file type is EXEC (Executable file)
Entry point 0x41d238
There are 10 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x0000000000000230 0x0000000000000230  R E    8
  INTERP         0x0000000000000270 0x0000000000400270 0x0000000000400270
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x00000000000d9cd4 0x00000000000d9cd4  R E    200000
  DYNAMIC        0x00000000000d9df0 0x00000000006d9df0 0x00000000006d9df0
                 0x00000000000001f0 0x00000000000001f0  RW     8
  STACK          0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     8
...

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note .dynsym .rela .init .fini .plt .text ...
...
> readelf --dynamic /bin/bash

Dynamic section at offset 0xd9df0 contains 30 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libtinfo.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x41adb0
 0x000000000000000d (FINI)               0x4a6374
 0x0000000000000019 (INIT_ARRAY)         0x6d9dd8
 0x0000000000000005 (STRTAB)             0x8e2b30
 0x0000000000000006 (SYMTAB)             0x403988
...
2.1.2.3.1.4. Miscellanea

The ELF file format also supports special techniques used to minimize the number of pages modified during relocation and linking. These include using global offset table and procedure linkage table.

The global offset table is a table created by the dynamic linker that lists all the absolute addresses that the program needs to access. Rather than accessing the absolute addresses directly, the program uses relative addressing to read the address in the global offset table.

The procedure linkage table is a table created by the dynamic linker that wraps all the absolute addresses that the program needs to call. Rather than calling the absolute addresses directly, the program uses relative addressing to call the wrapper in the procedure linkage table.

2.1.2.4. Calling Operating System

A process needs a way to request services of the operating system. The services provided by the operating system are often similar to the services provided by the libraries, and, in the simplest case, are also called similarly. The situation becomes more complex when access to protected resources, such as hardware devices or application memory, must be considered.

Typically, the privileges that govern access to protected resources are enforced by the processor. Depending on the privileges of the currently executing code, the processor decides whether to allow executing instructions that are used to access the protected resources. To prevent malicious code from accessing protected resources, constraints are imposed on the means by which the currently executing code can change its privileges, as well as the means by which less privileged code can call more privileged code.

The operating system posesses various privileges that allow it to access protected resources. Requesting services of the operating system therefore means calling more privileged code from less privileged code and must be subject to the constraints that prevent malicious code from accessing protected resources. These constraints are met by the system call interface of the operating system.

  • The system call interface must be efficient. Depending on the processor, this can become an issue especially when calling more privileged code from less privileged code, because the constraints imposed by the processor can make the call slow or make the copying of arguments necessary.

  • The system call interface must be robust. Malicious code should not be able to trick the operating system into accessing protected resources on its behalf or into denying service.

  • The system call interface must be flexible. Features such as wrapping or monitoring services provided by the operating system should be available. Adding new services without changing the system call interface for the existing services should be possible.

Note that services provided through the system call interface are typically wrapped by libraries and thus look as services provided by libraries. This makes it possible to call all services in a uniform way.

2.1.2.4.1. Example: CP/M System Call Interface

CP/M run on processors that did not distinguish any privileges. Its system call interface therefore did not have to solve many of the issues related to efficiency and robustness that concern contemporary systems. Instead, the system call interface has been designed with binary compatibility in mind.

When calling the BDOS module, the application placed a number identifying the requested service in register C, other arguments of the requested service in other registers, and called the BDOS entry point at address 5. The entry point contained a jump to the real BDOS entry point which could differ from system to system. The services provided by BDOS included console I/O and FCB based file operations.

Figure 2.4. CP/M BDOS System Call Example

ReadKey:    mvi     c,1         ; keyboard read service
            call    5           ; call BDOS entry point
            cpi     a,0Dh       ; is returned key code ENTER ?
            jnz     ReadKey     ; repeat keyboard read until it is

When calling the BIOS module, the application placed arguments of the requested service in registers and called the BIOS entry point for the specific service. The entry point for the specific service could differ from system to system, but its distance from the beginning of the BIOS module was the same for all systems.

Figure 2.5. CP/M BIOS System Call Entry Points

jmp     BOOT            ;cold boot
jmp     WBOOT           ;warm boot
jmp     CONST           ;console status
jmp     CONIN           ;console input
...
jmp     HOME            ;disk head to track 0
...
jmp     SETDMA          ;set memory transfer address
jmp     READ            ;read sector
jmp     WRITE           ;write sector

2.1.2.4.2. Example: Intel 80x86 Processor Privileges

The Intel 80x86 processors traditionally serve as an example of why calling more privileged code from less privileged code can be slow:

  • On Intel 80286, an average MOV instruction took 2 clock cycles to execute. A call that changed the privilege level took over 80 clock cycles to execute. A call that switched the task took over 180 clock cycles to execute.

  • On Intel 80386, an average MOV instruction took 2 clock cycles to execute. A call that changed the privilege level took over 80 clock cycles to execute. A call that switched the task took over 300 clock cycles to execute.

Modern Pentium processors introduce the SYSENTER and SYSEXIT instructions for efficient implementation of the system call interface:

  • The SYSENTER instruction sets the stack pointer and instruction pointer registers to values specified by the operating system in advance to point to the operating system code executing at the most privileged level.

  • The SYSEXIT instruction sets the stack pointer and instruction pointer registers to values specified by the operating system in registers ECX and EDX to point to the application code executing at the least privileged level.

Note that the SYSENTER and SYSEXIT instructions do not form a complementary pair that would take care of saving and restoring the stack pointer and instruction pointer registers the way CALL and RET instructions do. It is up to the code using the SYSENTER and SYSEXIT instructions to do that.

2.1.2.4.3. Example: Linux System Call API On Intel 80x86

The libraries wrapping the system call interface are called in the same way as any other libraries.

Figure 2.6. Library System Call Example

ssize_t read (int fd, void *buf, size_t count);
...
int hFile;
ssize_t iCount;
char abBuffer [1024];
iCount = read (hFile, &abBuffer, sizeof (abBuffer));

pushl $1024             ;sizeof (abBuffer)
pushl $abBuffer         ;&abBuffer
pushl hFile             ;hFile
call  read@plt          ;call the library
addl  $12,%esp          ;remove arguments from stack
movl  %eax,iCount       ;save result

The system call interface uses either the interrupt vector 80h, which is configured to lead to the kernel through a trap gate, or the SYSENTER and SYSEXIT instructions. In both cases, the EAX register contains a number identifying the requested service and other registers contain other arguments of the requested service.

Since the system call interface is typically called from within the system libraries, having two versions of the system call code would require having two versions of the libraries that contain the system call code. To avoid this redundancy, the system call interface is wrapped by a virtual library called linux-gate, which does not exist as a file, but is inserted by the kernel into the memory map of every process.

Figure 2.7. Linux Gate Library Based On INT 80h

__kernel_vsyscall: int $0x80
                   ret

Figure 2.8. Linux Gate Library Based On SYSENTER And SYSEXIT

__kernel_vsyscall: push %ecx
                   push %edx
                   push %ebp
__resume:          mov  %esp,%ebp
                   sysenter

                   jmp  __resume        ;hack for syscall resume

__return:          pop  %ebp            ;this is where
                   pop  %edx             ;the SYSEXIT
                   pop  %ecx              ;returns
                   ret

References. 

  1. Johan Petersson: What Is linux-gate.so.1 ? http://www.trilithium.com/johan/2005/08/linux-gate

  2. Linus Torvalds: System Call Restart. http://lkml.org/lkml/2002/12/18/218

2.1.2.4.4. Example: Linux Syslet API

A simplified example of reading a file using syslets is copied from Molnar.

References. 

  1. Ingo Molnar: Syslet and Threadlet Patches. http://people.redhat.com/mingo/syslet-patches

2.1.2.4.5. Example: Windows System Call API On Intel 80x86

The libraries wrapping the system call interface are called in the same way as any other libraries.

Figure 2.9. Library System Call Example

int MessageBox (
    HWND hwndOwner,
    LPCTSTR lpszText,
    LPCTSTR lpszTitle,
    UINT fuStyle);
...
MessageBox (0, zMessageText, zWindowTitle, MB_OK || MB_SYSTEMMODAL || MB_ICONHAND);

push MBOK or MB_SYSTEMMODAL or MB_ICONHAND
push offset zWindowTitle
push offset zMessageTest
push 0
call MessageBoxA                ;call the library
add  esp,16                     ;remove arguments from stack

The system call interface uses either the interrupt vector 2Eh or the SYSENTER and SYSEXIT instructions. In both cases, the EAX register contains a number identifying the requested service and the EDX register points to a stack frame holding arguments of the requested service.