Starting a process means loading the program and initializing the state. The program typically expects to begin executing from a specific instruction with only the static variables initialized. The program then initializes the local and heap variables as it executes. Starting a process therefore boils down to loading the program code and the static variables, together called the program image, and setting the position of the currently executing instruction within the program to the instruction where the program expects to begin executing.
The first catch to starting a process comes with the question of who loads the program image. The typical solution of having a process load the program image of another process gets us to the question of who loads the program image of the very first process to be started. This process is called the bootstrap process and the act of starting the bootstrap process is called bootstrapping.
The program image of the bootstrap process is typically stored in the fixed memory of the computer by the manufacturer. Any of the ROM, PROM, EEPROM or FLASH type memory chips, which keep their contents even with the power switched off, can be used for this purpose. The processor of the computer is hardwired to start executing instructions from a specific address when the power is switched on, the fixed memory with the program image of the bootstrap process is therefore hardwired to reside on the same address.
Computers designed for a specific application purpose can have that purpose implemented by the bootstrap process. Such approach, however, would be too limiting for computers designed for general use, which is why the bootstrap process typically only initializes the hardware and starts another process, whose program image is loaded from whatever source the bootstrap process supports.
The IBM PC line of computers uses the Intel 80x86 line of processors, which start executing from address FFF...FFF0h (exact address depending on the address bus width and hence on the processor model). A fixed memory with BIOS program image resides at that address. The BIOS process initializes and tests the hardware of the computer as necessary and looks for the next process to start.
In the early models of the IBM PC line of computers, the BIOS process expected the program image of the next process to start to reside in the first sector of the first disk connected to the computer, have exactly 512 bytes in size and end with a two byte signature of 55AAh. The BIOS process loaded the program image into memory at address 7C00h and if the two byte signature was present, the BIOS process then begun executing the next process to start from address 7C00h.
In many cases, the fixed size of 512 bytes is too small for the program image of the next process to start. The next process to start is therefore yet another bootstrap process, which loads the program image of the next process to start. This repeats until the operating system itself is loaded. The reason for having a sequence of bootstrap processes rather than a single bootstrap process that loads the operating system straight away is that loading the program image of the operating system requires knowing the structure of the program image both on disk and in memory. This structure depends on the operating system itself and hardcoding the knowledge of the structure in a single bootstrap process which resides in fixed memory would limit the ability of the computer to load an arbitrary operating system.
The act of loading a program image is further complicated by the fact that the program image may have been constructed presuming that it will reside at a specific range of addresses, and may contain machine code instructions or static variables that refer to specific addresses from that range, using what is denoted as absolute addressing.
Figure 2.1. Absolute Addressing Example
Declaring and accessing a global variable in C.
static int i; // declare a global variable ... i = 0x12345678; // access the global variable
The C code compiled into Intel 80x86 assembler.
.comm i,4,4 ;declare i as 4 bytes aligned at 4 bytes boundary ... movl $0x12345678,i ;write value 12345678h into target address i
The assembler code compiled into Intel 80x86 machine code.
C705 ;movl C0950408 ;target address 080495C0h 78563412 ;value 12345678h
When a program image uses absolute addressing, it must be loaded at the specific range of addresses it has been constructed for. Unfortunately, it is often necessary to load program images at arbitrary ranges of addresses, for example when multiple program images are to share one address space. This requires adjusting the program image by fixing all machine code instructions and static variables that refer to specific addresses using absolute addressing. This process is called relocation.
The need for relocation can be alleviated by replacing absolute addressing, which stores addresses as absolute locations in machine code instructions, with relative addressing, which stores addresses as relative distances in machine code instructions. The program image is said to contain position independent code when it does not need relocation. Constructing position independent code usually comes at a price, however, because in some cases, relative addressing might be more cumbersome than absolute addressing.
Figure 2.2. Relative Addressing Example
Declaring and accessing a global variable in C.
static int i; // declare a global variable ... i = 0; // access the global variable
The C code compiled into position independent Intel 80x86 assembler.
.comm i,4,4 ;declare i as 4 bytes aligned at 4 bytes boundary ... call __get_thunk ;get program starting address in ECX addl $_GOT_,%ecx ;calculate address of global table of addresses in ECX movl $0,i@GOT(%ecx) ;write value 0 into target address i relative from ECX
The assembler code compiled into position independent Intel 80x86 machine code.
E8 ;call 1C000000 ;target address 0000001Ch distant from here 81C1 ;addl target ECX D9110000 ;value 000011D9h C781 ;movl target address relative from ECX 20000000 ;target address 00000020h distant from ECX 00000000 ;value 00000000h
CP/M avoids the need for relocation by expecting a program to always start at the same address, namely 100h. The file with the program image consisted of the program code and the static variables, stored exactly in the same form as when the program is executing in memory.
When the program image was to be stored on media that was originally designed for storing text, such as some types of paper tapes or punch cards, formats such as Intel HEX were used. A program image in Intel HEX consisted of lines starting with a colon and followed by a string of hexadecimal digits.
:LLAAAATTxxxxCC
In the string, LL is the length of the data, AAAA is the address of the data in memory, TT indicates the last line, the data itself is followed by checksum CC. The data consists of the program code and the static variables, stored exactly in the same form as when the program is executing in memory.
For small programs, DOS employs the segmentation support of the Intel 80x86 processors to avoid the need for relocation. A program is expected to fit into a single segment and to always start at the same address within the segment, namely 100h. The file with the program image consisted of the program code and the static variables, stored exactly in the same form as when the program is executing in memory.
For large programs, DOS introduced the EXE format. Besides the program code and the static variables, the file with the program image also contained a relocation table. The relocation table was a simple list of locations within the program image that need to be adjusted, the adjustment being a simple addition of the program base address to the location. Besides the relocation table, the header of the file also contained the required memory size and the relative starting address.
Figure 2.3. DOS EXE Format
Offset Length Contents ---------------------------------------------------------------------------- 00h 2 Magic (0AA55h) 02h 2 Length of last block 04h 2 Length of file in 512B blocks (L) 06h 2 Number of relocation table entries (R) 08h 2 Length of header in 16B blocks (H) 0Ah 2 Minimum memory beyond program image in 16B blocks 0Ch 2 Maximum memory beyond program image in 16B blocks 0Eh 4 Initial stack pointer setting (SS:SP) 12h 2 File checksum 14h 4 Initial program counter setting (CS:IP) 18h 2 Offset of relocation table (1Ch) 1Ah 2 Overlay number 1Ch R*4h Relocation table entries H*10h L*200h Program image
It is common for many programs to share the same libraries. The libraries can be linked to the program either statically, during compilation, or dynamically, during execution. Both approaches can have advantages, static linking creates independent program images robust to system upgrades, dynamic linking creates small program images efficient in memory usage. Both approaches require program image formats that support linking by listing exported symbols that the program image provides and external symbols that the program image requires.
ELF is a binary executable file format originally developed for UNIX System V. Gradually, it became the binary executable file format of choice for most UNIX systems. It supports multiple system architectures and can store both modules for linking and programs for execution. It can also carry supplemental information such as debugging data.
An ELF file partitions data into sections and segments. Sections provide logical separation - there are separate sections for code, symbols, relocations, debugging, and so on. Segments provide structure for execution - there are segments to be loaded in memory, segments for dynamic linking, and so on. Sections and segments are defined in header tables that point to data elsewhere in the file. Sections and segments can also point to the same data and thus overlap.
The ELF header is the very first part of an ELF file and describes the structure of the file. Besides the usual magic number that identifies the ELF format, it tells the exact type of the file including address size and data encoding, and the processor architecture that the program is for. Other important fields include the starting address of the program.
> readelf --file-header /bin/bash ELF Header: Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 Class: ELF64 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: Advanced Micro Devices X86-64 Version: 0x1 Entry point address: 0x41d238 Start of program headers: 64 (bytes into file) Start of section headers: 964960 (bytes into file) Flags: 0x0 Size of this header: 64 (bytes) Size of program headers: 56 (bytes) Number of program headers: 10 Size of section headers: 64 (bytes) Number of section headers: 32 Section header string table index: 31
> readelf --file-header /lib/libc.so.6 ELF Header: Magic: 7f 45 4c 46 01 01 01 03 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - GNU ABI Version: 0 Type: DYN (Shared object file) Machine: Intel 80386 Version: 0x1 Entry point address: 0x44564790 Start of program headers: 52 (bytes into file) Start of section headers: 2009952 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 10 Size of section headers: 40 (bytes) Number of section headers: 43 Section header string table index: 42
The section header table lists all the sections of the file. Each section has a name, a type, a position and length within the file, and flags.
> readelf --sections /lib/libc.so.6 There are 43 section headers, starting at offset 0x1eab60: Section Headers: [Nr] Name Type Addr Off Size ES Flg Lk Inf Al ... [ 9] .rel.dyn REL 4455f3e4 0143e4 002a18 08 A 4 0 4 [10] .rel.plt REL 44561dfc 016dfc 000058 08 A 4 11 4 [11] .plt PROGBITS 44561e60 016e60 0000c0 04 AX 0 0 16 [12] .text PROGBITS 44561f20 016f20 14010c 00 AX 0 0 16 ... [32] .data PROGBITS 446f9040 1ad040 000e7c 00 WA 0 0 32 [33] .bss NOBITS 446f9ec0 1adebc 002bfc 00 WA 0 0 32 [34] .comment PROGBITS 00000000 1adebc 00002c 01 MS 0 0 1 [35] .note.stapsdt NOTE 00000000 1adee8 0002c4 00 0 0 4 [36] .symtab SYMTAB 00000000 1ae1ac 021880 10 37 6229 4 [37] .strtab STRTAB 00000000 1cfa2c 01a786 00 0 0 1 ... Key to Flags: W (write), A (alloc), X (execute), M (merge), S (strings) I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown) O (extra OS processing required) o (OS specific), p (processor specific)
Important sections typically present in a file include:
A section that represents the uninitialized memory of the program.
A section that contains the static variables.
A section that contains the program code.
Sections that contain the program code responsible for initialization and termination.
Sections that contain the symbol tables for static and dynamic linking. Symbol tables list symbols defined by the program. A symbol is defined by its name, value, size and type.
Sections that contain the string tables for static and dynamic linking. String tables list all strings used in the file. A string is referred to using its index in the table rather than quoted.
Sections that contain the relocation information. Relocations are defined by their position, size and type.
> readelf --relocs /lib/libc.so.6 Relocation section '.rel.dyn' at offset 0x134e4 contains 1457 entries: Offset Info Type Sym.Value Sym. Name 436f71b0 00000008 R_386_RELATIVE 436f8e74 0000000e R_386_TLS_TPOFF 436f8e90 00058206 R_386_GLOB_DAT 436f9d7c stderr 436f9008 0004de07 R_386_JUMP_SLOT 435c3b70 malloc ...
The program header table lists all the segments of the file. Each segment has a type, a position and length in the file, an address and length in memory, and flags. The content of a segment is made up of sections. Examples of important types include:
A segment that will be loaded into memory. Data within a loadable segment must be aligned to page size so that the segment can be mapped rather than loaded.
A segment that contains the dynamic linking information. The segment is used by the dynamic loader that is specified in the interpreter segment.
A segment that identifies the program interpreter. When specified, the program interpreter is loaded into memory. The program interpreter is responsible for executing the program itself.
> readelf --segments /bin/bash Elf file type is EXEC (Executable file) Entry point 0x41d238 There are 10 program headers, starting at offset 64 Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040 0x0000000000000230 0x0000000000000230 R E 8 INTERP 0x0000000000000270 0x0000000000400270 0x0000000000400270 0x000000000000001c 0x000000000000001c R 1 [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2] LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000 0x00000000000d9cd4 0x00000000000d9cd4 R E 200000 DYNAMIC 0x00000000000d9df0 0x00000000006d9df0 0x00000000006d9df0 0x00000000000001f0 0x00000000000001f0 RW 8 STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 RW 8 ... Section to Segment mapping: Segment Sections... 00 01 .interp 02 .interp .note .dynsym .rela .init .fini .plt .text ... ...
> readelf --dynamic /bin/bash Dynamic section at offset 0xd9df0 contains 30 entries: Tag Type Name/Value 0x0000000000000001 (NEEDED) Shared library: [libtinfo.so.5] 0x0000000000000001 (NEEDED) Shared library: [libdl.so.2] 0x0000000000000001 (NEEDED) Shared library: [libc.so.6] 0x000000000000000c (INIT) 0x41adb0 0x000000000000000d (FINI) 0x4a6374 0x0000000000000019 (INIT_ARRAY) 0x6d9dd8 0x0000000000000005 (STRTAB) 0x8e2b30 0x0000000000000006 (SYMTAB) 0x403988 ...
The ELF file format also supports special techniques used to minimize the number of pages modified during relocation and linking. These include using global offset table and procedure linkage table.
The global offset table is a table created by the dynamic linker that lists all the absolute addresses that the program needs to access. Rather than accessing the absolute addresses directly, the program uses relative addressing to read the address in the global offset table.
The procedure linkage table is a table created by the dynamic linker that wraps all the absolute addresses that the program needs to call. Rather than calling the absolute addresses directly, the program uses relative addressing to call the wrapper in the procedure linkage table.
A process needs a way to request services of the operating system. The services provided by the operating system are often similar to the services provided by the libraries, and, in the simplest case, are also called similarly. The situation becomes more complex when access to protected resources, such as hardware devices or application memory, must be considered.
Typically, the privileges that govern access to protected resources are enforced by the processor. Depending on the privileges of the currently executing code, the processor decides whether to allow executing instructions that are used to access the protected resources. To prevent malicious code from accessing protected resources, constraints are imposed on the means by which the currently executing code can change its privileges, as well as the means by which less privileged code can call more privileged code.
The operating system posesses various privileges that allow it to access protected resources. Requesting services of the operating system therefore means calling more privileged code from less privileged code and must be subject to the constraints that prevent malicious code from accessing protected resources. These constraints are met by the system call interface of the operating system.
The system call interface must be efficient. Depending on the processor, this can become an issue especially when calling more privileged code from less privileged code, because the constraints imposed by the processor can make the call slow or make the copying of arguments necessary.
The system call interface must be robust. Malicious code should not be able to trick the operating system into accessing protected resources on its behalf or into denying service.
The system call interface must be flexible. Features such as wrapping or monitoring services provided by the operating system should be available. Adding new services without changing the system call interface for the existing services should be possible.
Note that services provided through the system call interface are typically wrapped by libraries and thus look as services provided by libraries. This makes it possible to call all services in a uniform way.
CP/M run on processors that did not distinguish any privileges. Its system call interface therefore did not have to solve many of the issues related to efficiency and robustness that concern contemporary systems. Instead, the system call interface has been designed with binary compatibility in mind.
When calling the BDOS module, the application placed a number identifying the requested service in register C, other arguments of the requested service in other registers, and called the BDOS entry point at address 5. The entry point contained a jump to the real BDOS entry point which could differ from system to system. The services provided by BDOS included console I/O and FCB based file operations.
Figure 2.4. CP/M BDOS System Call Example
ReadKey: mvi c,1 ; keyboard read service call 5 ; call BDOS entry point cpi a,0Dh ; is returned key code ENTER ? jnz ReadKey ; repeat keyboard read until it is
When calling the BIOS module, the application placed arguments of the requested service in registers and called the BIOS entry point for the specific service. The entry point for the specific service could differ from system to system, but its distance from the beginning of the BIOS module was the same for all systems.
Figure 2.5. CP/M BIOS System Call Entry Points
jmp BOOT ;cold boot jmp WBOOT ;warm boot jmp CONST ;console status jmp CONIN ;console input ... jmp HOME ;disk head to track 0 ... jmp SETDMA ;set memory transfer address jmp READ ;read sector jmp WRITE ;write sector
The Intel 80x86 processors traditionally serve as an example of why calling more privileged code from less privileged code can be slow:
On Intel 80286, an average MOV instruction took 2 clock cycles to execute. A call that changed the privilege level took over 80 clock cycles to execute. A call that switched the task took over 180 clock cycles to execute.
On Intel 80386, an average MOV instruction took 2 clock cycles to execute. A call that changed the privilege level took over 80 clock cycles to execute. A call that switched the task took over 300 clock cycles to execute.
Modern Pentium processors introduce the SYSENTER and SYSEXIT instructions for efficient implementation of the system call interface:
The SYSENTER instruction sets the stack pointer and instruction pointer registers to values specified by the operating system in advance to point to the operating system code executing at the most privileged level.
The SYSEXIT instruction sets the stack pointer and instruction pointer registers to values specified by the operating system in registers ECX and EDX to point to the application code executing at the least privileged level.
Note that the SYSENTER and SYSEXIT instructions do not form a complementary pair that would take care of saving and restoring the stack pointer and instruction pointer registers the way CALL and RET instructions do. It is up to the code using the SYSENTER and SYSEXIT instructions to do that.
The libraries wrapping the system call interface are called in the same way as any other libraries.
Figure 2.6. Library System Call Example
ssize_t read (int fd, void *buf, size_t count); ... int hFile; ssize_t iCount; char abBuffer [1024]; iCount = read (hFile, &abBuffer, sizeof (abBuffer)); pushl $1024 ;sizeof (abBuffer) pushl $abBuffer ;&abBuffer pushl hFile ;hFile call read@plt ;call the library addl $12,%esp ;remove arguments from stack movl %eax,iCount ;save result
The system call interface uses either the interrupt vector 80h, which is configured to lead to the kernel through a trap gate, or the SYSENTER and SYSEXIT instructions. In both cases, the EAX register contains a number identifying the requested service and other registers contain other arguments of the requested service.
Since the system call interface is typically called from within the system libraries, having two versions of the system call code would require having two versions of the libraries that contain the system call code. To avoid this redundancy, the system call interface is wrapped by a virtual library called linux-gate, which does not exist as a file, but is inserted by the kernel into the memory map of every process.
Figure 2.8. Linux Gate Library Based On SYSENTER And SYSEXIT
__kernel_vsyscall: push %ecx push %edx push %ebp __resume: mov %esp,%ebp sysenter jmp __resume ;hack for syscall resume __return: pop %ebp ;this is where pop %edx ;the SYSEXIT pop %ecx ;returns ret
References.Â
Johan Petersson: What Is linux-gate.so.1 ? http://www.trilithium.com/johan/2005/08/linux-gate
Linus Torvalds: System Call Restart. http://lkml.org/lkml/2002/12/18/218
A simplified example of reading a file using syslets is copied from Molnar.
References.Â
Ingo Molnar: Syslet and Threadlet Patches. http://people.redhat.com/mingo/syslet-patches
The libraries wrapping the system call interface are called in the same way as any other libraries.
Figure 2.9. Library System Call Example
int MessageBox ( HWND hwndOwner, LPCTSTR lpszText, LPCTSTR lpszTitle, UINT fuStyle); ... MessageBox (0, zMessageText, zWindowTitle, MB_OK || MB_SYSTEMMODAL || MB_ICONHAND); push MBOK or MB_SYSTEMMODAL or MB_ICONHAND push offset zWindowTitle push offset zMessageTest push 0 call MessageBoxA ;call the library add esp,16 ;remove arguments from stack
The system call interface uses either the interrupt vector 2Eh or the SYSENTER and SYSEXIT instructions. In both cases, the EAX register contains a number identifying the requested service and the EDX register points to a stack frame holding arguments of the requested service.