4.2.7. Disk Storage Devices

A disk is a device that can read and write fixed length sectors. The various flavors of disks differ in how their sectors are organized. A hard disk has multiple surfaces, each carrying sectors of typically 512 bytes organized in concentric tracks. A floppy disk has one or two surfaces with the same organization. A compact disk has one surface where sectors of typically 2048 bytes are organized in a single spiral track.

4.2.7.1. Addressing

Initially, sectors on a disk were addressed using the surface, track and sector numbers. This had several problems. First, implementations of the ATA hardware interface and the BIOS software interface typically limited the number of surfaces to 16, the number of cylinders to 1024, and the number of sectors per track to 63. Second, because the length of a track depends on its distance from the center of the disk, it is advantageous to vary the number of sectors per track, which makes the geometry irregular. Sectors on a disk are therefore now addressed using a logical block address that numbers sectors sequentially.
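The traditional translation between the two addressing schemes can be sketched as follows; the function name is illustrative, but the formula is the standard one used by the BIOS and ATA interfaces.

```c
#include <assert.h>

/* Classic CHS to LBA conversion. Cylinders and heads are numbered from
   zero, sectors are numbered from one. With the typical limits of 1024
   cylinders, 16 heads and 63 sectors per track, the addressable
   capacity is 1024 * 16 * 63 sectors, roughly 504 MiB with 512 byte
   sectors. */
unsigned long chs_to_lba (unsigned cylinder, unsigned head, unsigned sector,
                          unsigned heads_per_cylinder, unsigned sectors_per_track)
{
  return ((unsigned long) cylinder * heads_per_cylinder + head)
         * sectors_per_track + (sector - 1);
}
```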

4.2.7.1.1. Example: ATA Disk Access

An ATA disk denotes a disk using the Advanced Technology Attachment (ATA) or the Advanced Technology Attachment with Packet Interface (ATAPI) standard, both of which describe an interface between the disk and the computer. The ATA standard allows the disk to be accessed using the command block registers; the ATAPI standard allows the disk to be accessed using either the command block registers or the packet commands.

The command block registers interface relies on a number of registers, including the Cylinder High, Cylinder Low, Device/Head, Sector Count, Sector Number, Command, Status, Features, Error, and Data registers. Issuing a command entails reading the Status register until its BSY and DRDY bits are cleared, which indicates that the disk is ready, then writing the other registers with the required parameters, and finally writing the Command register with the required command. When the Command register is written, the disk will set the Status register to indicate that a command is being executed, execute the command, and finally generate an interrupt to indicate that the command has been executed. Data are transferred either through the Data register or using Direct Memory Access.

The packet commands interface relies on the command block registers interface to issue a command that sends a data packet, which is interpreted as another command. The packet commands interface is suitable for complex commands that cannot be described using the command block registers interface.

4.2.7.2. Request Queuing

Because of the mechanical properties of the disk, the relative speed of the computer and the disk must be considered. A problem arises when the computer issues requests for accessing consecutive sectors too slowly relative to the rotation speed; this can be solved by interleaving of sectors. Another problem arises when the computer issues requests for accessing random sectors too quickly relative to the access speed; this can be solved by queuing of requests. The strategy of processing queued requests is important.

  • The FIFO strategy of processing requests directs the disk to always service the first of the waiting requests. The strategy can suffer from excessive seeking across tracks.

  • The Shortest Seek First strategy of processing requests directs the disk to service the request that has the shortest distance from the current position of the disk head. The strategy can suffer from letting too distant requests starve.

  • The Bidirectional Elevator strategy of processing requests directs the disk to service the request that has the shortest distance from the current position of the disk head in the selected direction, which changes when no more requests in the selected direction are waiting. The strategy lets too distant requests starve for at most two passes over the disk in both directions.

  • The Unidirectional Sweep strategy of processing requests directs the disk to service the request that has the shortest distance from the current position of the disk head in the selected direction, or the longest distance from the current position of the disk head when no more requests in the selected direction are waiting. The strategy lets too distant requests starve for at most one pass over the disk in the selected direction.
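The Unidirectional Sweep strategy above can be sketched as follows; the function computes the service order of a queue of pending track numbers given the current head position. The names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

static int compare_ints (const void *a, const void *b)
{
  return (* (const int *) a) - (* (const int *) b);
}

/* Unidirectional Sweep: service pending requests in ascending track
   order starting from the current head position, then wrap around to
   the lowest pending track. The requests array is reordered in place
   into the order in which the requests would be serviced. */
void unidirectional_sweep (int requests [], int count, int head)
{
  qsort (requests, count, sizeof (int), compare_ints);
  /* Find the first request at or past the head position ... */
  int split = 0;
  while (split < count && requests [split] < head) split ++;
  /* ... and rotate the sorted array so that it comes first. */
  int ordered [count];
  for (int i = 0 ; i < count ; i ++)
    ordered [i] = requests [(split + i) % count];
  for (int i = 0 ; i < count ; i ++)
    requests [i] = ordered [i];
}
```

Note that no request is ever serviced on the return pass, which is what bounds starvation to a single pass over the disk.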

The strategy used to process the queue of requests can be implemented either by the computer in software or by the disk in hardware. The computer typically only considers the current track that the disk head is on, because the track does not change unless the computer commands the disk to move, as opposed to the current sector that the disk head moves over, which changes continuously as the disk rotates.

Most versions of the ATA interface do not support issuing a new request to the disk before the previous request is completed, and therefore cannot implement any strategy to process the queue of requests. In contrast, most versions of the SCSI and the SATA interfaces do support issuing a new request to the disk before the previous request is completed.

4.2.7.2.1. Example: SATA Native Command Queuing

A SATA disk uses Native Command Queuing as the mechanism for maintaining the queue of requests. The mechanism is coupled with First Party Direct Memory Access, which allows the drive to instruct the controller to set up Direct Memory Access for a particular part of a particular request.
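Native Command Queuing identifies each outstanding request by a tag from 0 to 31, so at most 32 requests can be queued at once. A minimal sketch of tracking free tags with one bit per tag follows; the helper names are illustrative, not part of any standard.

```c
#include <assert.h>

/* Allocate the lowest free tag from a 32 bit occupancy mask,
   or return -1 when all 32 tags are in use. */
int tag_alloc (unsigned *mask)
{
  for (int tag = 0 ; tag < 32 ; tag ++)
    if (! (*mask & (1u << tag)))
    {
      *mask |= (1u << tag);
      return tag;
    }
  return -1;
}

/* Return a tag to the pool when its request completes. */
void tag_free (unsigned *mask, int tag)
{
  *mask &= ~ (1u << tag);
}
```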

4.2.7.2.2. Example: Linux Request Queuing

[Linux 2.2.18 /drivers/block/ll_rw_blk.c] Although the Linux sources persistently use the name Elevator, the kernel actually sorts incoming requests by their linear sector numbers, with the exception of requests that have been waiting too long (passed over 256 times for reads, 512 times for writes), which are no longer passed over. As a result, programs that work intensively with the beginning of the disk block programs that work elsewhere.

[Linux 2.4.2 /drivers/block/ll_rw_blk.c & elevator.c] The newer Linux does better. New requests are first merged into sequences with existing requests where possible (with a limit on the maximum sequence length), and then sorted by sector number, but never placed at the head of the queue and never ahead of requests that have been waiting too long. The result is a unidirectional sweep with aging.

[Linux 2.6.x] The kernel makes it possible to associate a queueing discipline with a block device by providing modular request schedulers. The three schedulers implemented by the kernel are Anticipatory, Deadline Driven and Complete Fairness Queueing.

  • The Anticipatory scheduler implements a modified version of the Unidirectional Sweep strategy, which permits processing of requests that are close to the current position of the disk head but in the direction opposite to the selected one. Additionally, the scheduler enforces an upper limit on the time a request can starve.

    The scheduler handles read and write requests separately and inserts delays between read requests when it judges that the process that made the last request is likely to submit another one soon. Note that this implies sending the read requests to the disk one by one and therefore giving up the option of queueing read requests in hardware.

  • The Deadline Driven scheduler also implements a modified version of the Unidirectional Sweep strategy, except that it assigns a deadline to every request; when the deadline of a request expires, the scheduler processes the expired request and continues the sweep from that position of the disk head.

  • The Complete Fairness Queueing scheduler is based on the idea of queueing requests from processes separately and servicing the queues in a round robin fashion, or in a weighted round robin fashion directed by priorities.
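The round robin servicing of per process queues can be sketched as follows; the queue representation and the function name are illustrative toys, not the kernel's data structures.

```c
#include <assert.h>

/* A toy dispatch loop in the spirit of Complete Fairness Queueing:
   each pass takes one request from every nonempty per process queue
   in turn. Each queue is an array of sector numbers terminated by -1.
   Returns the number of requests written to out. */
int round_robin (const int *queues [], int nqueues, int out [])
{
  int emitted = 0;
  int index [nqueues];
  for (int q = 0 ; q < nqueues ; q ++) index [q] = 0;
  int progress = 1;
  while (progress)
  {
    progress = 0;
    for (int q = 0 ; q < nqueues ; q ++)
      if (queues [q][index [q]] != -1)
      {
        out [emitted ++] = queues [q][index [q] ++];
        progress = 1;
      }
  }
  return emitted;
}
```

A weighted variant would simply take several requests per pass from queues with higher priorities.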

[ This information is current for kernel 2.6.19. ]


4.2.7.3. Failures

Handling of disk errors: retries, controller reset, errors in software. Management of bad blocks, possibly bad tracks, in hardware; SMART diagnostics. Caching: whole track caching, read ahead, write back. Mention mirroring and redundant disk arrays.

RAID 0 uses striping to speed up reading and writing. RAID 1 uses plain mirroring and therefore requires pairs of disks of the same size. RAID 2 uses bit striping with a Hamming code. RAID 3 uses byte striping with a dedicated parity disk. RAID 4 uses block striping with a dedicated parity disk. RAID 5 uses block striping with distributed parity. RAID 6 uses block striping with double distributed parity. The levels were initially defined in a paper by authors from the University of California, Berkeley, but vendors tend to tweak the levels as they see fit. RAID 2 is not used, RAID 3 is rare, RAID 5 is frequent. RAID 0+1 and RAID 1+0 or RAID 10 combine RAID 0 and RAID 1.
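The parity used by RAID 3 through RAID 6 is a plain XOR over the data blocks of a stripe, which is what makes recovery from a single disk failure possible: the missing block is the XOR of all surviving blocks, parity included. A minimal sketch, with an illustrative function name:

```c
#include <assert.h>
#include <stddef.h>

/* XOR the given blocks byte by byte into parity. Computing the parity
   block of a stripe and reconstructing a lost block after a single
   disk failure are the same operation, only over different inputs. */
void xor_parity (const unsigned char *blocks [], int count,
                 size_t size, unsigned char parity [])
{
  for (size_t i = 0 ; i < size ; i ++)
  {
    parity [i] = 0;
    for (int b = 0 ; b < count ; b ++)
      parity [i] ^= blocks [b][i];
  }
}
```

RAID 6 needs a second, independent syndrome on top of this so that two concurrent failures can be survived.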

4.2.7.3.1. Example: SMART Diagnostics

[Linux 2.6.10] Running smartctl -a /dev/hda prints all device information. Each attribute has a raw value and a normalized value; the raw value is usually, but not necessarily, human readable, the normalized value ranges from 1 to 254. A threshold ranging from 0 to 255 is associated with the normalized value, and the worst lifetime value of the attribute is also kept. An attribute has failed when its normalized value is less than or equal to its threshold. Attributes are of two types, pre failure and old age; a failed pre failure attribute signals imminent failure, a failed old age attribute signals end of life. Attributes are numbered and some numbers are standardized.
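The health check described above amounts to a single comparison; a minimal sketch with illustrative names follows.

```c
#include <assert.h>

/* A SMART attribute has failed when its normalized value has dropped
   to or below its threshold. Checking the worst lifetime value also
   catches attributes that failed in the past but have since recovered. */
int smart_attribute_failed (int worst_value, int threshold)
{
  return worst_value <= threshold;
}
```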

4.2.7.4. Partitioning

Mention partitioning and logical volume management.

4.2.7.4.1. Example: IBM Volume Partitioning

To be done.

4.2.7.4.2. Example: GPT Volume Partitioning

To be done.

4.2.7.4.3. Example: Linux Logical Volume Management

Physical volumes, logical volumes, extents (size e.g. 32M), mapping of extents (linear or striped), snapshots.

> vgdisplay

  --- Volume group ---
  VG Name               volumes
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  10
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               1.27 TiB
  PE Size               32.00 MiB
  Total PE              41695
  Alloc PE / Size       24692 / 771.62 GiB
  Free  PE / Size       17003 / 531.34 GiB
  VG UUID               fbvtrb-GFbS-Nvf4-Ogg3-J4fX-dj83-ebh39q
> pvdisplay --map

  --- Physical volume ---
  PV Name               /dev/md0
  VG Name               volumes
  PV Size               931.39 GiB / not usable 12.56 MiB
  Allocatable           yes
  PE Size               32.00 MiB
  Total PE              29804
  Free PE               17003
  Allocated PE          12801
  PV UUID               hvfcSD-FvSp-xJn4-lsR3-40Kx-LdDD-wvfGfV

  --- Physical Segments ---
  Physical extent 0 to 6875:
    Logical volume	/dev/volumes/home
    Logical extents	0 to 6875
  Physical extent 6876 to 6876:
    Logical volume	/dev/volumes/var
    Logical extents	11251 to 11251
  Physical extent 6877 to 12800:
    Logical volume	/dev/volumes/home
    Logical extents	6876 to 12799
  Physical extent 12801 to 29803:
    FREE
> lvdisplay --map

  --- Logical volume ---
  LV Name                /dev/volumes/home
  VG Name                volumes
  LV UUID                OAdf3v-zfI1-w5vq-tFVr-Sfgv-yvre-GWFb3v
  LV Write Access        read/write
  LV Status              available
  LV Size                400.00 GiB
  Current LE             12800
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

  --- Segments ---
  Logical extent 0 to 6875:
    Type		linear
    Physical volume	/dev/md0
    Physical extents	0 to 6875

  Logical extent 6876 to 12799:
    Type		linear
    Physical volume	/dev/md0
    Physical extents	6877 to 12800
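The segment maps shown by the lvdisplay output above boil down to a simple translation from logical to physical extents; a sketch follows, with the segment table of /dev/volumes/home used as the test data. The type and function names are illustrative.

```c
#include <assert.h>

/* A logical volume with linear mapping is a list of segments, each
   mapping a contiguous range of logical extents onto a contiguous
   range of physical extents on some physical volume. */
struct segment
{
  int logical_first;
  int logical_last;
  int physical_first;
};

/* Translate a logical extent number to a physical extent number,
   or return -1 when the extent is not mapped. */
int extent_to_physical (const struct segment table [], int count, int logical)
{
  for (int i = 0 ; i < count ; i ++)
    if (logical >= table [i].logical_first && logical <= table [i].logical_last)
      return table [i].physical_first + (logical - table [i].logical_first);
  return -1;
}
```

A striped mapping would additionally alternate between physical volumes every few extents, and a snapshot would redirect reads of unmodified extents to the origin volume.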