5.2.1. Disk Layout

5.2.1. Disk Layout
Prev	5.2. File Subsystem Internals	Next

Bunch of blocks. Tree. Log.

5.2.1.1. Handling of Files

Přechod z pásek na disky, první nápad se sekvenčním ukládáním souborů. Má to dvě výhody, totiž rychlost a malou režii. Nevýhodou je potřeba znát předem délku souboru a pochopitelně fragmentace.

První nápad jak tohle odstranit je udělat linked list. C64 měl tohle na floppy, nevýhody jsou zřejmé. Extrémně pomalý random access, velikost bloků není mocnina dvou, špatně se maže a tak.

Další modifikace je nechat linked list, ale vytáhnout ho z bloků a dát do tabulky. Typické řešení MS-DOSu. Nevýhodné to začne být když se celá tahle tabulka nevejde do paměti, lidi napadlo mít jí po kouskách u souborů, výsledek je třeba CP/M nebo UNIX. Co dělat když je tahle tabulka moc velká, CP/M přidává lineárně další bloky, UNIX stromově větví I-nodes.

Ukládání alokačních informací o souborech do adresářových položek, třeba alá CP/M, má ještě jednu značnou nevýhodu. Pokud totiž není možné oddělit jméno souboru od jeho alokační informace, není možné dělat hard linky.

5.2.1.2. Handling of Directories

Triviální případ jednoúrovňového adresáře, koncept ROOTu v MS-DOSu. Hierarchické adresáře, ukládání podadresářů do nadřazených adrešářů. DOSácká klasika, totéž u UNIXu. Jak do adresáře zadělat linky, koncept hard linku a symbolic linku.

Výhody a nevýhody hardlinku, rychlost, špatné mazání, nepřekročí hranici file systému. Symbolický link, totéž. Drobnost při zálohování a kopírování souborů, nutnost služby rozeznávající link od normálního souboru.

5.2.1.3. Handling of Free Space

Přidělování volného místa, problém velikosti bloků. Hlediska pro větší bloky, rychlost, malá režie, hlediska pro menší bloky, malá fragmentace.

Evidence volného místa, seznam volných bloků a bitmapy. Funkce bitmap je jasná, seznam volných bloků se zdá být nevýhodný, až na schopnost mít v paměti jen malou část tablice a přesto uspokojovat velký počet dotazů na volné bloky, a možnost ukládání právě ve volném místě. Možnost využití seznamu alá FAT. Evidence vadných bloků, vadný soubor, označení vadných bloků.

Diskové kvóty, mechanizmus hard a soft kvóty. Princip implementace, tablice otevřených souborů, tablice majitelů otevřených souborů.

5.2.1.4. Performance

Malá rychlost a malý počet bloků cache. Vhodná strategie závisí na aplikaci, úprava pro přednostní caching adresářů a I-nodes. Write-back caching, rozdělení místa mezi write-back a read cache.

Minimalizace pohybu hlavičky, umístění adresářů do středu disku, rozdělení velkého disku na segmenty. Alokace souborů do sousedních bloků, defragmentace.

5.2.1.5. Reliability

Požadavek spolehlivosti, nejjednodušším řešením je zálohování. Podpora pro zálohování, archivní atribut, detekce linků, snapshot. Zálohování na pásky, na disky, mirroring.

Konzistence systému, důležitost některých oblastí disku. Přednostní zápis adresářů, I-nodes, FAT, alokačních map a tak. Periodický sync. Kontrola konzistence při bootování systému. Unerase, nevýhody dolepovaných unerase, podpora v systému.

5.2.1.6. Example: FAT File System

Klasika, boot sektor s rozměry filesystému, za ním dvakrát FAT, za ní root directory, za ním data area. Adresářová položka obsahuje name, attributes, first cluster. Bad clusters mají extra známku ve FAT.

Nevýhody zahrnují nepohodlnou práci s FAT (je velká, nelze z ní snadno vytáhnout data týkající se jednoho souboru), nepohodlnou práci s adresáři (krátká jména souborů, málo atributů, špatné prohledávání, možnost fragmentace adresářů vymazanými jmény), režie na velké clustery.

Modifikace s rozšířením na větší čísla clusterů a delší jména souborů. Větší čísla jsou prostě tak, rozšířená na 32 bitů, žádný problém. Delší jména souborů jsou uložena do vhodných míst extra položek v adresáři, označených nesmyslnou kombinací atributů, v unicode. Hypotéza je, že to je kvůli kompatibilitě se staršími systémy, když se na diskety nahraje něco s dlouhým jménem.

5.2.1.7. Example: HPFS File System

Ačkoliv z OS/2, produkt Microsoftu. Citace z roku 1989 říká, že "HPFS solves all the problems of the FAT file system and is designed to meet the demands expected into the next few decades."

Na začátku disku je vyhrazeno 16 sektorů na bootstrap loader, následuje superblock s rozměry disku a pointery na bad block list, directory band, root directory a spareblock. Zbytek disku je rozdělen na 8 MB bands, každý band má free sectors bitmap a data area, které se střídají tak, aby bands sousedily buď bitmapami, nebo data areas.

Každý soubor je reprezentovaný strukturou F-node, která se přesně vejde do sektoru. Každý F-node obsahuje různé access control lists, attributes, usage history, last 15 chars of file name, plus an allocation structure. Allocation structure je buď 8 runs přímo v F-node, každý run je 32 bitů starting sector a 32 bitů number of sectors, nebo B+ strom o 12 větvích, jehož leaf nodes obsahují až 40 runs. Zajímavou věcí jsou extended attributes, u každého souboru se může uložit až 64 KB name/value párů, které jsou buď přímo v F-node, nebo v extra runu.

Adresáře jsou podobně jako soubory reprezentované strukturou F-node, pointer na F-node root directory je v superbloku. Adresář má položky různé délky ve 2 KB blocích uspořádaných jako B strom, položek se při jménech kolem 10 znaků vejde do 2 KB bloku tak 40. V každé položce je jméno, usage count, F-node pointer.

Jednou výhodou HPFS je alokační strategie, díky které jsou soubory ukládány pokud možno v souvislých blocích, a díky které je F-node blízko u dat souborů. Používá se prealokace po 4 KB, přebytečné bloky se vrací při zavření souboru. Samozřejmostí je read ahead a write back; dokumentace tvrdí, že se u souboru pamatuje usage pattern a podle něj se toto řídí.

Zajímavá je také fault tolerance. Systém si udržuje hotfix map, pokud narazí na chybu sektoru, tak jej do této mapy přidá a zobrazá warning, ve vhodné chvíli se pak soubory ležící v hotfixed sektorech přesunou jinam a hotfix se vyprázdní. Při power outage se podle dirty flagu ve spareblocku pozná, že vše není v pořádku, recovery pak může použít magic identifiers, které jsou přítomné ve všech zajímavých strukturách pro nalezení F-nodes a directories, které jsou navíc linked to each other.

Poznámka stranou, B strom je vyvážený strom s daty ve všech uzlech, B+ strom je vyvážený strom s daty pouze v listech. Jinak snad normální ...

5.2.1.8. Example: EXT2 And EXT3 And EXT4 File Systems

The filesystem uses the classical structure starting with a bootstrap area and continuing with blocks, each block containing a copy of the superblock, filesystem descriptors, free block bitmap, free inode bitmap, inode table fragent, and data area. These blocks serve the same role as bands or groups in other filesystems and should not be confused with equal sized blocks within the data area.

Free space is allocated by blocks. A free block bitmap is used to keep track of free blocks.

A file is represented by an inode. The inode contains information such as file type, access rights, owners, timestamps, size, and the information about where the data of the file resides, stored either as a block map or as an extent tree.

The block map contains pointers to direct blocks, a pointer to a single level indirect block, which contains pointers to direct blocks, a pointer to a double level indirect block, which contains pointers to single level indirect blocks, and a pointer to a triple level indirect block, which contains pointers to double level indirect blocks. By default, 12 pointers to direct blocks reside in the inode, 1024 pointers to indirect blocks reside in a block.

The extent tree, available starting with EXT4, describes consecutive sequences of blocks called extents. The extent tree has up to 4 entries in the inode, additional entries reside in potential interim blocks, both interim entries and leaves are 12 bytes long. This means that by default, the extent tree would need 5 levels to cover a hypothetical file with 2^32 extents.

Some versions of the filesystem could store file tails in block fragments. The inode structure therefore contains block fragment related fields, which, however, are not used in the current filesystem implementations.

struct ext2_inode {
  __u16  i_mode;			/* File mode */
  __u16  i_uid;                         /* Owner ID */
  __u32  i_size;                        /* Size in bytes */
  __u32  i_atime;                       /* Access time */
  __u32  i_ctime;                       /* Creation time */
  __u32  i_mtime;                       /* Modification time */
  __u32  i_dtime;                       /* Deletion Time */
  __u16  i_gid;                         /* Group ID */
  __u16  i_links_count;                 /* Links count */
  __u32  i_blocks;                      /* Blocks count */
  __u32  i_flags;                       /* File flags */
  __u32  i_block [EXT2_N_BLOCKS];       /* Ptrs to blocks */
  __u32  i_version;                     /* File version for NFS */
  __u32  i_file_acl;                    /* File ACL */
  __u32  i_dir_acl;                     /* Directory ACL */
  __u32  i_faddr;                       /* Fragment address */
  __u8   l_i_frag;                      /* Fragment number */
  __u8   l_i_fsize;                     /* Fragment size */
};

#define EXT2_DIR_BLOCKS		12
#define EXT2_IND_BLOCK          EXT2_DIR_BLOCKS
#define EXT2_DIND_BLOCK		(EXT2_IND_BLOCK + 1)
#define EXT2_TIND_BLOCK		(EXT2_DIND_BLOCK + 1)
#define EXT2_N_BLOCKS           (EXT2_TIND_BLOCK + 1)

#define EXT2_SECRM_FL   	0x00000001      /* Secure del */
#define EXT2_SYNC_FL    	0x00000008      /* Sync update */
#define EXT2_IMMUTABLE_FL       0x00000010      /* Immutable */
#define EXT2_APPEND_FL          0x00000020      /* Only ap */

Directories are stored either unsorted, or with a tree index of file name hashes. When stored unsorted, a directory is simply an array of variable length entries. Entries never cross block boundary, unused space within blocks is marked with empty entries. When stored with a tree index, a directory starts with a tree root block, which points to potential interim blocks and eventually blocks with leaves, which are standard directory entries. All tree blocks start with a fake empty directory entry so that directory content remains backwards compatible.

struct ext2_dir_entry_2 {
  __u32  inode;             		/* Inode number */
  __u16  rec_len;                       /* Directory entry length */
  __u8   name_len;                      /* Name length */
  __u8   file_type;			/* File type */
  char   name [EXT2_NAME_LEN];   	/* File name */
};

#define EXT2_NAME_LEN 255

#define EXT2_FT_REG_FILE        1
#define EXT2_FT_DIR             2
#define EXT2_FT_CHRDEV  	3
#define EXT2_FT_BLKDEV  	4
#define EXT2_FT_SYMLINK 	7

A quick overview of other features includes a bad block map kept in reserved inode 1, an administrator space reservation.

Pro odhad, v Linuxu je na 8GB partition celkem 1M I-nodes, z toho jsou pro běžné soubory tak 4% použitých, každý blok má 130MB. Extra atributy se dají číst a měnit přes lsattr a chattr, patří mezi ně IMM (immutable), APP (append only), SYNC (synchronous update), COMPRESS, UNDELETE, SAFEDEL (tyto tři ale kernel ignoruje ?).

Journalling mode for data is either writeback, ordered, or journal. Writeback means data are not journalled. Ordered means data are written to normal location before corresponding metadata are journalled. Journal means both data and metadata are journalled. Journalling is done to a special file.

References.

Tweedie S. C.: Journaling the Linux ext2fs Filesystem

5.2.1.9. Example: NTFS File System

Na začátku disku je jen bootsektor s rozměry disku a pointerem na MFT, jeho kopie je uložena ještě kdesi na (konci ?) partition. Celý zbytek disku se adresuje po clusterech, které jsou podobně jako u FAT mocninné násobky sektorů. Na disku neleží nic než soubory, informace o nich jsou uloženy v MFT aneb Master File Table. Každý soubor je jednoznačně identifikován indexem do MFT, MFT sama je také soubor s indexem 0, další význačné indexy jsou 1 pro MFT mirror, 2 pro transaction log, 3 pro root directory, 4 pro allocationj bitmap, 5 pro bootstrap, 6 for bad cluster file atd.

Některé ze skrytých souborů lze vypsat, chodí příkazy dir /ah $bitmap, dir /ah $badclus, dir /ah $mftmirr atd. (ovšem krom vypsání v root adresáři už tuším nic nejde). Ve Windows 2000 už zdá se nejde ani tohle.

Každý soubor je set of attributes, jeden z atributů je default, to je data stream. Každý záznam v MFT obsahuje magic, update sequence (podobná NFS, je potřeba aby při reuse záznamu bylo možné poznat staré reference), reference count, flags, potenciálně pointer na base file record pokud toto je extension file record (když se vše nevejde do base file recordu). Následují file attributes, u nich záznam obsahuje jméno, typ a data, data mohou být buď resident, v tom případě následují přímo v záznamu, nebo non-resident, v tom případě následuje v záznamu run list, což je sekvence bloků clusterů podobně jako u HPFS.

Adresáře jsou uložené jako soubory, jejichž obsah je B strom s referencemi na obsažené soubory.

Mírně zjednodušeno. Největší nevýhodou se zdá být fragmentace. To prevent fragmentation of MFT, NTFS makes sure a free area called MFT zone exists after MFT. Each time the disk becomes full, MFT zone is halved.

5.2.1.9.1. Multiple Streams

Jednotlivé streams v souboru jsou označené jmény a lze k nim přistupovat otevřením souboru se jménem file:stream_name:$stream_type. Default stream se nijak nejmenuje a typ má data, takže file a file::$data je totéž (to se používalo pro útok na Microsoft Information Server, který u jmen s explicitně uvedeným streamem nepoznal příponu, takže člověk mohl číst zdrojáky skriptů). Legračně délka souboru odráží pouze default stream, takže data v dalších streamech zabírají místo na disku, ale v adresářích nejsou vidět (také se leckdy nekopírují, Explorer OK, ale FAR ne).

Některé konkrétní atributy, krom $DATA nepřístupné aplikaci:

Atribut	Obsah
$VOLUME_VERSION	Volume version
$VOLUME_NAME	Disk's volume name
$VOLUME_INFORMATION	NTFS version and dirty flag
$FILE_NAME	File or directory name
$STANDARD_INFORMATION	File time stamps and hidden, system, and read-only flags
$SECURITY_DESCRIPTOR	Security information
$DATA	File data
$INDEX_ROOT	Directory content
$INDEX_ALLOCATION	Directory content
$BITMAP	Directory content mapping
$ATTRIBUTE_LIST	Describes nonresident attribute headers

Ještě pozoruhodněji, Windows dlouhou dobu neměly pro práci se streams žádné API, tedy nešel snadno vypsat seznam streams apod. Řešení nabízela funkce BackupRead, která ze souboru vyrobí speciální backup stream, určený pro zálohování. Tento stream obsahuje data potřebná pro kompletní rekonstrukci soubor, tedy i streams a jeho formát je známý. Zdá se, že i ACL jsou uložené jako stream (?).

HANDLE FindFirstStreamW (
  LPCWSTR lpFileName,
  STREAM_INFO_LEVELS InfoLevel,
  LPVOID lpFindStreamData,
  DWORD dwFlags);
BOOL FindNextStreamW (
  HANDLE hFindStream,
  LPVOID lpFindStreamData);

typedef enum _STREAM_INFO_LEVELS {
  FindStreamInfoStandard
} STREAM_INFO_LEVELS;

typedef struct _WIN32_FIND_STREAM_DATA {
  LARGE_INTEGER StreamSize;
  WCHAR cStreamName [MAX_PATH + 36];
} WIN32_FIND_STREAM_DATA;

The only thing worth noting on the stream enumeration interface is probably the inconsistent use of system constants suggested by the need to add an arbitrary constant to MAX_PATH.

5.2.1.9.2. Cache Manager

Quoted from [Mark Russinovich, David Solomon: Windows XP: Kernel Improvements Create a More Robust, Powerful, and Scalable OS]

In order to know what it should prefetch, the Windows XP Cache Manager monitors the page faults, both those that require that data be read from disk (hard faults) and those that simply require data already in memory be added to a working set (soft faults), that occur during the boot process and application startup. By default, it records 120 seconds of the boot process, 60 seconds after all services have finished initializing, or 30 seconds after the shell starts, whichever occurs first. The Cache Manager also monitors the first 10 seconds of application startup.

After collecting a trace organized into faults taken on MFT (if the application accesses files or directories), the files referenced, and the directories referenced, the Cache Manager notifies the prefetch component of the Task Scheduler that performs a call to the internal NtQuerySystemInformation system call requesting the trace data. After performing post-processing on the trace data, the Task Scheduler writes it out to a file in the /Windows/Prefetch folder. The file's name is the name of the application to which the trace applies followed by a dash and the hexadecimal representation of a hash of the file's path. The file has a .pf extension, so an example would be NOTEPAD.EXE-AF43252301.PF. An exception to the file name rule is the file that stores the boot's trace, which is always named NTOSBOOT-B00DFAAD.PF (a convolution of the hexadecimal-compatible word BAADF00D, which programmers often use to represent uninitialized data).

When the system boots or an application starts, the Cache Manager is called to give it an opportunity to perform prefetching. The Cache Manager looks in the prefetch directory to see if a trace file exists for the prefetch scenario in question. If it does, the Cache Manager calls NTFS to prefetch any MFT references, reads in the contents of each of the directories referenced, and finally opens each file referenced. It then calls the Memory Manager to read in any data and code specified in the trace that's not already in memory. The Memory Manager initiates all of the reads asynchronously and then waits for them to complete before letting an application's startup continue.

5.2.1.9.3. Backup Support

Quoted from [Mark Russinovich, David Solomon: Windows XP: Kernel Improvements Create a More Robust, Powerful, and Scalable OS]

A new facility in Windows XP, called volume shadow copy, allows the built-in backup utility to record consistent views of all files, including open ones. The shadow copy driver is a type of driver, called a storage filter driver, that layers between file system drivers and volume drivers (the drivers that present views of the disk sectors that represent a logical drive) so that it can see the I/O directed at a volume.

When the backup utility starts a backup operation it directs the volume shadow copy driver (/Windows/System32/Drivers/Volsnap.sys) to create a volume shadow copy for the volumes that include files and directories being recorded. The volume shadow copy driver freezes I/O to the volumes in question and creates a shadow volume for each. For example, if a volume's name in the Object Manager namespace is /Device/HarddiskVolume0, the shadow volume might be named /Device/HarddiskVolumeShadowCopyN, where N is a unique ID.

Instead of opening files to back up on the original volume, the backup utility opens them on the shadow volume. A shadow volume represents a point-in-time view of a volume, so whenever the volume shadow copy driver sees a write operation directed at an original volume, it reads a copy of the sectors that will be overwritten into a paging file-backed memory section that's associated with the corresponding shadow volume. It services read operations directed at the shadow volume of modified sectors from this memory section, and services reads to non-modified areas by reading from the original volume.

Because the backup utility won't save the paging file or the contents of the system-managed /System Volume Information directory located on every volume, the snapshot driver uses the defragmentation API to determine the location of these files and directories, and does not record changes to them.

By relying on the shadow copy facility, the Windows XP backup utility overcomes both of the backup problems related to open files.

The shadow copy driver is actually only an example of a shadow copy provider that plugs into the shadow copy service (/Windows/System32/Vssvc.exe). The shadow copy service acts as the command center of an extensible backup core that enables ISVs to plug in writers and providers. A writer is a software component that enables shadow copy-aware applications to receive freeze and thaw notifications in order to ensure that backup copies of their data files are internally consistent, whereas providers allow ISVs with unique storage schemes to integrate with the shadow copy service. For instance, an ISV with mirrored storage devices might define a shadow copy as the frozen half of a split-mirrored volume.

5.2.1.9.4. Summary

References.

Russinovich M.: Inside NTFS
Wijk J. v.: NTFS Disk Structure Definitions

5.2.1.10. Example: XFS File System

XFS has been designed by SGI and provides support for large numbers of large files in large directories accessed by large numbers of clients. This is achieved by using balanced trees for most structures and by providing metadata logging.

XFS divides the disk into groups, each group contains metadata and data areas. The group metadata area contains a copy of the superblock, pointers to roots of two group free block trees, a pointer to the root of a group inode tree, and a reserved group free block list. The data area is split into equal sized blocks. Most references used by XFS come in two flavors, a relative reference within a group and an absolute reference, which is created by prepending the group identifier to the relative reference.

Free space is allocated by blocks. The two free block trees allow locating free space by block number and by extent size, each leaf of a tree points to a free extent. The reserved free block list contains free blocks reserved for growing the free block trees.

Files are represented by inodes. The inode tree allows locating an inode by inode number, each leaf of the tree points to a block with a sparse array of 64 inodes. An inode contains basic information about a file and points to file data and extended attributes using structures called a data fork and an attribute fork. Depending on the size of the data referenced by the fork, the data is stored:

directly within the fork inode. The default size of an inode is 256 bytes, out of which 100 bytes are used by the basic information, leaving 156 bytes for the forks.
in extents listed within the fork inode. The default size of an inode provides enough space for up to 19 extents.
data in a tree. When the file data is stored in a tree, the keys of the tree are offsets within the file and the leaves of the tree are extents.

Directories use either short form, block form, leaf form, node form, or tree form, depending on their size. All forms have a variable length entry containing the file name and the file inode.

A short form directory stores entries directly within its inode.
A block form directory stores entries in a single extent, which also contains a sorted array of file name hashes and an array of a few largest free entries in the extent.
A leaf form directory stores entries in multiple entry extents and single extent with a sorted array of file name hashes and an array of a few largest free entries in the entry extents.
A node form directory stores entries in multiple entry extents, a tree of file name hashes and an extent with an array of a few largest free entries in the entry extents.
Finally, a tree form directory uses trees to store the array of entries, the file name hashes, and the array of a few largest free entries in the entry extents.

Attributes use either short form, leaf form, node form, or tree form, depending on their size. The forms of attribute storage are similar to the forms of directory storage, except that the names and values of attributes are kept together with name hashes, while the entries were kept separate from name hashes.

Metadata modifications are journalled.

References.

SGI: XFS Filesystem Structure. http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf
SGI: XFS Overview and Internals. http://oss.sgi.com/projects/xfs/training/index.html

5.2.1.11. Example: CD File System

Standard ISO9660 a ECMA119. Disk rozdělen na sektory zpravidla 2048 bytes, prvních 16 sektorů prázdných pro bootstrap loader. Zbytek disku je popsán sekvencí volume descriptors, jeden per sektor, nejdůležitější je Primary Volume Descriptor s adresou root directory, path table a dalšími zbytečnostmi (copyright, abstract, bibinfo). Adresáře jsou usual stuff, name of 30 chars max, attributes, soubor je udán počátečním sektorem a délkou (teoreticky je možné uvést víc adresářových položek pro soubor z více fragmentů, ale nepoužívá se, stejně jako se zdá se nepoužívá interleaving).

Jedním zajímavým detailem v ISO9660 jsou path tables. Aby se adresáře nemusely prohledávat item by item, uloží se do path table seřazený (podle hloubky a dalších kritérií) seznam všech cest na disku, pro každou cestu obsahuje path table sektor příslušného adresáře a jeho parenta.

The standard imposes a number of weird limits on the file system structure, such as maximum directory nesting depth of 8, only capital letters, digits and underscores in file names, no extensions in directory names, etc. For this reason, extensions such as Joliet and Rock Ridge have appeared.

References.

Erdelsky P. J.: ISO9660 Simplified For DOS/Windows

5.2.1.12. Example: UDF File System

Standard ISO13346 a UDF a ECMA167. Základní principy podobné ISO9660 a ECMA116.

Zajímavý je koncept Prevailing Descriptors pro připisovatelná média. Každý deskriptor má u sebe verzi, či lépe pořadové číslo, a pokud se v seznamu deskriptorů najde více deskriptorů téhož typu, uvažuje se ten s nejvyšší verzí. Protože seznam deskriptorů není (nemusí být) ukončený, lze na jeho konec připisovat nové deskriptory, které nahradí staré. S tím souvisí ještě koncept Virtual Allocation Table, která mapuje logické na fyzické sektory disku. Všechny údaje na disku jsou v logických sektorech, když je potřeba například přepsat část souboru, mohou se adresáře i zbytek souboru nechat tam kde jsou a jen se upraví mapování.

5.2.1.13. Example: JFFS2 File System

JFFS2 is a journalling file system that accommodates the specific nature of the flash memory devices, which are organized in blocks that need to be erased before writing.

The file system views the entire flash memory device as a log consisting of arbitrarily arranged blocks. The log contains records called nodes, nodes fill up blocks but may not span block boundaries so that independent garbage collection of blocks remains possible. There are three types of nodes:

An inode node, which contains metadata and optionally a fragment of data belonging to a file. Compression of the fragments is supported, a fragment should generally fit a memory page on the host.
A dirent node, which contains the inode number of the directory that the entry belongs to, and the name and inode number of the file that the entry describes.
A cleanmarker node, which marks a successfully erased block.

All the inode and dirent nodes contain a version number. An update of a node is done by writing a new version of the node at the tail of the log. When the file system is mounted, the blocks of the log are scanned (not necessarily from head to tail, due to independent garbage collection of blocks), creating an overview of the latest versions of all nodes.

Garbage collection frees space for the tail of the log by picking a random block and copying whatever of its content is not outdated to the tail of the log. Statistical preference is given to blocks with at least some outdated content, so that proper balance between precise wear levelling and increased wear associated with copying is maintained.

References.

David Woodhouse: JFFS: The Journalling Flash File System. http://sources.redhat.com/jffs2/jffs2.pdf

5.2.1.14. Example: Spiralog File System

Tohle je zajímavý systém od Digitalu, založený na log structure.

The file system consists of multiple servers and clerks. Clerks run near client applications and are responsible for caching and cache coherency and ordered write back. Servers run near disks and are reponsible for carrying out idempotent atomic sets of operations on behalf of clerks. Disks can be attached to multiple servers but only one of those servers accesses the disks, one of the remaining servers is chosen to access the disks if the current server fails.

Clerks present clients with files that can have user defined attributes and multiple data streams. Servers store files in an infinite log.

At top level, server handles objects with unique identification, with multiple named cells for storing data accessed in one piece, and with multiple numbered streams for storing data accessed in multiple pieces. Files are mapped onto objects with attributes in cells and contents in streams. Directories are mapped onto objects with attributes and entries in cells.

At medium level, server handles infinite log. Objects are mapped into B tree stored in the log, the keys are object identifiers, the leaves are objects. Cells and streams are mapped into B trees of a single leaf of the object B tree. When cells are stored in a B tree, the keys are names of the cells and the leaves denote cell data. When a stream is stored in a B tree, the keys are positions in the stream and the leaves denote stream extents. Optimizations that store short extents within their leaves also apply.

At bottom level, server handles segments. Segments are blocks of consecutive sectors 256 kB long. A segment consists of a data area and a commit record area that are written in two physical phases for each logical write. Log is mapped into segments using a segment array. Cleaner process compacts the old segments by copying. Checkpointing process keeps the number of B tree change records that have to be applied during tree reconstruction down to a reasonable limit.

References.

Johnson J.E., Laing W.A.: Overview of the Spiralog File System, Digital Technical Journal 8(2), DEC, 1996
Whitaker C., Bayley J., Widdowson R.: Design of the Server for the Spiralog File System, Digital Technical Journal 8(2), DEC, 1996

5.2.1.15. Example: Reiser File System

Další méně obvyklý systém, chodí pod Linuxem a zaměruje se na efektivitu při práci s velkým množstvím malých souborů. Problém s malými soubory je overhead při alokaci, který je tady řešený tak, že se na celý disk pohlíží jako na jeden B* strom. Uzly tohoto stromu jsou v blocích, které jsou násobky velikosti sektoru, uzel je buď nepřímý, pak obsahuje pouze klíče a pointery na potomky, nebo přímý formátovaný, pak obsahuje seznam prvků uložený tak, že od začátku uzlu narůstají hlavičky (directory item, indirect data, direct data) a od konce těla prvků, nebo přímý neformátovaný, pak obsahuje data velkého souboru do násobku velikosti bloku.

Celý tenhle cirkus zaručuje, že se malé soubory budou ukládat pohromadě do jednoho bloku, čímž se spoří místo. To, kam přesně se co uloží, je dané klíčem. Klíče jsou proto udělané tak, že obsahují vždy parent object ID, local object ID, offset, uniqueness, čímž se zaručuje, že všechny objekty budou pohromadě u svého parenta (například directory entries z jednoho directory pohromadě). Bloky jsou alokované near each other, evidují se v bitmapě, bitmapy jsou rozmístěny mezi datovými bloky, vždy jeden blok bitmapa a pak tolik bloků data, kolik se dá popsat v jednom bloku bitmapy.

Ještě zajímavá je konzistence. Problémem u takto složitého file systému je situace, kdy se kvůli vyvážení stromu musí přepisovat již existující struktury. Pokud v tu chvíli systém spadne, hrozí poškození starých dat. Původní verze file systému toto řešily tak, že zavedly uspořádání na všech zápisech na disk tak, aby zápisy dat v nových pozicích předcházaly smazání dat ve starých pozicích. To bylo ale závěrem příliš složité, takže teď se při odebrání položky změní pozice bloku tak, aby se nepřepisovala stará verze (hledá se nejbližší volné místo) a stará verze se uloží do preserve listu. Preserve list se vyprázdní, když v paměti nejsou žádné bloky, do kterých se přidávaly položky. Ještě novější verze mají log.

Krom zjevné úspornosti má také problémy, jedna je s rychlostí u souborů, které jsou mírně menší než bloky, protože ty se ukládají jako dva direct items. Druhá je u preserve listu při velkém počtu malých souborů, je potřeba často flushovat aby se mohl vyprázdnit. Třetí jsou problémy s memory mapped files když soubor není aligned. Plus samozřejmě kdo to má psát, celý EXT2 má pod 200K zdrojáků, ReiserFS má včetně patchů kernelu víc než mega.

References.

ReiserFS Whitepaper
Kurz G.: The ReiserFS Filesystem
Buchholz F.: The Structure of the ReiserFS Filesystem

5.2.1.16. Example: BTRFS File System

To be done.