6.3.1. File Systems

6.3.1.1. Example: Network File System

The NFS standard exists in three major versions, numbered 2, 3 and 4.

Version 2 of the NFS standard is designed to permit a stateless server implementation. A stateless server holds no client-related state; server failure and recovery are therefore transparent to connected clients. The design relies on identifying directories and files using special file handles that refer directly to server disk data rather than to data in memory. From the client perspective, an NFS file handle is an opaque array of 32 bytes; the server uses the handle to store the file system ID, the file ID (I-node number) and the generation ID (essentially an I-node version number).

const MNTPATHLEN = 1024;  /* maximum bytes in a pathname argument */
const MNTNAMLEN = 255;    /* maximum bytes in a name argument */
const FHSIZE = 32;        /* size in bytes of a file handle */

typedef opaque fhandle [FHSIZE];
typedef string name <MNTNAMLEN>;
typedef string dirpath <MNTPATHLEN>;

union fhstatus switch (unsigned fhs_status) {
  case 0:
    fhandle fhs_fhandle;
  default:
    void;
};
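
The server can pack whatever it needs into the 32 opaque bytes. A minimal sketch of one possible server-side layout follows; the structure name, the use of 32 bit unsigned integers and the padding are assumptions, since the layout is private to the server and clients must never interpret it.

struct svc_fhandle {
  unsigned fsid;          /* file system ID of the exported tree */
  unsigned fileid;        /* file ID, that is, the I-node number */
  unsigned generation;    /* I-node generation, detects reuse of I-node numbers */
  char padding [20];      /* unused, brings the total to FHSIZE bytes */
};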

The NFS standard introduces the NFS protocol and the mount protocol, both built over RPC. The purpose of the mount protocol is to provide a reference to the root of an exported directory tree; the NFS protocol provides operations upon directories and files.

The basic operation of the mount protocol is MNT, which returns the file handle for the root of an exported directory tree identified by a path. Complementary to MNT is UMNT, which tells the server that the client no longer uses a path; the server uses the two operations to track currently mounted paths, but because of the stateless design the list can become stale. The DUMP operation returns the list of paths the server believes are mounted by the client, and the EXPORT operation lists the paths of all exported directory trees.

typedef struct mountbody *mountlist;
struct mountbody {
  name ml_hostname;
  dirpath ml_directory;
  mountlist ml_next;
};

typedef struct groupnode *groups;
struct groupnode {
  name gr_name;
  groups gr_next;
};

typedef struct exportnode *exports;
struct exportnode {
  dirpath ex_dir;
  groups ex_groups;
  exports ex_next;
};

program MOUNTPROG {
  version MOUNTVERS {
    void MOUNTPROC_NULL (void) = 0;
    fhstatus MOUNTPROC_MNT (dirpath) = 1;
    mountlist MOUNTPROC_DUMP (void) = 2;
    void MOUNTPROC_UMNT (dirpath) = 3;
    void MOUNTPROC_UMNTALL (void) = 4;
    exports MOUNTPROC_EXPORT (void) = 5;
    exports MOUNTPROC_EXPORTALL (void) = 6;
  } = 1;
} = 100005;
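
As an illustration, a client might obtain the root file handle with the Sun RPC C library roughly as follows. This is only a sketch: the header name mount.h and the generated names mountproc_mnt_1 and fhstatus_u are assumptions about what rpcgen produces from the definitions above, and error handling is minimal.

#include <string.h>
#include <rpc/rpc.h>
#include "mount.h"        /* assumed rpcgen output for the definitions above */

/* Sketch: ask the server for the file handle of an exported path. */
int get_root_handle (char *host, char *path, fhandle result) {
  CLIENT *clnt = clnt_create (host, MOUNTPROG, MOUNTVERS, "udp");
  if (clnt == NULL) return (-1);

  /* The server trusts the UNIX credentials (UID and GID) supplied here. */
  clnt->cl_auth = authunix_create_default ();

  fhstatus *status = mountproc_mnt_1 (&path, clnt);
  if ((status == NULL) || (status->fhs_status != 0)) {
    clnt_destroy (clnt);
    return (-1);
  }

  memcpy (result, status->fhstatus_u.fhs_fhandle, FHSIZE);
  clnt_destroy (clnt);
  return (0);
}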

The operations of the NFS protocol resemble those of common file system interfaces, such as reading and writing files and manipulating directories. Due to the stateless design, there is no concept of opening and closing a file. Instead of the pair of open and close operations, the protocol introduces the LOOKUP operation, which accepts a file name and returns a file handle that remains valid for the lifetime of the file.

program NFS_PROGRAM {
  version NFS_VERSION {
    ...
    diropres NFSPROC_LOOKUP (diropargs) = 4;
    ...
  } = 2;
} = 100003;

struct diropargs {
  nfs_fh  dir;            /* directory file handle */
  filename name;          /* file name */
};

union diropres switch (nfsstat status) {
case NFS_OK:
  diropokres diropres;
default:
  void;
};

struct diropokres {
  nfs_fh file;
  fattr attributes;
};

struct fattr {
  ftype type;             /* file type */
  unsigned mode;          /* protection mode bits */
  unsigned nlink;         /* number of hard links */
  unsigned uid;           /* owner user id */
  unsigned gid;           /* owner group id */
  unsigned size;          /* file size in bytes */
  unsigned blocksize;     /* preferred block size */
  unsigned rdev;          /* special device number */
  unsigned blocks;        /* used size in kilobytes */
  unsigned fsid;          /* device number */
  unsigned fileid;        /* inode number */
  nfstime atime;          /* time of last access */
  nfstime mtime;          /* time of last modification */
  nfstime ctime;          /* time of last change */
};

struct nfs_fh {
  opaque data [NFS_FHSIZE];
};

enum nfsstat {
  NFS_OK=0,               /* No error */
  NFSERR_PERM=1,          /* Not owner */
  NFSERR_NOENT=2,         /* No such file or directory */
  NFSERR_IO=5,            /* I/O error */
  NFSERR_NXIO=6,          /* No such device or address */
  NFSERR_ACCES=13,        /* Permission denied */
  NFSERR_EXIST=17,        /* File exists */
  NFSERR_NODEV=19,        /* No such device */
  NFSERR_NOTDIR=20,       /* Not a directory*/
  NFSERR_ISDIR=21,        /* Is a directory */
  NFSERR_FBIG=27,         /* File too large */
  NFSERR_NOSPC=28,        /* No space left on device */
  NFSERR_ROFS=30,         /* Read-only file system */
  NFSERR_NAMETOOLONG=63,  /* File name too long */
  NFSERR_NOTEMPTY=66,     /* Directory not empty */
  NFSERR_DQUOT=69,        /* Disc quota exceeded */
  NFSERR_STALE=70,        /* Stale NFS file handle */
  NFSERR_WFLUSH=99        /* Write cache flushed */
};
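
Because LOOKUP resolves a single name within a directory, a client resolves a multi-component path by calling LOOKUP once per component, starting from the file handle returned by the mount protocol. A rough sketch follows; the header name nfs_prot.h and the generated names nfsproc_lookup_2 and diropres_u are assumptions about what rpcgen produces from the definitions above.

#include <string.h>
#include <rpc/rpc.h>
#include "nfs_prot.h"     /* assumed rpcgen output for the definitions above */

/* Sketch: resolve a path such as "a/b/c" relative to the root handle,
   issuing one LOOKUP call per path component. The path string is modified. */
int resolve_path (CLIENT *clnt, nfs_fh *root, char *path, nfs_fh *result) {
  diropargs args;
  memcpy (&args.dir, root, sizeof (nfs_fh));

  for (char *name = strtok (path, "/"); name != NULL; name = strtok (NULL, "/")) {
    args.name = name;
    diropres *res = nfsproc_lookup_2 (&args, clnt);
    if ((res == NULL) || (res->status != NFS_OK)) return (-1);
    /* The returned handle becomes the directory handle for the next component. */
    memcpy (&args.dir, &res->diropres_u.diropres.file, sizeof (nfs_fh));
  }

  memcpy (result, &args.dir, sizeof (nfs_fh));
  return (0);
}
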
program NFS_PROGRAM {
  version NFS_VERSION {
    ...
    readres NFSPROC_READ (readargs) = 6;
    ...
  } = 2;
} = 100003;

struct readargs {
  nfs_fh file;            /* handle for file */
  unsigned offset;        /* byte offset in file */
  unsigned count;         /* immediate read count */
  unsigned totalcount;    /* total read count (from this offset)*/
};

union readres switch (nfsstat status) {
case NFS_OK:
  readokres reply;
default:
  void;
};

struct readokres {
  fattr attributes;       /* attributes */
  opaque data <NFS_MAXDATA>;
};

The stateless design is not entirely transparent to clients. One place where differences from a local file system appear is permissions, which are normally tested when a file is opened. In the absence of the open and close operations, permissions are checked on each read and write instead. Furthermore, the permission checks rely on the UID and GID concepts, which need to be synchronized across the clients and the server. In the absence of a network-wide authentication mechanism, the server must trust the clients to supply correct credentials.

Among the more subtle differences, certain permission checks must be relaxed. For example, a client can have the right to execute a file without the right to read it; for the server, however, both operations imply accessing the file content. The lack of open and close operations also changes the behavior when deleting a file that is still open. Finally, the limit on the RPC argument size forces clients to split large data operations into multiple calls, which breaks their atomicity.
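
To illustrate the splitting, reading a large file requires multiple READ calls, each transferring at most NFS_MAXDATA bytes; nothing prevents another client from modifying the file between two such calls. A rough sketch follows, again assuming the rpcgen-produced header nfs_prot.h and the generated names nfsproc_read_2 and readres_u.

#include <string.h>
#include <rpc/rpc.h>
#include "nfs_prot.h"     /* assumed rpcgen output for the definitions above */

/* Sketch: read length bytes starting at offset, split into NFS_MAXDATA sized calls. */
int read_file (CLIENT *clnt, nfs_fh *file, unsigned offset, unsigned length, char *buffer) {
  unsigned done = 0;

  while (done < length) {
    readargs args;
    memcpy (&args.file, file, sizeof (nfs_fh));
    args.offset = offset + done;
    args.count = (length - done > NFS_MAXDATA) ? NFS_MAXDATA : (length - done);
    args.totalcount = length - done;

    readres *res = nfsproc_read_2 (&args, clnt);
    if ((res == NULL) || (res->status != NFS_OK)) return (-1);

    unsigned received = res->readres_u.reply.data.data_len;
    if (received == 0) break;   /* end of file reached */
    memcpy (buffer + done, res->readres_u.reply.data.data_val, received);
    done += received;
  }

  return ((int) done);
}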

Version 3 of the NFS protocol introduces the NLM protocol for managing locks, which can be used with any version of the NFS protocol. Recovery of locks after a crash is solved by introducing lease and grace periods. The server grants a lock only for a lease period. After a crash, the server enters a grace period longer than any lease period, during which it only grants lock renewals.
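
A minimal sketch of the server-side rule follows; the period lengths are assumptions, what matters is only that the grace period exceeds any lease period.

#include <stdbool.h>
#include <time.h>

#define LEASE_SECONDS 60                    /* assumed lease period length */
#define GRACE_SECONDS (2 * LEASE_SECONDS)   /* grace period longer than any lease period */

/* Sketch: after a server restart, only renewals of locks that clients already
   held are granted until the grace period expires, so that surviving clients
   can reclaim their locks before any new locks are handed out. */
bool grant_lock (time_t now, time_t server_start, bool is_renewal) {
  if (((now - server_start) < GRACE_SECONDS) && !is_renewal)
    return (false);
  return (true);   /* the lock is granted until now + LEASE_SECONDS */
}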

Version 4 of the NFS protocol abandons statelessness and integrates the mount, NFS and NLM protocols. It also introduces security, compound operations that can pass a file handle from one operation to the next, extended attributes, replication and migration, and client caching.

References. 

  1. RFC 1094: NFS: Network File System Protocol Specification

  2. RFC 1813: NFS Version 3 Protocol Specification

  3. RFC 3530: Network File System (NFS) Version 4 Protocol

6.3.1.2. Example: Server Message Block And Common Internet File System

TODO: Some description, at least from RFC and SMB & CIFS protocol.

6.3.1.3. Example: Andrew File System

The Andrew File System or AFS is a distributed file system initially developed at CMU. AFS organizes files under a global name space split into cells, where a cell is an administrative group of nodes. Servers keep subtrees of files in volumes, which can be moved across servers and replicated read-only on multiple servers, and which are listed in a volume location database replicated across the database servers.

Clients cache files; writes are propagated to the server on close or flush. A server sends file data together with a callback, a promise that the server will notify the client when the cached file data become outdated. When a write is propagated to the server, the server notifies all clients that cache the file data that their callback has been broken. Clients renew callbacks when opening files whose data were received long enough ago.
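
A minimal sketch of the server-side callback bookkeeping follows; the fixed-size array and the notification hook are assumptions made to keep the sketch short.

#define MAX_CALLBACKS 64   /* assumed fixed limit, a real server uses dynamic structures */

/* Per-file record of the clients that hold a callback promise for cached data. */
struct file_callbacks {
  int client [MAX_CALLBACKS];
  int count;
};

/* Called when file data is sent to a client together with a callback promise. */
void register_callback (struct file_callbacks *fc, int client) {
  fc->client [fc->count ++] = client;
}

/* Called when a write is propagated: every caching client except the writer
   is told that its callback has been broken, then the promises are forgotten. */
void break_callbacks (struct file_callbacks *fc, int writer, void (*notify) (int client)) {
  for (int i = 0; i < fc->count; i ++)
    if (fc->client [i] != writer)
      notify (fc->client [i]);
  fc->count = 0;
}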

AFS uses Rx, its own RPC protocol implemented over UDP, and Kerberos for authentication, with user identities that are separate from the system user identities.

6.3.1.4. Example: Coda File System

The Coda File System has a design similar to AFS, with a global name space, replicated servers and caching clients. Servers keep files in volumes, which can be moved across servers and replicated read-write on multiple servers. Files are read from one server and written to all servers. Clients check file versions on all servers and tell the servers to resolve version mismatches.

Clients work in strongly connected, weakly connected and disconnected modes. The difference between the connected and the disconnected modes is that in the connected modes, the client hoards files, while in the disconnected mode, the client uses the hoarded files. The difference between the strongly connected and the weakly connected modes is that in the strongly connected mode, writes are propagated synchronously, while in the weakly connected mode, writes are logged and reintegrated later.

Reintegration happens whenever there is a write to be reintegrated and the client is connected. Writes are reintegrated by replaying an optimized log of mutating operations. Conflicts are resolved manually.
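
A minimal sketch of the client-side replay log follows; the log layout is an assumption, and the only optimization shown is that a newer store of a file cancels an older store of the same file.

#include <string.h>

enum op_type { OP_STORE, OP_CREATE, OP_REMOVE };

struct log_entry {
  enum op_type type;
  char path [256];
  int cancelled;          /* cancelled entries are skipped during reintegration */
};

struct replay_log {
  struct log_entry entry [1024];
  int count;
};

/* Record a store of a whole file, cancelling any older store of the same file. */
void log_store (struct replay_log *log, const char *path) {
  for (int i = 0; i < log->count; i ++)
    if ((log->entry [i].type == OP_STORE) && (strcmp (log->entry [i].path, path) == 0))
      log->entry [i].cancelled = 1;

  struct log_entry *entry = &log->entry [log->count ++];
  entry->type = OP_STORE;
  entry->cancelled = 0;
  strncpy (entry->path, path, sizeof (entry->path) - 1);
  entry->path [sizeof (entry->path) - 1] = 0;
}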

6.3.1.5. Global File System

The Global File System is a distributed file system based on shared access to storage media rather than shared access to files. Conceptually, the file system uses a traditional disk layout with storage pools of blocks, bitmaps to keep track of block usage, distributed index nodes that point to lists of blocks stored in as many levels of a branching hierarchy as the file size requires, and journals to maintain metadata consistency. The distribution relies on most data structures occupying entire blocks and on a distributed block locking protocol.

GFS supports pluggable block locking protocols. Three block locking protocols currently available are:

  • DLM (Distributed Locking Manager) uses a distributed architecture with a distributed directory of migrating lock instances.

  • GULM (Grand Unified Locking Manager) uses a client server architecture with replicated servers and majority quorums.

  • NOLOCK makes it possible to completely remove locking and use GFS locally.
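
A minimal sketch of what a pluggable block locking interface might look like follows; the names are assumptions and the actual GFS module interface differs in detail.

/* Each locking protocol provides the same operations on block locks. */
struct lock_module {
  const char *name;       /* "dlm", "gulm" or "nolock" */
  int (*lock) (void *instance, unsigned long block);
  int (*unlock) (void *instance, unsigned long block);
};

/* With NOLOCK, both operations succeed immediately, which removes the
   locking overhead when the file system is used locally by a single node. */
static int nolock_lock (void *instance, unsigned long block) { return (0); }
static int nolock_unlock (void *instance, unsigned long block) { return (0); }

static struct lock_module nolock_module = { "nolock", nolock_lock, nolock_unlock };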