systemd

During lecture #3, we mentioned systemd very briefly, ran a few commands to start, stop and reload units, and wrote a trivial systemd unit. The goal was to give you a very rough idea of what systemd is used for and how to control it, just enough to get you started. Today, we’d like to make the description a tiny bit more nuanced.

PID 1

systemd is running as process number 1 (PID 1), a process traditionally called the init process. In a way, init is the most important process in the system, since when it dies—for example, when it hits a bug—the kernel panics. PID 1 has two jobs: to bring the system up to a usable state, and to reap zombies.

Let’s talk zombies first. In UNIX, every process has got a parent. When a process forks (becomes two processes), the original process is the parent, and the new process is the child (of that parent). When you run ps -ejfH, what you see is a listing of all processes on the machine, organized as a tree from those parent-child relationships. You should be able to see /sbin/init with PID 1 running, and you should see all user-space processes as children of PID 1.

(What you see is not a tree, it’s in fact a forest of two trees: one rooted at PID 1, and another rooted at kthreadd. We only care about the user-space subtree rooted at PID 1 for now.)

When processes die—either of natural causes, or because they are killed—interesting things happen:

  • When a parent process dies, its child processes (if any) become orphaned, and PID 1 becomes their new parent. This is called reparenting of processes.
  • When a child process dies, it does not disappear immediately. It will stay around, waiting for its parent process to collect its exit status. A dead process which lingers around is called a zombie process.

The reason zombies stay around is to give the parent a chance to capture the exit status of its child. In general, the parent cares about the exit code: for example, when you launch a command in your shell, a new process is created running the command, with your shell as the parent. The shell then waits for the command to exit, and captures the exit code. That allows you to do useful things, such as

[ 1 -eq 1 ] || printf >&2 "\e[6m\U1F631\n\e[0m"

The zombie processes stay around until somebody collects their exit code. When a process becomes orphaned, the job of collecting the exit code falls onto PID 1. Note that PID 1 does not really care about the exit code; it collects it only to make sure that the zombie process disappears from the system. This is called reaping the zombies (yes, really). If PID 1 did not collect the exit codes, the system would eventually accumulate so many zombies that there would be no PIDs left for new processes, rendering the system unusable.
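
If you want to see a zombie with your own eyes, here is a minimal sketch you can try in a shell (the trick is that after the exec, the parent of the short-lived sleep is a process which never waits for it):

~% sh -c 'sleep 2 & exec sleep 60' &
~% sleep 3; ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

After a couple of seconds, the inner sleep shows up with state Z (zombie); once the outer sleep exits, the zombie is reparented to PID 1 and promptly reaped.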

The other part, bringing the system up, is much trickier.

Bringing the system up

When the kernel boots, the first process it starts is the init—that explains why its PID is 1. The init process then goes on to do several things:

  • It performs many one-off tasks needed to make the system usable, such as mounting additional file systems, loading kernel modules or restoring sysctls.
  • It starts so-called long-running services, from the NTP daemon through OpenSSH to your display manager (which is where you log in, if you have one).

Making sure all the services are started and then keeping them running is simple in theory, but difficult in practice. One of the problems we are about to run into is starting the services in the correct order.

We are lucky enough to have an instance of this problem in our infrastructure. Suppose you wanted to manage your VMs and VDE switches with systemd, instead of starting them manually. You know you need to start the VDE switch daemons first, since that gives you the switch your VM connects to. If you started your VM first, it wouldn’t really work:

~% qemu-system-x86_64 [...] -nic vde,sock=/tmp/switch
qemu-system-x86_64: -nic vde,sock=/tmp/switch: Could not open vde: No such file or directory

Well, that makes sense. You cannot really connect to a switch which isn’t there. So the switch needs to be started first, and only then can QEMU be started. That sounds easier than it is.

The problem is that when you run vde_switch, it takes a while for the directory with the communication sockets (/tmp/switch) to appear, and then you still need to wait for the control socket (/tmp/switch/ctl) to be created. So not only does the initialization of the switch take a bit of time, it’s not atomic. So how do we know when the switch is ready, so that we can start QEMU? There are multiple solutions to this problem:

  • We could try launching the QEMU VM in a loop until it succeeds (see the sketch after this list). Since we don’t want the loop to spin too fast, we would need to add a short delay between successive attempts—say, 1 second. This isn’t great, since it’s the equivalent of polling, but it makes up for that with simplicity and robustness.
  • We could wait for /tmp/switch/ctl to appear using something like inotifywait. This is more efficient and the VM will be started faster (since there’s no artificial 1 second delay), but it’s more complex. One of the things vde_switch does is set the mode and owner of the control socket, so waiting for the socket to appear is not enough, and efficiently waiting for the mode and owner to become correct is non-trivial.
  • vde_switch could notify us (somehow) when it’s ready (this is called readiness notification). That would add a little bit of complexity to vde_switch, but would remove a lot of complexity on our end, so it’s definitely worth it. Unfortunately, vde_switch does not provide a readiness notification mechanism of any kind.
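
To make the first option concrete, here is a minimal sketch of the retry loop (the [...] stands for whatever QEMU arguments you actually use, just like in the example above):

until qemu-system-x86_64 [...] -nic vde,sock=/tmp/switch; do
    sleep 1     # the switch is not up yet; don't spin too fast
done

Note that this conflates “QEMU could not start” with “QEMU started and later exited with an error”, which is exactly the kind of corner case a proper init system handles for you.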

And then there’s one more option. Traditional UNIX daemons, including vde_switch, perform the following sequence when they start:

  • They usually start privileged (running as root) and perform privileged tasks. For example, sshd binds to port 22 (binding to ports < 1024 is a privileged operation) and vde_switch sets up the VDE communication directory (changing the owner of the socket is a privileged operation).
  • They fork twice to detach from the original parent process and have the “middle” process die. This way, they become orphans and are reparented to PID 1. In the process, many of them also relinquish their privileges and run as some low-privileged system account, to make themselves unattractive hacking targets.
  • They continue to run in this set-up, detached from the original parent process, as children of PID 1, indefinitely.

This scheme has many shortcomings (see the SysV Daemons section in daemon(7)) and is generally frowned upon by systemd and other modern init systems such as runit or s6. In the case of vde_switch, however, this brings an advantage—we can tell when the switch is ready: after it has forked.
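
This maps directly onto systemd’s Type=forking services: systemd considers such a service started once the initial process forks and the parent exits, which in the case of vde_switch is exactly the moment the switch is ready. A minimal sketch of what such a user unit could look like (the unit name, paths and options are made up for illustration; you will want something more complete for the homework):

~% cat ~/.config/systemd/user/my-switch.service
[Unit]
Description=VDE switch (sketch)

[Service]
Type=forking
ExecStart=/usr/bin/vde_switch --daemon --sock /tmp/switch

[Install]
WantedBy=default.target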

Btrfs: Snapshots and multi-device file systems

Until now, we haven’t used any of the features which make Btrfs special. Sure, copy-on-write was happening behind the scenes, but who cares? Today, that is bound to change as we explore two of the main selling points of Btrfs: snapshots and support for multi-device file systems with RAID-like characteristics.

Snapshots

Btrfs allows you to take snapshots, frozen-in-time copies of a file system. Due to the copy-on-write nature of Btrfs, taking a snapshot is an incredibly cheap and safe operation.

Snapshots don’t concern individual files, but rather subvolumes. Each subvolume is an independent Btrfs file system which you can mount. Even though you likely did not create a subvolume explicitly when you created your Btrfs file system, / in your VM is a subvolume (called the top-level subvolume) and you have it mounted as your root file system.

Subvolumes can be nested. For example, you can create a subvolume in your home directory:

~% btrfs subvolume create my-subvol
Create subvolume './my-subvol'
~% ls -lad my-subvol
drwxr-xr-x 1 d d 0 Nov  2 12:27 /home/d/my-subvol

For many practical purposes, the subvolume behaves like a regular directory. For example, you can rmdir it, or you can btrfs subvolume delete it. The similarity with directories is, however, superficial:

  • Unlike a directory, every subvolume has its own pool of inode numbers
  • You can mount a subvolume, but you cannot mount a directory
  • When you create a snapshot, nested directories are part of the snapshot, while nested subvolumes are not
  • (Read SysadminGuide/Subvolumes for details.)
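
A quick way to tell a subvolume from a plain directory is to look at its inode number: the root directory of a Btrfs subvolume always has inode number 256. A small sketch using the subvolume created above:

~% stat -c '%i %n' my-subvol
256 my-subvol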

You can create a snapshot of your entire file system with a single command:

~% sudo btrfs subvolume snapshot / my-snapshot
Create a snapshot of '/' in './my-snapshot'

my-snapshot is a new subvolume which appears identical to your top-level subvolume (i.e., your entire file system tree), but it is an independent copy. Since Btrfs is copy-on-write, no data is copied over to the snapshot, and the operation completes instantly.

The source subvolume and the snapshot share all data and metadata blocks, but they are distinct file systems. This has interesting consequences:

  • When you modify the source subvolume, the snapshot is unaffected.
  • When you modify the snapshot, the source subvolume is unaffected.
  • When you delete either the source subvolume or the snapshot, the other one remains.
  • When you delete a file from the source subvolume but not from the snapshot, no disk space is freed.
  • When some blocks on the underlying block device get corrupted, the files incident with those blocks will be corrupted in both the source subvolume and the snapshot, since the blocks are shared.
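
You can see the sharing with btrfs filesystem du, which breaks the usage of a subvolume down into exclusive and shared data. Right after taking the snapshot, virtually everything should be reported as shared (a sketch; the actual numbers depend on your file system):

~% sudo btrfs filesystem du -s my-snapshot
[...]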

Snapshots as backups

Could snapshots be used as backups? Yes, provided you understand the caveats. For example, the following is a great backup strategy:

  • Take a snapshot every hour
  • Keep only the 24 most recent snapshots (delete all older ones)

This can be implemented in a couple of lines of shell (a sketch follows the list below), and it is in fact one of the most practical backup solutions you can think of:

  • It’s dead simple
  • The snapshots consume space proportional to the amount of data changed since the oldest snapshot you maintain—in other words, if the data change little, the snapshots cost virtually nothing
  • To recover a subvolume, you can mount one of its snapshots instead
  • To recover a file, just copy it from one of the snapshots
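
For concreteness, such a script might look roughly like this (a sketch only; it assumes read-only snapshots kept under /.snapshots, named so that they sort chronologically, and it is meant to be run as root from an hourly timer or cron job):

#!/bin/sh
set -eu
dir=/.snapshots
# take a new read-only snapshot of the root subvolume
btrfs subvolume snapshot -r / "$dir/$(date +%Y-%m-%d_%H%M)"
# delete everything except the 24 newest snapshots
ls "$dir" | sort | head -n -24 | while read -r old; do
    btrfs subvolume delete "$dir/$old"
done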

However, TANSTAAFL:

  • If you only keep snapshots as backups, and if they reside on the same device as the source subvolume, then if that device gets corrupted or stolen, you lose all backups.
  • As mentioned above, the subvolume and its snapshots share data blocks, so even minor corruption will cause files incident with that block, in both the source subvolume and the snapshots, to be unreadable.
  • Over time, data does actually change and the space occupied by snapshots grows. Proper backups need to span several years to be useful, since a lot of time can pass before you realize you deleted something important, or that a file has become corrupted. It’s usually not practical to keep snapshots for more than a few weeks.
  • When your disk inevitably fills up and you hunt for disk space, you can either start deleting snapshots, or you can start deleting files from all the snapshots (it happened to me several times that I accidentally created a huge file which became part of the snapshots). Both options are clunky and dangerous.

So even though snapshots are a great local backup, you still need at least one more copy of your data stored somewhere else. Popular choices:

  • You can sync your files to another machine using rsync or Duplicity,
    • Or you can use a large-capacity hard drive stored in a secure place. Please note that having an external hard drive lying around next to your laptop—to which it is always plugged in—is extremely convenient, but it only protects against a very narrow class of mishaps.
  • You can set up DRBD,
  • You can use snapshot-based differential backups built on btrfs send and btrfs receive (see the one-liner after this list),
  • (Yes, K., you could use Windows Backup, but please don’t.)
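
The send/receive option deserves a tiny illustration. Assuming two consecutive read-only snapshots (hourly-01 and hourly-02) and a machine called backup-host with a Btrfs file system mounted at /backups (all the names are made up), an incremental transfer boils down to:

~% sudo btrfs send -p hourly-01 hourly-02 | ssh backup-host sudo btrfs receive /backups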

There are many other options, each with its own pros and cons. Usually, convenience is at odds with security. In the not-so-distant future, we’ll set up a snapshot-based backup scheme which seems to strike a good balance between being reasonably convenient and still pretty secure.

Multi-device Btrfs file systems

During the last lecture, we described RAID and several standard RAID levels. Btrfs allows you to create a file system spanning multiple drives with similar properties:

~% sudo mkfs.btrfs -d raid5 /dev/loop0 /dev/loop1 /dev/loop2
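
(In case you are wondering where /dev/loop0 through /dev/loop2 came from: they are ordinary files attached as loop devices. One way to set that up, assuming three 10 GiB sparse image files with made-up names:)

~% for i in 0 1 2; do truncate -s 10G disk$i.img; sudo losetup /dev/loop$i disk$i.img; done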

You can then mount the file system simply by mounting one of the constituent devices:

~% sudo mount /dev/loop0 mnt/raid

Say there’s a mishap:

~% sudo dd if=/dev/zero of=/dev/loop0 # oops!

You won’t be able to mount the file system the next time around. dmesg has got some details:

[262895.245473] BTRFS error (device loop1): devid 1 uuid a7b5f9f9-46d7-42d1-a117-6e28aef584a6 is missing
[262895.245485] BTRFS error (device loop1): failed to read chunk tree: -2
[262895.264983] BTRFS error (device loop1): open_ctree failed

You can however mount the file system with -o degraded:

~% sudo mount -o degraded /dev/loop1 mnt/raid
~% cd mnt/raid
~/mnt/raid% sudo btrfs filesystem show
[...]
Label: none  uuid: 86296659-9c9e-4440-9593-bbe06e03bb42
        Total devices 3 FS bytes used 144.00KiB
        devid    1 size 0 used 0 path  MISSING
        devid    2 size 10.00GiB used 1.26GiB path /dev/loop1
        devid    3 size 10.00GiB used 1.26GiB path /dev/loop2

You can then replace the missing device with a new one:

~/mnt% sudo btrfs replace start 1 /dev/loop0 raid
~/mnt% sudo btrfs replace status raid
Started on  2.Nov 14:36:04, finished on  2.Nov 14:36:04, 0 write errs, 0 uncorr. read errs

Homework

This homework has got a two-week deadline (strict):

  • Thursday 2022-11-17 9:00 Prague time for the Thursday group
  • Monday 2022-11-21 9:00 Prague time for the Thursday group (for all the tasks—extended due to delay on my part while finalizing the exercises)
  • Monday 2022-11-21 9:00 Prague time for the Monday group

Please try to get it done during the first week. As usual, if anything is unclear, don’t hesitate to ask.

Bringing up the infra with systemd

  • Right now, we’re running all our VMs in tmux. That sort of works, but is not very robust. Let’s fix that.
  • Write systemd units for your VMs, the VDE switches and everything else you need to bring your virtual infrastructure up reliably.
  • You cannot install system-wide units (that would be inherently unsafe), but systemd allows you to manage user units. Take a look at systemd/User and at the short sketch after this list. Lingering has already been enabled for your account on all hypervisors.
  • Make sure that your units are started in the correct order, and that they are restarted when they fail.
  • Hints (ingredients):
    • systemd.service(5)
    • Type=forking
    • After=
    • Requires=
  • 10 bonus points for using service templates where it makes sense.
  • Make sure you understand the edge cases of your setup.
    • What happens when some VM unit crashes?
    • What happens when you stop some of the vde_switch units?
    • When all units are stopped and you start a VM unit, is the VDE switch started as well?
    • What if the VDE switch cannot be started?
  • Don’t forget to enable relevant units.
  • If you want to describe your setup, you can do so in hw/05/00-systemd-user, but it’s not required.
  • (20+10 points)
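
A note on user units: they live under ~/.config/systemd/user and are controlled with systemctl --user. A minimal sketch of the workflow (the unit name sw1.service is made up):

~% mkdir -p ~/.config/systemd/user
~% $EDITOR ~/.config/systemd/user/sw1.service
~% systemctl --user daemon-reload
~% systemctl --user enable --now sw1.service
~% systemctl --user status sw1.service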

Set up Snapper

  • Install and configure Snapper on all your VMs to take regular snapshots of your file system
  • Make sure not to snapshot /var and /tmp (hint)
  • Make sure that your snapshot policy is reasonable and that you clean up old snapshots periodically
  • 10 bonus points for writing a quick’n’dirty shell script which sets up the other VMs over ssh (don’t log in as root though). Please submit the script as hw/05/01-snapper-setup.sh and describe how to use it. This time, don’t worry about best practices and just whip up something that works.
  • (20+10 points)

$repo/hw/05/03-ext4-dead

  • On every hypervisor you may find a new file called ~/hw/05/03-ext4.img. The file contains an ext4 file system.
  • The file system is corrupt.
  • Use /ctf from that file system as the answer in $repo/hw/05/03-ext4-dead.
  • Hint: in our setup, how do you use a file as a block device (a “drive”)?
  • (10+0 points)

$repo/hw/05/04-btrfs-raid-dead

  • On every hypervisor you may find new files called ~/hw/05/04-sd{a,b,c}.img. Those are three drives which make up a multi-device RAID 5-like file system.
  • One of the drives was accidentally wiped out by a careless sysadmin.
  • Repair the file system and use /ctf from the file system as the answer in $repo/hw/05/04-btrfs-raid-dead.
  • It’s not enough to recover /ctf; that’s rather trivial. Please do fix the file system. Use the wiped-out device as a new member of the file system.
  • (15+0 points)

$repo/hw/05/05-resizing

  • On every hypervisor you may find a new file called ~/hw/05/05-disk.img. This is an image of a block device containing two partitions:
    • p1: 400 MiB partition with a Btrfs file system
    • p2: 400 MiB partition with an ext4 file system
  • Please change the disk layout:
    • p1: grow to 500 MiB
    • p2: shrink to 300 MiB
  • Please do not create new file systems (don’t use mkfs.*). The following is not what we are after in this exercise:
    • Copy all files from p1 and p2
    • Create a new partition table
    • Create new file systems
    • Copy the files back
    • Please don’t do it this way, that’s not the point
    • In fact, there are no files in those file systems; it’s the file systems themselves (the on-disk data structures) that we care about. Just imagine this is your disk, you’re running Archiso, and the task at hand is to change the disk layout without disrupting the file systems or their contents.
  • Hints:
    • It may be a good idea to back up the disk.img, just in case you need to start over.
    • I needed to dd one of the partitions aside and dd it back later.
    • It would be wise to make sure the file systems are not corrupt after the resizing.
  • +10 bonus points if you can do this with just fdisk, dd and standard tools for Btrfs and ext4.
  • Please describe the process in $repo/hw/05/05-resizing
  • (15+10 points)

$repo/hw/05/06-sandwich

  • Pick any hypervisor and create a 2 TiB file which does not reserve any data blocks (i.e., a sparse file whose logical size is 2 TiB) at ~/hw/05/06-sandwich.img, and attach it as a block device to some Linux instance
  • LVM:
    • LVM#Background
    • Create a physical volume on that 2 TiB block device
    • Create a volume group vg0 containing that physical volume
    • Create a 1 TiB logical volume lv0 using vg0
  • On lv0, create a 20 GiB partition
  • On that 20 GiB partition, create an ext4 file system
  • In that ext4 file system, create a 1 GiB preallocated file /btrfs.img (that is, the file has logical size of 1 GiB and also reserves 1 GiB of space in the file system)
  • Attach that file as a block device to your VM (hint: losetup(8))
  • On this smaller block device, create a partition spanning the entire device
  • Create Btrfs on that 1 GiB partition
  • In the Btrfs file system, create a file /hello and write something nice to it.
  • Note that this is not a practical setup, rather, this is meant to show the immense flexibility of the Linux storage stack, and the extent to which you are already able to use it!
  • (20 points)

hw/05/09-feedback

  • If you have any valuable feedback, please do provide it here.
  • Points are only awarded for feedback which is actionable and can be used to improve the quality of the course.
  • Any constructive criticism is appreciated (and won’t be weaponized).

(Bonus) Going distributed

  • Migrate each of your VMs to a different hypervisor.
  • For networking to work, you need to make your switches distributed. Two VDE switches (sw1, sw2) need to run on every hypervisor, and the same switches on different hypervisors need to be interconnected with a tunnel through the physical network
    • Hint: to connect the switches together, see the VDE2 README and look for vde_plug
  • Use systemd user units to set everything up
  • Optionally, you can link the switches together in a redundant configuration, in which case don’t forget to enable FSTP in the management interface of all the switches. Note, I didn’t test this and FSTP support seems funky.
  • The resulting configuration will be equivalent to the /tmp/vde-backbone.sock switch we’re using
  • (0+30 points)

(Total = 100+60 points)

Don’t forget to git push all your changes! Also, make sure that your VMs still work by the deadline—otherwise we have no way of grading your work.