One of the main goals of this course is to learn to manage a Linux server. So,
we need a server to begin with! It would be quite impractical (and expensive)
if everyone were to use a “real” (physical) machine, and we will therefore be
using virtual machines instead.
Before class reading
Hardware virtualization is a complex topic, and this short reading material is
by no means a replacement for a full course (such as NSWI150). The goal is to
give you some initial context around which you can organize your
knowledge. Please note that we only describe a particular kind of
virtualization called hardware-assisted virtualization, and only as it works on
the x86 platform.
Hardware Virtualization
There are many different types of “virtualization” in computing, only some of
which are relevant to this course. For us, the most important type of
virtualization is hardware virtualization, and for now, that’s the only kind of
virtualization we care about.
The goal of hardware virtualization is to allow multiple operating systems to
execute at the same time on a single physical machine. One of these operating
systems acts as the hypervisor: it monitors the execution of the other
operating systems and intervenes when necessary to make sure that they can
share hardware safely. The physical machine on which the hypervisor runs is
called the host, while the operating systems managed by the hypervisor are
called guests.
The relationship between the hypervisor and the guests is similar to the
relationship between an operating system and the processes (running programs)
it controls—they too need to share the hardware of the machine in a safe and
secure way.
In the simplest case, the hypervisor has exclusive access to the machine’s
peripheral devices (disk controllers, network cards, etc.), while the guests
use emulated hardware. The guest operating system and the hardware it
runs on—some of which is virtualized with hardware support and some of which
is emulated in software—is called a virtual machine (VM).
In many important ways, a virtual machine mimics the behavior of a physical one
very closely:
- You can turn it on and off, but instead of pressing a physical button, you
issue a command, or click a software button.
- When a virtual machine is started, it boots up like a physical one.
- You can ssh into a running virtual server to manage it, just like you would
into a running physical server. You may not even know the machine is being
virtualized, and the machine itself may be unaware of it.
- Configuring a server to securely provide services to the outside world is a
difficult task regardless of whether the server is a physical one or a
virtual one.
Virtual machines come with many advantages:
Virtual machines are not tied to the hardware they run on. Should the host
fail or be gracefully taken down for maintenance, the VM can be moved (or
migrated) to another host with the same CPU architecture.
Virtual machines provide useful isolation. Many failures and security
incidents naturally stop at the machine boundary, and to spill over to other
machines, they require additional contributing factors, or ingenuity on the
attacker’s end. This “containment” is provided by virtual and physical machines
alike, to a large extent.
For example, consider an attacker exploiting a known PHP vulnerability and
gaining root access to your web server. If your mail server is also running on
that same machine, chances are the attacker now has access to all your mail,
and the ability to use your mail infrastructure to spam the world (and
that’s a rather benign outcome). On the other hand, when the mail server is
running on a separate (physical or virtual) server, the attacker should not
gain significant advantage from already being in control of the web server,
unless the administrator messed up.
So it’s probably better to have separate web and mail servers. This brings us
to our next point:
Virtual machines are “cheap.” Many virtual machines can run simultaneously
on the same physical hardware. By running many “smaller” machines, each
dedicated to a particular task, we can enjoy the benefits of isolation without
having to operate many physical machines solely for the sake of isolation.
This leads to substantial savings over time, since you have fewer machines
running, consuming electricity through their redundant power supplies, cooling
themselves down by warming the planet, and occupying expensive rack space. It
also means fewer people are needed whose only job it is to run around and
replace faulty components.
The Lab Environment
It should now start to make sense what the four machines 10.10.50.{7,9,10,11}
are: they are our physical hosts, and we’ll be building our Linux
infrastructure from VMs running on these four machines.
The configuration of each host is as follows:
- SuperMicro MicroBlade MBI-6128R-T2X
- 2 × Intel Xeon E5-2620 v4, 8 cores/16 threads each
- 128 GB of main memory (8 × 16 GB Micron DDR4 2667 MHz ECC)
- /dev/sda: 512 GB Seagate Constellation.2 7200 rpm 2.5" (SATA)
- /dev/sdb: 1 TB Samsung SSD 860 EVO (SATA)
- 2 × Intel X710 10GbE Ethernet controller (only one is in use)
By today’s standards, the machines are a bit dated (but more than sufficient
for our needs). Storage in particular has been evolving rapidly over the last 5
years, and NVMe SSDs are now the standard.
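If you ever want to inspect a machine’s hardware yourself, a handful of standard Linux commands will tell you most of what is listed above. This is only a generic sketch; the exact output of course depends on the machine:
$ lscpu                               # CPU model, core/thread counts, virtualization support
$ free -h                             # amount of main memory
$ lsblk -d -o NAME,SIZE,ROTA,TRAN     # disks: size, rotational or not, SATA/NVMe
$ lspci | grep -i ethernet            # network controllers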
The Cloud
This short section aims to demystify “the cloud”. Even though our VMs will be
running in our own lab and not “in the cloud”, we will be using several
technologies which make the cloud possible. Hardware virtualization is a
fundamental building block of cloud computing, so cloud deserves a passing
mention.
Cloud is what you get when you let someone else worry about the hardware and
instead of owning it yourself, you just rent it from the cloud operator (the
most popular ones being AWS (Amazon), Azure (Microsoft), Alibaba Cloud and
Google Cloud). However, instead of renting “bare-metal” physical servers, as
would be the case with traditional server hosting, you rent VMs.
Through an API call, you can ask the cloud provider to provision a VM for you.
Depending on current demand, that takes between a few seconds and a few
minutes, after which you’re provided with credentials to access the machine.
When you no longer need the VM, you can deallocate it with another API call.
Crucially, you are only charged for the period for which you hold the machine,
the usual granularity being one second.
(In case you’re wondering, yes, the cloud provider can run out of capacity in a
region. It doesn’t happen very often, but it does happen.)
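To make this concrete, here is a hedged sketch of what such API calls can look like with the AWS CLI (other providers have equivalent tooling; the image and instance IDs below are placeholders):
$ aws ec2 run-instances --image-id ami-12345678 --instance-type t3.micro --count 1
$ aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
The first command provisions a small VM from a machine image; the second deallocates it again.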
Besides compute (the common name for the product category which includes
managed VMs), the cloud provider also offers, at a minimum, storage and
networking products—you typically need some persistent storage device
connected to your virtual machine, and you usually want the machine to be
connected to your other VMs and the Internet.
Cloud computing has several defining characteristics:
- You only pay for the resources (compute, storage, networking, …) you asked
for. There is no commitment by default.
- Usually, from your perspective, the capacity of the cloud appears infinite.
It is not, of course, but unless you are a really big fish, you are unlikely to
ever exhaust it.
- Provisioning and decommissioning of resources is performed through an API,
completely automated and fast.
This has profound implications for many consumers of cloud services. The
following screenshot shows the CPU and memory usage of a video streaming
platform over a week. As you can see, there is little CPU and memory demand
during the night, as few people are using the service. The difference between
peak and off-peak usage is almost 9x:
To host such a service, you can either own many more servers than you need 80 %
of the time, or you can automatically provision and decommission resources in
the cloud depending on the current load of the platform. Not only is the latter
option typically more cost-effective, it’s also a prerequisite for scalability.
Should there be a sudden increase in the popularity of your service, you’ll
simply request additional resources from the provider. When this process is
automated, it is called autoscaling.
Since hundreds of thousands of customers are using the cloud simultaneously and
share the cloud operator’s underlying hardware, isolation is key. (Customers of
the cloud, whether they are individuals or businesses, are called tenants.)
Several key virtualization technologies are orchestrated by the cloud provider
to maintain the illusion that each tenant is running on dedicated hardware:
- Since VMs provide a good deal of isolation, a single physical server
typically hosts VMs of several tenants. The VMs are densely packed for
efficient resource utilization.
- Each tenant’s VMs are usually part of the tenant’s network, which is isolated
from the other tenants. Those networks are virtual, sharing the networking
hardware at the provider’s data center.
- Storage is, no surprise, also virtualized on top of shared storage hardware.
It’s important to realize that while it takes a few commands to create a
virtual disk and attach it to a virtual machine, it takes a person to install a
storage device into a physical server. Cloud, as described, cannot be built from
physical components alone.
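For illustration, here is a hedged sketch of those “few commands” using the AWS CLI (the availability zone and all IDs are placeholders):
$ aws ec2 create-volume --availability-zone eu-central-1a --size 20 --volume-type gp3
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf
Behind the scenes, the provider carves the virtual disk out of shared storage hardware; nobody has to walk to a rack and plug anything in.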
Please note that this is a narrowed-down definition of what the cloud really
is; however, it’s a useful one for us: many cloud services are built around the
exact same technologies we will be using in this course! And we think that’s cool.
x86 Protection Rings
Before we talk about KVM, we need to understand a bit about how x86 CPUs
separate the operating system from other programs and why that is useful.
The job of an operating system is to allow multiple processes to share the
underlying machine in a safe and secure manner. To make this possible, the
operating system—or more precisely, the kernel of the operating
system—needs to remain in control of the hardware, and the “regular” programs
must instead access the hardware by requesting service from the kernel with a
system call (syscall). The kernel also separates the user space programs
from one another. Linux currently provides several hundred different system
calls (see syscalls(2)).
This separation of privileges gives rise to the terms kernel space (where
the kernel and device drivers run) and user space (where the “regular”
programs run). User space programs are mostly limited to general-purpose
computation; to interact with the hardware or with other programs, they must
make the appropriate syscall.
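You can watch a program making system calls with strace (assuming it is installed). For example, to see only the write calls made by echo, or the file-opening calls made by cat:
$ strace -e trace=write echo hello
$ strace -e trace=openat cat /etc/hostname
Every interaction with the terminal or the filesystem shows up in the trace as a syscall.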
The isolation of the kernel from user space programs relies on features provided by
the CPU. On x86, there are four so-called protection rings, or
privilege levels. The kernel code typically runs in ring 0 and is largely
unrestricted, while the user space programs are confined to ring 3 (rings 1 and
2 are mostly unused)—the higher the ring number, the lower the privilege. This
and many other protection mechanisms ensure that the kernel stays in charge at
all times: if a user space program tried to mess with hardware or other
processes directly, the CPU would prevent that and notify the kernel to deal
with the misbehaving process.
Sometimes, ring 0 is called kernel mode or supervisor mode, and ring 3
is called user mode.
Linux, KVM and QEMU
Until now, our discussion was largely theoretical and independent of any
particular machine virtualization technology. This section outlines how Linux
can be used as a hypervisor on the x86 platform.
The cornerstone of virtualization is the isolation of the host and the guests.
The guests must not be able to interfere with the host or the other guests in
any way: they must all act as if they were separate physical machines.
Originally, x86 CPUs offered no hardware support for virtualization, and it was
extremely difficult to implement an x86 hypervisor capable of running VMs at
decent speed.
To address that issue, both Intel and AMD have gradually rolled out several
extensions to the x86 instruction set, which make it somewhat easier to
implement an efficient hypervisor:
- Intel: Intel VT-x, Extended Page Tables (EPT), VT-d
- AMD: AMD-V, Nested Page Tables (NPT), AMD-Vi
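On Linux, a quick way to check whether the CPU advertises the basic virtualization extension is to look for the vmx (Intel) or svm (AMD) flag in /proc/cpuinfo:
$ grep -E -o 'vmx|svm' /proc/cpuinfo | sort -u
$ lscpu | grep -i virtualization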
The core of the Intel VT-x extension, for example, is the addition of two new
“modes” of CPU operation, the root mode and the non-root mode. The four protection
rings of the CPU remain unchanged and are orthogonal to these new modes: the
CPU can be in root mode protection level 3, or non-root mode protection level
0. As you would probably guess, hypervisor code is executing in root mode, and
the guests are running in non-root mode. Sometimes, root mode protection level
0 is called ring -1 or hypervisor mode.
To start executing a VM, the hypervisor will switch the CPU from root mode to
non-root mode. This is called VM Entry, and it should remind you of what
the operating system does when it starts a user space process. The key feature
of the non-root mode is that privileged instructions, which could potentially
interfere with the hypervisor, switch the CPU from non-root mode back to
root mode. This is called VM Exit. The hypervisor is then provided with
detailed information about the offending instruction, so that it can handle it
in software. This is similar to what happens when a user space process attempts
to perform a privileged operation and the CPU traps into the kernel to deal
with the situation.
The other processor extensions listed above add hardware virtualization support
for further parts of the platform, such as paging (EPT/NPT) and device DMA
(VT-d/AMD-Vi), so that expensive emulation in software can be avoided.
Kernel-based Virtual Machine (KVM) is a Linux kernel module enabling
Linux to act as a hypervisor. The KVM module makes it possible to write a user
space program which uses the virtualization extensions of the underlying
hardware to run a virtual machine.
When the KVM module is loaded into the kernel, it exposes the character
device /dev/kvm to user space:
$ ls -l /dev/kvm
crw-rw---- 1 root kvm 10, 232 Aug 23 23:42 /dev/kvm
By opening this file, you obtain a file descriptor representing the KVM
subsystem of Linux. There are many ioctl calls you can issue on the file
descriptor, for example KVM_CREATE_VM, which creates a representation of a
virtual machine, or KVM_RUN, which performs VM Entry and runs VM code in
non-root mode. When the KVM_RUN ioctl returns, it means that a VM Exit occurred
and your intervention (you being the hypervisor) is required.
QEMU is one such user space program. It uses the KVM subsystem to run
virtual machines. Apart from that, it is also an excellent hardware emulator.
KVM alone can only virtualize a CPU, but to have a useful VM, we need much
more—at a minimum, we need a serial port so that we can issue commands and
read their output. Whenever the guest kernel tries to communicate with any
piece of virtual hardware, a VM Exit occurs, QEMU steps in to emulate the
device in software, and then resumes the virtualized execution.
Knowing that modern CPUs provide virtualization features, that KVM is a Linux
module exposing those features to user-space programs, and that QEMU uses KVM
to run VMs and provides emulated hardware, is enough to get you started.
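Before moving on, it’s worth a quick sanity check that KVM is actually usable on a host, roughly like this:
$ lsmod | grep kvm        # expect to see kvm plus kvm_intel or kvm_amd
$ ls -l /dev/kvm          # you need read/write access, typically via the kvm group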
If you’re interested in the hardware details, take a look at Hardware
Virtualization: the Nuts and Bolts which describes the state of x86
hardware-assisted virtualization as of 2008. Things have improved since then,
but the basic principles described in the article remain the same.
Using KVM + QEMU to Run VMs
As mentioned before, QEMU will be both our interface to the KVM subsystem, and
our device emulator. To run the VMs, we’ll use the qemu-system-x86_64
command.
As the name suggests, this creates a virtual x86-64 platform (virtual CPU and
basic hardware). The command has a lot of options which control many aspects of
the virtualization and the types of virtual hardware available to the machine.
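To give you a feel for it, a minimal invocation might look roughly like the following sketch (the disk image name, ISO path, memory size and CPU count are placeholders you would adapt):
$ qemu-img create -f qcow2 disk.qcow2 20G
$ qemu-system-x86_64 -enable-kvm -m 2G -smp 2 \
    -drive file=disk.qcow2,format=qcow2 \
    -cdrom archlinux.iso -boot d
The first command creates a 20 GiB copy-on-write disk image; the second boots the VM with KVM acceleration, 2 GiB of memory, 2 virtual CPUs, the disk attached, and the installation ISO in the virtual CD drive.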
- The first step is to take a look at qemu(1). Don’t read the entire
man page at once—with a complex command such as this, it’s much better to
skim it and get a rough idea of the options available.
- Read the Arch Wiki QEMU entry and make sure you understand the
following chapters:
I also demonstrated that you can always fix any mistakes you make by simply
booting into the Archiso, mounting filesystems, chrooting and correcting the
mistake (and unmounting and rebooting).
Note that whatever happens to your Arch Linux installation, you can repeat this
exact same repair procedure to fix the issue.
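A rough sketch of that repair procedure, assuming the root filesystem lives on /dev/vda2 and a separate boot partition on /dev/vda1 (adjust the device names and layout to your installation):
$ mount /dev/vda2 /mnt
$ mount /dev/vda1 /mnt/boot      # only if /boot is a separate partition
$ arch-chroot /mnt               # fix the mistake inside the chroot, then exit
$ umount -R /mnt
$ reboot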