One of the main goals of this course is to learn to manage a Linux server. So,
we need a server to begin with! It would be quite impractical (and expensive)
if everyone were to use a “real” (physical) machine, and we will therefore be
using virtual machines instead.
Before class reading
Hardware virtualization is a complex topic, and this short reading material is
by no means a replacement for a full course (such as NSWI150). The goal is to
give you some initial context around which you can organize your
knowledge. Please note that we only describe a particular kind of
virtualization called hardware-assisted virtualization, and only as it works on
the x86 platform.
Hardware Virtualization
There are many different types of “virtualization” in computing, only some of
which are relevant to this course. For us, the most important type of
virtualization is hardware virtualization, and for now, that’s the only kind of
virtualization we care about.
The goal of hardware virtualization is to allow multiple operating systems to
execute at the same time on a single physical machine. One of these operating
systems acts as the hypervisor: it monitors the execution of the other
operating systems and intervenes when necessary to make sure that they can
share hardware safely. The physical machine on which the hypervisor runs is
called the host, while the operating systems managed by the hypervisor are
called guests.
The relationship between the hypervisor and the guests is similar to the
relationship between an operating system and the processes (running programs)
it controls—they too need to share the hardware of the machine in a safe and
secure way.
In the simplest case, the hypervisor has exclusive access to the machine’s
peripheral devices (disk controllers, network cards, etc.), while the guests
use emulated hardware. The guest operating system and the hardware it
runs on—some of which is virtualized with hardware support and some of which
is emulated in software—is called a virtual machine (VM).
In many important ways, a virtual machine mimics the behavior of a physical one
very closely:
- You can turn it on and off, but instead of pressing a physical button, you
issue a command, or click a software button.
- When a virtual machine is started, it boots up like a physical one.
- You can ssh into a running virtual server to manage it, just like you would
into a running physical server. You may not even know the machine is being
virtualized, and the machine itself may be unaware of it.
- Configuring a server to securely provide services to the outside world is a
difficult task regardless of whether the server is a physical one or a
virtual one.
Virtual machines come with many advantages:
Virtual machines are not tied to the hardware they run on. Should the host
fail or be gracefully taken down for maintenance, the VM can be moved (or
migrated) to another host with the same CPU architecture.
Virtual machines provide useful isolation. Many failures and security
incidents naturally stop at the machine boundary, and to spill over to other
machines, they require additional contributing factors, or ingenuity on the
attacker’s end. This “containment” is provided by virtual and physical machines
alike, to a large extent.
For example, consider an attacker exploiting a known PHP vulnerability and
gaining root access to your web server. If your mail server is also running on
that same machine, chances are the attacker now has access to all your mail,
and the ability to use your mail infrastructure to spam the world (and
that’s a rather benign outcome). On the other hand, when the mail server is
running on a separate (physical or virtual) server, the attacker should not
gain significant advantage from already being in control of the web server,
unless the administrator messed up.
So it’s probably better to have separate web and mail servers. This brings us
to our next point:
Virtual machines are “cheap.” Many virtual machines can run simultaneously
on the same physical hardware. By running many “smaller” machines, each
dedicated to a particular task, we can enjoy the benefits of isolation without
having to operate many physical machines solely for the sake of isolation.
This leads to substantial savings over time, since you have fewer machines
running, consuming electricity through their redundant power supplies, cooling
themselves down by warming the planet, and occupying expensive rack space. It
also means fewer people are needed whose only job it is to run around and
replace faulty components.
The Lab Environment
It should now start to make sense what the four machines 10.10.50.{7,9,10,11}
are: they are our physical hosts, and we’ll be building our Linux
infrastructure from VMs running on these four machines.
The configuration of each host is as follows:
- SuperMicro MicroBlade MBI-6128R-T2X
- 2 × Intel Xeon E5-2620 v4, 8 cores/16 threads each
- 128 GB of main memory (8 × 16 GB Micron DDR4 2667 MHz ECC)
- /dev/sda: 512 GB Seagate Constellation.2 7200 rpm 2.5" (SATA)
- /dev/sdb: 1 TB Samsung SSD 860 EVO (SATA)
- 2 × Intel X710 10GbE Ethernet controller (only one is in use)
By today’s standards, the machines are a bit dated (but more than sufficient
for our needs). Storage in particular has been evolving rapidly over the last 5
years, and NVMe SSDs are now the standard.
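If you ever want to inspect a machine’s hardware yourself, a handful of standard Linux commands will tell you most of what is listed above. This is only a generic sketch; the exact output of course depends on the machine:
$ lscpu                               # CPU model, core/thread counts, virtualization support
$ free -h                             # amount of main memory
$ lsblk -d -o NAME,SIZE,ROTA,TRAN     # disks: size, rotational or not, SATA/NVMe
$ lspci | grep -i ethernet            # network controllers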
The Cloud
This short section aims to demystify “the cloud”. Even though our VMs will be
running in our own lab and not “in the cloud”, we will be using several
technologies which make the cloud possible. Hardware virtualization is a
fundamental building block of cloud computing, so cloud deserves a passing
mention.
Cloud is what you get when you let someone else worry about the hardware and
instead of owning it yourself, you just rent it from the cloud operator (the
most popular ones being AWS (Amazon), Azure (Microsoft), Alibaba Cloud and
Google Cloud). However, instead of renting “bare-metal” physical servers, as
would be the case with traditional server hosting, you rent VMs.
Through an API call, you can ask the cloud provider to provision a VM for you.
Depending on current demand, that takes between a few seconds and a few
minutes, after which you’re provided with credentials to access the machine.
When you no longer need the VM, you can deallocate it with another API call.
Crucially, you are only charged for the period for which you hold the machine,
the usual granularity being one second.
(In case you’re wondering, yes, the cloud provider can run out of capacity in a
region. It doesn’t happen very often, but it does happen.)
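To make this concrete, here is a hedged sketch of what such API calls can look like with the AWS CLI (other providers have equivalent tooling; the image and instance IDs below are placeholders):
$ aws ec2 run-instances --image-id ami-12345678 --instance-type t3.micro --count 1
$ aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
The first command provisions a small VM from a machine image; the second deallocates it again.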
Besides compute (the common name for the product category which includes
managed VMs), the cloud provider also offers, at a minimum, storage and
networking products—you typically need some persistent storage device
connected to your virtual machine, and you usually want the machine to be
connected to your other VMs and the Internet.
Cloud computing has several defining characteristics:
- You only pay for the resources (compute, storage, networking, …) you asked
for. There is no commitment by default.
- Usually, from your perspective, the capacity of the cloud appears infinite.
It is not, of course, but unless you are a really big fish, you are unlikely to
ever exhaust it.
- Provisioning and decommissioning of resources is performed through an API,
completely automated and fast.
This has profound implications for many consumers of cloud services. The
following screenshot shows the CPU and memory usage of a video streaming
platform over a week. As you can see, there is little CPU and memory demand
during the night, as few people are using the service. The difference between
peak and off-peak usage is almost 9x:
To host such a service, you can either own many more servers than you need 80 %
of the time, or you can automatically provision and decommission resources in
the cloud depending on the current load of the platform. Not only is the latter
option typically more cost-effective, it’s also a prerequisite for scalability.
Should there be a sudden increase in the popularity of your service, you’ll
simply request additional resources from the provider. When this process is
automated, it is called autoscaling.
Since hundreds of thousands of customers are using the cloud simultaneously and
share the cloud operator’s underlying hardware, isolation is key. (Customers of
the cloud, whether they are individuals or businesses, are called tenants.)
Several key virtualization technologies are orchestrated by the cloud provider
to maintain the illusion that each tenant is running on dedicated hardware:
- Since VMs provide a good deal of isolation, a single physical server
typically hosts VMs of several tenants. The VMs are densely packed for
efficient resource utilization.
- Each tenant’s VMs are usually part of the tenant’s network, which is isolated
from the other tenants. Those networks are virtual, sharing the networking
hardware at the provider’s data center.
- Storage is, no surprise, also virtualized on top of shared storage hardware.
It’s important to realize that while it takes a few commands to create a
virtual disk and attach it to a virtual machine, it takes a person to install a
storage device into a physical server. Cloud, as described, cannot be built from
physical components alone.
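For illustration, here is a hedged sketch of those “few commands” using the AWS CLI (the availability zone and all IDs are placeholders):
$ aws ec2 create-volume --availability-zone eu-central-1a --size 20 --volume-type gp3
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf
Behind the scenes, the provider carves the virtual disk out of shared storage hardware; nobody has to walk to a rack and plug anything in.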
Please note that this is a narrowed-down definition of what the cloud really
is; however, it’s a useful one for us: many cloud services are built around the
exact same technologies we will be using in this course! And we think that’s cool.
x86 Protection Rings
Before we talk about KVM, we need to understand a bit about how x86 CPUs
separate the operating system from other programs and why that is useful.
The job of an operating system is to allow multiple processes to share the
underlying machine in a safe and secure manner. To make this possible, the
operating system—or more precisely, the kernel of the operating
system—needs to remain in control of the hardware, and the “regular” programs
must instead access the hardware by requesting service from the kernel with a
system call (syscall). The kernel also separates the user space programs
from one another. Linux currently provides several hundred different system
calls (see syscalls(2)).
This separation of privileges gives rise to the terms kernel space (where
the kernel and device drivers run) and user space (where the “regular”
programs run). User space programs are mostly limited to general-purpose
computation; to interact with the hardware or with other programs, they must
make the appropriate syscall.
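You can watch a program making system calls with strace (assuming it is installed). For example, to see only the write calls made by echo, or the file-opening calls made by cat:
$ strace -e trace=write echo hello
$ strace -e trace=openat cat /etc/hostname
Every interaction with the terminal or the filesystem shows up in the trace as a syscall.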
The isolation of the kernel from user space programs relies on features provided by
the CPU. On x86, there are four so-called protection rings, or
privilege levels. The kernel code typically runs in ring 0 and is largely
unrestricted, while the user space programs are confined to ring 3 (rings 1 and
2 are mostly unused)—the higher the ring number, the lower the privilege. This
and many other protection mechanisms ensure that the kernel stays in charge at
all times: if a user space program tried to mess with hardware or other
processes directly, the CPU would prevent that and notify the kernel to deal
with the misbehaving process.
Sometimes, ring 0 is called kernel mode or supervisor mode, and ring 3
is called user mode.
Linux, KVM and QEMU
Until now, our discussion was largely theoretical and independent of any
particular machine virtualization technology. This section outlines how Linux
can be used as a hypervisor on the x86 platform.
The cornerstone of virtualization is the isolation of the host and the guests.
The guests must not be able to interfere with the host or the other guests in
any way: they must all act as if they were separate physical machines.
Originally, x86 CPUs offered no hardware support for virtualization, and it was
extremely difficult to implement an x86 hypervisor capable of running VMs at
decent speed.
To address that issue, both Intel and AMD have gradually rolled out several
extensions to the x86 instruction set, which make it somewhat easier to
implement an efficient hypervisor:
- Intel: Intel VT-x, Extended Page Tables (EPT), VT-d
- AMD: AMD-V, Nested Page Tables (NPT), AMD-Vi
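On Linux, a quick way to check whether the CPU advertises the basic virtualization extension is to look for the vmx (Intel) or svm (AMD) flag in /proc/cpuinfo:
$ grep -E -o 'vmx|svm' /proc/cpuinfo | sort -u
$ lscpu | grep -i virtualization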
The core of the Intel VT-x extension, for example, is the addition of two new
“modes” of CPU operation, the root mode and the non-root mode. The four protection
rings of the CPU remain unchanged and are orthogonal to these new modes: the
CPU can be in root mode protection level 3, or non-root mode protection level
0. As you would probably guess, hypervisor code is executing in root mode, and
the guests are running in non-root mode. Sometimes, root mode protection level
0 is called ring -1 or hypervisor mode.
To start executing a VM, the hypervisor will switch the CPU from root mode to
non-root mode. This is called VM Entry, and it should remind you of what
the operating system does when it starts a user space process. The key feature
of the non-root mode is that privileged instructions, which could potentially
interfere with the hypervisor, switch the CPU from non-root mode back to
root mode. This is called VM Exit. The hypervisor is then provided with
detailed information about the offending instruction, so that it can handle it
in software. This is similar to what happens when a user space process attempts
to perform a privileged operation and the CPU traps into the kernel to deal
with the situation.
The other processor extensions listed above add hardware virtualization support
for further parts of the platform, such as paging (EPT/NPT) and device DMA
(VT-d/AMD-Vi), so that expensive emulation in software can be avoided.
Kernel-based Virtual Machine (KVM) is a Linux kernel module enabling
Linux to act as a hypervisor. The KVM module makes it possible to write a user
space program which uses the virtualization extensions of the underlying
hardware to run a virtual machine.
When the KVM module is loaded into the kernel, it exposes the character
device /dev/kvm to user space:
$ ls -l /dev/kvm
crw-rw---- 1 root kvm 10, 232 Aug 23 23:42 /dev/kvm
By opening this file, you obtain a file descriptor representing the KVM
subsystem of Linux. There are many ioctl calls you can issue on the file
descriptor, for example KVM_CREATE_VM, which creates a representation of a
virtual machine, or KVM_RUN, which performs VM Entry and runs VM code in
non-root mode. When the KVM_RUN ioctl returns, it means that a VM Exit occurred
and your intervention (you being the hypervisor) is required.
QEMU is one such user space program. It uses the KVM subsystem to run
virtual machines. Apart from that, it is also an excellent hardware emulator.
KVM alone can only virtualize a CPU, but to have a useful VM, we need much
more—at a minimum, we need a serial port so that we can issue commands and
read their output. Whenever the guest kernel tries to communicate with any
piece of virtual hardware, a VM Exit occurs, QEMU steps in to emulate the
device in software, and then resumes the virtualized execution.
Knowing that modern CPUs provide virtualization features, that KVM is a Linux
module exposing those features to user-space programs, and that QEMU uses KVM
to run VMs and provides emulated hardware, is enough to get you started.
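Before moving on, it’s worth a quick sanity check that KVM is actually usable on a host, roughly like this:
$ lsmod | grep kvm        # expect to see kvm plus kvm_intel or kvm_amd
$ ls -l /dev/kvm          # you need read/write access, typically via the kvm group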
If you’re interested in the hardware details, take a look at Hardware
Virtualization: the Nuts and Bolts which describes the state of x86
hardware-assisted virtualization as of 2008. Things have improved since then,
but the basic principles described in the article remain the same.
Using KVM + QEMU to Run VMs
As mentioned before, QEMU will be both our interface to the KVM subsystem, and
our device emulator. To run the VMs, we’ll use the qemu-system-x86_64
command.
As the name suggests, this creates a virtual x86-64 platform (virtual CPU and
basic hardware). The command has a lot of options which control many aspects of
the virtualization and the types of virtual hardware available to the machine.
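To give you a feel for it, a minimal invocation might look roughly like the following sketch (the disk image name, ISO path, memory size and CPU count are placeholders you would adapt):
$ qemu-img create -f qcow2 disk.qcow2 20G
$ qemu-system-x86_64 -enable-kvm -m 2G -smp 2 \
    -drive file=disk.qcow2,format=qcow2 \
    -cdrom archlinux.iso -boot d
The first command creates a 20 GiB copy-on-write disk image; the second boots the VM with KVM acceleration, 2 GiB of memory, 2 virtual CPUs, the disk attached, and the installation ISO in the virtual CD drive.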
- The first step is to take a look at qemu(1). Don’t read the entire
man page at once—with a complex command such as this, it’s much better to
skim it and get a rough idea of the options available.
- Read the Arch Wiki QEMU entry and make sure you understand the
following chapters:
I also demonstrated that you can always fix any mistakes you make by simply
booting into the Archiso, mounting filesystems, chrooting and correcting the
mistake (and unmounting and rebooting).
Note that whatever happens to your Arch Linux installation, you can repeat this
exact same repair procedure to fix the issue.
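A rough sketch of that repair procedure, assuming the root filesystem lives on /dev/vda2 and a separate boot partition on /dev/vda1 (adjust the device names and layout to your installation):
$ mount /dev/vda2 /mnt
$ mount /dev/vda1 /mnt/boot      # only if /boot is a separate partition
$ arch-chroot /mnt               # fix the mistake inside the chroot, then exit
$ umount -R /mnt
$ reboot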