
In this lab we will look at two big topics. First we will look at utilities related to storage management and then we will explore how to develop Python projects in a sandboxed environment that is easily distributed among individual developers in a big software team. In between, we will squeeze in a quick note about file compression (and archival) on Linux systems.

Storage management

Before proceeding, recall that files reside on file systems that are the structures on the actual block devices (typically, disks).

Working with file systems and block devices is necessary when installing a new system, rescuing from a broken device, or simply checking available free space.

You are already familiar with normal files and directories. But there are other types of files that you can find on a Linux system.

Linux allows you to create a symbolic link to another file. This special file has no content of its own and merely points to another file.

An interesting feature of a symbolic link is that it is transparent to the standard file I/O API. If you call Python's open on a symbolic link, it will transparently open the file the symbolic link points to. That is the intended behavior.
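
A minimal sketch of this transparency in Python. It works inside a temporary directory so that nothing on your system is touched; the file names are made up for the illustration.

```python
import os
import tempfile


def demo_symlink():
    """Create a file and a symlink to it, then read the file through the link."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "movie.txt")
        link = os.path.join(tmp, "shortcut.txt")

        with open(target, "w") as f:
            f.write("actual content\n")

        # The link stores only the path of the target, not its content.
        os.symlink(target, link)

        # open() follows the link transparently.
        with open(link) as f:
            content = f.read()

        return content, os.path.islink(link), os.readlink(link) == target


if __name__ == "__main__":
    print(demo_symlink())
```

Only os.path.islink and os.readlink look at the link itself; ordinary reading and writing always reach the target file.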

The purpose of symbolic links is to allow different perspectives on the same files without the need for any copying or synchronization.

For example, a movie player is able to play only files in directory Videos. However, you actually have the movies elsewhere because they are on a shared hard drive. With the use of a symbolic link, you can make Videos a symbolic link to the actual storage and make the player happy. (For the record, we do not know about any movie player with such behaviour, but there are plenty of other programs where such magic can make them work in a complex environment they were not originally designed for.)

Note that a symbolic link is different from what you may know as a Desktop shortcut or similar. Such shortcuts are actually normal files that specify which icon to use and contain information about the actual file. Symbolic links operate on a lower level.

Special files

There are also other special files that represent physical devices or files that serve as a spy-hole into the state of the system.

The reason is that it is much simpler for the developer that way. You do not need special utilities to work with a disk, you do not need a special program to read the amount of free memory. You simply read the contents of a well-known file and you have the data.

It is also much easier to test such programs because you can easily give them mock files by changing the file paths – a change that is unlikely to introduce a serious bug into the program.

Linux usually exposes the state of the system through files in a textual format. For example, the file /proc/meminfo can look like this:

MemTotal:        7899128 kB
MemFree:          643052 kB
MemAvailable:    1441284 kB
Buffers:          140256 kB
Cached:          1868300 kB
SwapCached:            0 kB
Active:           509472 kB
Inactive:        5342572 kB
Active(anon):       5136 kB
Inactive(anon):  5015996 kB
Active(file):     504336 kB
Inactive(file):   326576 kB
...

This file is nowhere on the disk but when you open this path, Linux creates the contents on the fly.

Notice how the information is structured: it is a textual file, so reading it requires no special tools and the content is easily understood by a human. On the other hand, the structure is quite rigid: each line is a single record, keys and values are separated by a colon. Easy for machine parsing as well.
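
To illustrate how little code such a rigid format needs, here is a small sketch of a parser. The function name parse_meminfo and the sample constant are our own; the sample merely repeats a few lines from the listing above.

```python
def parse_meminfo(text):
    """Turn 'Key:   value kB' lines into a dict mapping keys to integers."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a 'key: value' record
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


SAMPLE_MEMINFO = """\
MemTotal:        7899128 kB
MemFree:          643052 kB
Active(anon):       5136 kB
"""

if __name__ == "__main__":
    print(parse_meminfo(SAMPLE_MEMINFO))
    # On a Linux machine, the real file can be parsed the same way:
    # with open("/proc/meminfo") as f:
    #     print(parse_meminfo(f.read())["MemAvailable"])
```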

File system hierarchy

We will now briefly list some of the key files you can find on virtually any Linux machine.

Do not be afraid to actually display contents of the files we mention here. hexdump -C is really a great tool.

/boot contains the bootloader for loading the operating system. You would rarely touch this directory once the system is installed.

/dev is a very special directory where hardware devices have their file counterparts. You will probably see there a file sda or nvme0 that represents your hard (or SSD) drive. Unless you are running as the superuser, you will not have access to these files, but if you hexdumped them, you would see the bytes as they are on the actual hard drive.

And writing to such files would overwrite the data on your drive!

The fact is that disk utilities in Linux accept paths to the disk drives they will operate on. Thus it is very easy to give them a file and pretend that it is a disk to be formatted. That can be used to create disk images or for file recovery. And it greatly simplifies the testing of such tools because you do not need to have a real disk for testing.

It is important to note that these files are not physical files on your disk (after all, it would mean having a disk inside a disk). When you read from them, the kernel recognizes that and returns the right data.

This directory also contains several special but very useful files for software development.

/dev/urandom returns random bytes indefinitely. It is probably internally used inside your favorite programming language to implement its random() function. Try to run hexdump on this file (and recall that <Ctrl>-C will terminate the program once you are tired of the randomness).

There is also /dev/full that emulates a full disk, /dev/null that discards everything written to it or /dev/zero that supplies an infinite stream of zero bytes.
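
The following sketch shows how these files behave when opened from Python like any other file. It assumes a Linux system where all four device files exist; the helper names are ours.

```python
import errno


def read_zeros(count):
    """Read `count` bytes from /dev/zero; they are all zero."""
    with open("/dev/zero", "rb") as f:
        return f.read(count)


def read_random(count):
    """Read `count` random bytes from /dev/urandom."""
    with open("/dev/urandom", "rb") as f:
        return f.read(count)


def write_to_null(data):
    """Write to /dev/null; the data is accepted and thrown away."""
    with open("/dev/null", "wb") as f:
        return f.write(data)


def disk_full_error():
    """Try to write to /dev/full and return the errno of the failure."""
    try:
        # buffering=0 makes the write reach the device immediately
        with open("/dev/full", "wb", buffering=0) as f:
            f.write(b"x")
    except OSError as e:
        return e.errno
    return None


if __name__ == "__main__":
    print(read_zeros(4), read_random(4), write_to_null(b"abc"))
    print(disk_full_error() == errno.ENOSPC)
```

/dev/full is handy for checking how your own program reacts to a full disk without actually filling one.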

/etc/ contains system-wide configuration. Typically, most programs in UNIX systems are configured via text files. The reasoning is that an administrator needs to learn only one tool – a good text editor – for system management. The advantage is that most configuration files have support for comments and it is possible to comment even on the configuration. For an example of such a configuration file, you can have a look at /etc/systemd/system.conf to get the feeling.

Perhaps the most important file is /etc/passwd that contains a list of user accounts. Note that it is a plain text file where each row represents one record and individual attributes are simply separated by a colon :. Very simple to read, very simple to edit, and very simple to understand. In other words, the KISS principle in practice.
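
As a sketch of how simple the format is to process, the following splits one record into its seven fields. The record for alice is made up for illustration and the type PasswdEntry is our own naming.

```python
from collections import namedtuple

# The seven colon-separated fields of a passwd record.
PasswdEntry = namedtuple(
    "PasswdEntry", ["name", "password", "uid", "gid", "comment", "home", "shell"]
)


def parse_passwd_line(line):
    """Split one passwd record into named fields."""
    name, password, uid, gid, comment, home, shell = line.rstrip("\n").split(":")
    return PasswdEntry(name, password, int(uid), int(gid), comment, home, shell)


if __name__ == "__main__":
    # A made-up record used only for illustration.
    print(parse_passwd_line("alice:x:1000:1000:Alice Example:/home/alice:/bin/bash"))
```

In real programs, the standard library module pwd is probably the more robust choice, as it also covers accounts that do not come from /etc/passwd.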

/home contains home directories for normal user accounts (i.e., accounts for real – human – users).

/lib and /usr contain dynamic libraries, applications, and system-wide data files.

/var is for volatile data. If you installed a database or a web server on your machine, its files would be stored here.

/tmp is a generic location for temporary files. This directory is automatically cleaned at each reboot, so do not use it for permanent storage. Many systems also automatically remove files which were not modified in the last few days.

/proc is a virtual file system that allows controlling and reading of kernel (operating system) settings. For example, the file /proc/meminfo contains quite detailed information about RAM usage.

Again, /proc/* are not normal files, but virtual ones. Until you read them, their contents do not exist physically anywhere.

When you open /proc/meminfo, the kernel will read its internal data structures, prepare its content (in-memory only), and give it to you. It is not that this file would be physically written every 5 seconds or so to contain the most up-to-date information.

Mounts and mount-points

Each file system (that we want to access) is accessible as a directory somewhere (compared to a drive letter in other systems, for example).

When we can access /dev/sda3 under /home, we say that /dev/sda3 is mounted under /home. /home is then called the mount point and /dev/sda3 is often called a volume.

Most devices are mounted automatically during boot. This includes / (root) where the system is as well as /home where your data reside. File systems under /dev or /proc are actually special file systems that are mounted to these locations. Hence, the file /proc/uptime does not physically exist (i.e., there is no disk block with its content anywhere on your hard drive) at all.

The file systems that are mounted during boot are listed in /etc/fstab. You will rarely need to change this file on your laptop and this file was created for you during installation. Note that it contains volume identification (such as path to the partition), the mount point and some extra options.

When you plug in a removable USB drive, your desktop environment will typically mount it automatically. Mounting it manually is also possible using the mount utility. However, mount has to be run as root to work (this thread explains several aspects why mounting a volume could be a security risk). Therefore, you need to try this on your own installation where you can become root. It will not work on any of the shared machines.

Technical note: the above text may seem contradictory, as mount requires the root password, yet your desktop environment (DE) may mount the drive automatically without asking for any password. Internally, your DE does not call mount, but it talks to daemons called Udisks and Polkit which run with root privileges. The daemons together verify that the mounted device is actually a removable one and that the user is a local one (i.e., it will not work over SSH). If these conditions are satisfied, they mount the disk for the given user. By the way, you can talk to Udisks from the shell using udisksctl.

To test manual mounting, plug in your USB device and unmount it in your GUI if it was mounted automatically (note that the device is usually mounted somewhere under /media).

Your USB drive will probably be available as /dev/sdb1 or /dev/sda1, depending on what kind of disk you have (consult the following section about lsblk to view the list of drives).

Mounting disks is not limited to physical drives only. We will talk about disk images in the next section but there are other options, too. It is possible to mount a network drive (e.g., NFS or AFS used in MFF labs) or even create a network block device and then mount it.

If you are running virtualized Linux, e.g. inside VirtualBox, mounting disks is a bit more complex. You can attach another virtual disk to it and mount it manually. Or you can create a so-called pass-through and let the virtual machine access your physical drive directly. For example, in VirtualBox, it is possible to access a physical partition of a real hard drive, but for experimenting it is probably safer to start with a USB pass-through that makes your USB pen drive available inside the guest. But always make sure that the physical device is not used by the host.

Working with disk images

Linux has built-in support for working with disk images. That is, with files with content mirroring a real disk drive. As a matter of fact, you probably already worked with them when you set up Linux in a virtual machine or when you downloaded the USB disk image at the beginning of the semester.

Linux allows you to mount such image as if it was a real physical drive and modify the files on it. That is essential for the following areas:

  • Developing and debugging file systems (rare)
  • Extracting files from virtual machine hard drives
  • Recovering data from damaged drives (rare, but priceless)

When recovering data from damaged drives, the typical approach is to try to copy the data from the drive as-is on the lowest level possible (typically, copying the raw bytes without interpreting them as a file system or actual files). Only after you have obtained the disk (mirror) image do you run the actual recovery tools on the image. That prevents further damage to the hard drive and gives you plenty of time for the actual recovery.

In all cases, to mount the disk image we need to tell the system to access the file in the same way as it accesses other block devices (recall /dev/sda1 from the example above).

Mounting disks manually

sudo mkdir /mnt/flash
sudo mount /dev/sdb1 /mnt/flash

Your data should be visible under /mnt/flash.

To unmount, run the following command:

sudo umount /mnt/flash

Note that running mount without any arguments prints a list of currently active mounts. For this, root privileges are not required.
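
The same list can also be read programmatically. The sketch below parses a mount table in the space-separated format that the kernel provides in /proc/self/mounts (device, mount point, file system type, options, and two numeric fields). The sample table, including the ext4 file system type, is an assumption made for illustration only.

```python
def parse_mounts(text):
    """Return (device, mount_point, fs_type) for each line of a mount table."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[0], fields[1], fields[2]))
    return mounts


# Illustrative sample only; your machine will list different entries.
SAMPLE_MOUNTS = """\
/dev/sda3 / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
/dev/sda1 /boot ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    for device, mount_point, fs_type in parse_mounts(SAMPLE_MOUNTS):
        print(f"{device} on {mount_point} type {fs_type}")
    # On Linux, the live table can be read without root privileges:
    # with open("/proc/self/mounts") as f:
    #     print(parse_mounts(f.read()))
```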

Specifying volumes

So far, we always used the name of the block device (e.g., /dev/sdb1) to specify the volume. While this is trivial on small systems, it can be incredibly confusing on larger ones – device names depend on the order in which the system discovered the disks. This order can vary between boots and it is even less stable with removable drives. You do not want to let a randomly connected USB flash disk render your machine non-bootable :-).

A more stable way is to refer to block devices using symlinks named after the physical location in the system. For example, /dev/disk/by-path/pci-0000:03:00.1-ata-6-part1 refers to partition 1 of a disk connected to port 6 of a SATA controller which resides as device 00.1 on PCI bus 0000:03.

In most cases, it is even better to describe the partition by its contents. Most filesystems have a UUID (universally unique identifier, a 128-bit number, usually randomly generated) and often also a disk label (a short textual name). You can run lsblk -f to view UUIDs and labels of all partitions and then call mount with UUID=number or LABEL=name instead of the block device name. Your /etc/fstab will likely refer to your volumes in one of these ways.
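
Besides lsblk -f, most systems expose the same mapping as symlinks in /dev/disk/by-uuid, next to the by-path directory mentioned above. The following sketch lists them; the function name is ours, and the directory may well be missing, for example inside a container, in which case the result is simply empty.

```python
import os


def list_volume_ids(directory="/dev/disk/by-uuid"):
    """Map each identifier in `directory` to the device its symlink resolves to."""
    result = {}
    if not os.path.isdir(directory):
        return result  # e.g. inside a container without disk symlinks
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            result[name] = os.path.realpath(path)
    return result


if __name__ == "__main__":
    for uuid, device in list_volume_ids().items():
        print(f"UUID={uuid} -> {device}")
```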

Mounting disk images

Disk images can be mounted in almost the same way as block devices; you only have to add the -o loop option to mount.

Recall that mount requires root (sudo) privileges; hence you need to execute the following example on your own machine, not on any of the shared ones.

To try that, you can download this FAT image and mount it.

sudo mkdir /mnt/photos-fat
sudo mount -o loop photos.fat.img /mnt/photos-fat
... (work with files in /mnt/photos-fat)
sudo umount /mnt/photos-fat

Alternatively, you can run udisksctl loop-setup to add the disk image as a removable drive that could be automatically mounted in your desktop:

# Using udisksctl and auto-mounting in GUI
udisksctl loop-setup -f fat.img
# This will probably print /dev/loop0 but it can have a different number
# Now mount it in GUI (might happen completely automatically)
... (work with files in /run/media/$(whoami)/07C5-2DF8/)
udisksctl loop-delete -b /dev/loop0

Disk space usage utilities

The basic utility for checking available disk space is df (disk free).

Filesystem     1K-blocks    Used Available Use% Mounted on
devtmpfs         8174828       0   8174828   0% /dev
tmpfs            8193016       0   8193016   0% /dev/shm
tmpfs            3277208    1060   3276148   1% /run
/dev/sda3      494006272 7202800 484986880   2% /
tmpfs            8193020       4   8193016   1% /tmp
/dev/sda1        1038336  243188    795148  24% /boot

In the default execution (above), it uses one-kilobyte blocks. For a more readable output, run it with -BM or -BG (megabytes and gigabytes) or with -h to let it select the most suitable unit.
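
If you need similar numbers inside a program, Python's standard library offers shutil.disk_usage. The sketch below computes a percentage that is roughly comparable to the Use% column; df may round differently and accounts for space reserved for root, so small differences are to be expected. The function name is ours.

```python
import shutil


def usage_percent(path="/"):
    """Return the used share of the file system holding `path`, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100 * usage.used / usage.total


if __name__ == "__main__":
    print(f"/ is {usage_percent('/'):.1f}% full")
```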

Do not confuse df with du which can be used to estimate file space usage. Typically, you would run du as du -sh DIR to print total space occupied by all files in DIR. You could use du -sh ~/* to print summaries for top-level directories in your $HOME. But be careful as it can take quite some time to scan everything.

Also, you can observe that the space usage reported by du is not equal to the sum of all file sizes. This happens because files are organized in blocks, so file sizes are typically rounded to a multiple of the block size. Besides that, directories also consume some space.
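
The difference between the size of a file and the space it occupies can be observed directly. This sketch, assuming Linux where st_blocks is counted in 512-byte units, creates a one-byte file and reports both numbers. The allocated value depends on the file system and may even be reported as 0 until the data is flushed to disk.

```python
import os
import tempfile


def size_and_allocation(path):
    """Return (apparent size, allocated bytes) of a file."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux,
    # regardless of the block size of the file system.
    return st.st_size, st.st_blocks * 512


def demo():
    """Create a one-byte file in a temporary directory and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "one-byte.txt")
        with open(path, "w") as f:
            f.write("x")
        return size_and_allocation(path)


if __name__ == "__main__":
    size, allocated = demo()
    print(f"size: {size} B, allocated: {allocated} B")
```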

To see how volumes (partitions) are nested and which block devices are recognized by your kernel, you can use lsblk. On the shared machine, the following will appear:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   480G  0 disk
├─sda1   8:1    0     1G  0 part /boot
├─sda2   8:2    0   7.9G  0 part [SWAP]
└─sda3   8:3    0 471.1G  0 part /

This shows that the machine has a 480G disk divided into three partitions: a tiny /boot for bootstrapping the system, an 8G swap partition, and finally 470G left for system and user data. We are not using a separate volume for /home.

You can find many other output formats in the man page.

Inspecting and modifying volumes (partitions)

We will leave this topic to a more advanced course. If you wish to learn by yourself, you can start with the following utilities:

  • fdisk(8)
  • btrfs(8)
  • mdadm(8)
  • lvm(8)

File archiving and compression

A somewhat related topic to the above is how Linux handles file archival and compression.

Archiving on Linux systems typically refers to merging multiple files into one (for easier transfer) and compression of this file (to save space). Sometimes, only the first step (i.e., merging) is considered archiving.

While these two actions are usually performed together, Linux keeps the distinction as it allows combination of the right tools and formats for each part of the job. Note that on other systems where the ZIP file is the preferred format, these actions are blended into one.

The most widely used program for archiving is tar. Originally, its primary purpose was archiving on tapes, hence the name: tape archiver. It is always run with an option specifying the mode of operation:

  • -c to create a new archive from existing files,
  • -x to extract files from the archive,
  • -t to print the table of files inside the archive.

The name of the archive is given via the -f option; if no name is specified, the archive is read from standard input or written to standard output.

As usual, the -v option increases verbosity. For example, tar -cv prints the names of files added to the archive, and tar -cvv also prints file attributes (like ls -l). (Everything is printed to stderr, so that stdout can still be used for the archive.) Plain tar -t prints only file names, while tar -tv also prints file attributes.

An uncompressed archive can be created this way:

tar -cf archive.tar dir_to_archive/

A compressed archive can be created by piping the output of tar to gzip:

tar -c dir_to_archive/ | gzip >archive.tar.gz

As this is very frequent, tar supports a -z switch, which automatically calls gzip, so that you can write:

tar -czf archive.tar.gz dir_to_archive/

tar has further switches for other (de)compression programs: bzip2, xz, etc. Most importantly, the -a switch chooses the (de)compression program according to the name of the archive file.

If you want to compress a single file, plain gzip without tar is often used. Some tools or APIs can even process gzip-compressed files transparently.

To unpack an archive, you can again pipe gzip -d (decompress) to tar, or use -z as follows:

tar -xzf archive.tar.gz

Like many other file-system related programs, tar will overwrite existing files without any warning.
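
The same operations are available from Python through the standard tarfile module. The following sketch creates a compressed archive of a small directory, lists it, and reads one member back. Everything happens in a temporary directory, and the file hello.txt is made up for the illustration.

```python
import os
import tarfile
import tempfile


def roundtrip():
    """Pack a directory into a .tar.gz, list it, and read one file back."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "dir_to_archive")
        os.mkdir(src)
        with open(os.path.join(src, "hello.txt"), "w") as f:
            f.write("hello\n")

        archive = os.path.join(tmp, "archive.tar.gz")

        # Roughly: tar -czf archive.tar.gz dir_to_archive/
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src, arcname="dir_to_archive")

        with tarfile.open(archive, "r:gz") as tar:
            # Roughly: tar -tzf archive.tar.gz
            names = tar.getnames()
            # Read a single member without unpacking anything to disk.
            content = tar.extractfile("dir_to_archive/hello.txt").read()

        return names, content


if __name__ == "__main__":
    print(roundtrip())
```

Reading a member with extractfile avoids writing to disk at all, so nothing can be overwritten by accident.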

We recommend installing atool as a generic wrapper around tar, gzip, unzip and plenty of other utilities to simplify working with archives. For example:

apack archive.tar.gz dir_to_archive/
aunpack archive.tar.gz

Note that atool will not overwrite existing files by default (which is another very good reason for using it).

It is good practice to always archive a single directory. That way, the user who unpacks your archive will not have your files scattered in the current directory but neatly prepared in a single new directory.

To view the list of files inside an archive, you can execute als.

Sandboxed software development

During the previous lab, we showed that the preferred way of installing applications (and libraries and data files) on Linux is via the package manager. It installs the application for all users, it allows system-wide upgrades, and it generally keeps your system in a much cleaner state.

However, system-wide installation may not always be suitable. One typical example is project-specific dependencies. These are often not installed system-wide, mainly for the following reasons:

  • You need different versions of dependencies for different projects.
  • You do not want to remember to uninstall them when you stop working on the project.
  • You want to control when you upgrade them: an upgrade of the OS should not affect your project.
  • The versions you need are different from those available through the package manager.
  • Or they may not be packaged at all.

For the above reasons, it is much better to create a project-specific installation that is better isolated from the system. Note that installing the dependency per-user (i.e., somewhere into $HOME) may not provide the isolation you wish to achieve.

Such an approach is supported by most reasonable programming languages and can usually be found under names such as virtual environment, local repository, sandbox or similar (note that the concepts do not map 1:1 across languages and tools, but the general idea remains the same).

With a virtual environment, your dependencies are usually installed into a specific directory inside your project, kept outside version control. The compiler/interpreter is then told to use this location.

The directory-local installation then keeps your system clean. It also allows working on multiple projects with incompatible dependencies, because they are completely isolated.

The installation directory is rarely committed to your Git repository. Instead, you commit a configuration file that specifies how to prepare the environment.

Each developer can then recreate the environment without polluting the main repository with distribution-specific or even OS-dependent files. Yet the configuration file ensures that all developers will be working in the same environment (i.e., same versions of all the dependencies).

It also means that new members of software teams can easily set up their environment using the provided configuration file.

Dependency installation

Inside the virtual environment, projects usually do not use generic package managers (such as DNF). Instead, they install dependencies using language-specific package managers.

These are usually cross-platform and use their own software repository. Such a repository then hosts only libraries for that particular language. Again, there can be multiple such repositories and it is up to the developers how they configure their projects.

Technically, language-specific package managers can also install the packages system-wide, competing with distribution-specific package managers. It is up to the administrator to handle this reasonably. This usually involves defining a clear boundary between areas maintained by the distribution-specific manager and those maintained by the language-specific ones.

In our scenario, the language-specific managers would install only into the virtual environment directory without ever touching the system itself.

Installation directories

On a typical Linux system, there are multiple places where software can be installed:

  • /usr – system packages handled by the distribution’s package manager
  • /usr/local – software installed locally by the administrator; language-specific managers usually install system-wide packages there
  • /opt/$PACKAGE – large packages installed outside distribution’s package manager often live in their own sub-directory inside /opt.
  • $HOME (usually /home/$USER/) – language-specific managers run by non-root users can install packages locally to their home directory (to language-specific sub-directories).
  • $HOME/.local is a favourite place for local installation that generally mirrors /usr/local but for a single user only (executables are then placed inside $HOME/.local/bin)
  • per-project virtual environments

Python Package Index (PyPI)

The rest of the text will focus mostly on Python tools supporting the above-mentioned principles. Similar tools are available for other languages, but we believe that demonstrating them on Python is sufficient to understand the principles in practice.

Python has a repository called the Python Package Index (PyPI) where anyone can publish their Python programs and/or libraries.

The repository can be used through a web browser, but also through a command-line client called pip.

pip behaves rather similarly to DNF. You can use it to install, upgrade, or uninstall Python modules.

When run with superuser privileges, it is able to install packages system-wide. Do not use it like that unless you know what you are doing and you understand the consequences.

Issues of trust

In your distribution’s upstream package repository, all packages typically have to be reviewed by someone from the distribution’s security team. This is sadly not true for PyPI or similar repositories. That said, you as a developer must be more cautious when installing from such sources.

Not all packages do what they claim to. Some are just innocently buggy, but some are outright malicious. Re-using other people’s code is generally a good practice, but you should give a thought to the trustworthiness of the author. After all, the code will be executed under your account either when you run your program or as a part of the installation process.

In particular, criminals like to publish malicious packages, whose name differs from a well-known package by a single typo. This is called typosquatting. You might read more for example in this blogpost, but searching the web will yield more results.
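
To get a feeling for how close such names can be, the following sketch flags a requested package name that closely resembles, but does not equal, a well-known one. This is only an illustration of the idea, not a real protection: the list of known packages and the similarity cutoff are arbitrary assumptions.

```python
import difflib

# A tiny, hypothetical list; a real project would use its own dependency list.
KNOWN_PACKAGES = ["requests", "numpy", "pandas"]


def looks_like_typo(name, known=KNOWN_PACKAGES, cutoff=0.8):
    """Return the well-known name that `name` closely resembles, or None."""
    if name in known:
        return None  # exact match, nothing suspicious
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for candidate in ["requests", "reqeusts", "flask"]:
        print(candidate, "->", looks_like_typo(candidate))
```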

On the other hand, many PyPI packages are also available as packages for your distribution (feel free to try dnf search python3- on your Fedora box). Hence they probably were reviewed by distribution maintainers and are probably safe to use. For packages not available for your distribution natively, always look for tell-tales of normal vs malicious project. Popularity of the source code repository. User activity. Reactions to bug reports. Documentation quality. Etc. etc.

Recall that modern software is rarely built from scratch. Do not be afraid to explore what is available. Check it. And use it :-).

Typical workflow in practice

While the actual tools differ across programming languages, the general steps for developing a project in some kind of a sandbox are the same.

  1. The developer clones the project (e.g., from a Git repository).
  2. The sandbox (virtual environment) is initialized. Usually this means that a new directory with a fresh language environment is created.
  3. The virtual environment must be activated. Often the virtual environment needs to modify $PATH (or rather some language-specific variant of such path that is used to search for libraries or modules), so the developer must source (or .) some activation script that modifies the path.
  4. Then the developer can install dependencies of the project. They are usually stored in a file that can be passed to the package manager (of the given programming language).
  5. Only now the developer can actually work on the project. The project is fully isolated, removing the virtual environment directory removes all traces of the installed packages.

Everyday work then often involves only step 3 (some kind of activation) and step 5 (actual development).

Note that activation of the virtual environment typically removes access to libraries installed globally. That is, inside the virtual environment, the developer starts with a fresh and clean environment with a bare compiler. That is actually a very sane decision as it ensures that system-wide installation does not affect the project-specific environment.

In other words, it improves on reproducibility of the whole setup. It also means that the developer needs to specify every dependency into the configuration file even if the dependency can be considered as one of those that are usually present everywhere.

Virtual environment for Python (a.k.a. virtualenv or venv)

To try installing Python packages safely, we will first set up a virtual environment for our project. Fortunately, Python has built-in support for creating a virtual environment.

We will demonstrate this on the following example:

#!/usr/bin/env python3

import argparse
import shutil
import sys

import fs

class FsCatException(Exception):
    pass

def fs_cat(filesystem, filename, target):
    try:
        with fs.open_fs(filesystem) as my_fs:
            try:
                with my_fs.open(filename, 'rb') as my_file:
                    shutil.copyfileobj(my_file, target)
            except fs.errors.FileExpected as e:
                raise FsCatException(f"{filename} on {filesystem} is not a regular file") from e
            except fs.errors.ResourceNotFound as e:
                raise FsCatException(f"{filename} does not exist on {filesystem}") from e
    except FsCatException:
        # Already a user-facing error raised by the inner block: pass it through unchanged.
        raise
    except Exception as e:
        raise FsCatException(f"unable to read {filesystem}, perhaps misspelled path or protocol ({e})?") from e


def main():
    args = argparse.ArgumentParser(description='Filesystem cat')
    args.add_argument(
        'filesystem',
        nargs=1,
        metavar='FILESYSTEM',
        help='Filesystem specification, e.g. tar://path/to/file.tar'
    )
    args.add_argument(
        'filename',
        nargs=1,
        metavar='FILENAME',
        help='File path on FILESYSTEM, e.g. /README.md'
    )

    config = args.parse_args()

    try:
        fs_cat(config.filesystem[0], config.filename[0], sys.stdout.buffer)
    except FsCatException as e:
        print(f"Fatal: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()

Save this snippet into fscat.py and set the executable bit. Note that fs.open_fs is able to open various filesystems and access files on them as if you were using Python's built-in open.

In our program, we provide path to a filesystem and a file (residing on this filesystem) to print to the screen (hence the name, fscat as it simulates cat inside a different filesystem).

Make sure you understand the whole program before continuing.

Try running the fscat.py program.

Unless you have already installed the python3-fs package system-wide, it should fail with ModuleNotFoundError: No module named 'fs'. The chances are that you do not have that module installed.

If you have installed the python3-fs, uninstall it now and try again (just for this demo). But double-check that you would not remove some other program that may require it.

We could now install the python3-fs with DNF but we already described why that is a bad idea. We could also install it with pip globally but that is not the best course of action either.

Instead, we will create a new virtual environment for it.

python3 -m venv my-venv

The above command creates a new directory my-venv that contains a bare installation of Python. Feel free to investigate the contents of this directory.

We now need to activate the environment.

source my-venv/bin/activate

Your prompt should have changed: it is prefixed by (my-venv) now.

Running fscat.py will still terminate with ModuleNotFoundError.

We will now install the dependency:

pip install fs

This will take some time as Python will also download transitive dependencies of this library (and their dependencies etc.). Once the installation finishes, run fscat.py again.

This time, it should work.

./fscat.py

Okay, it printed an error message about required arguments. Download this tarball and run the script as follows:

./fscat.py tar://test.tar.gz testdir/test.txt

It should print Test string as it is able to even handle tarballs as filesystems and print files on them (verify that the file is really there using either atool, MC or using tar directly).

Once we are finished with the development, we can deactivate the environment by calling deactivate (this time, without sourcing anything).

Running fscat.py outside the environment should again terminate with ModuleNotFoundError.

How does it work?

A Python virtual environment relies on two tricks in its implementation.

First, the activate script extends $PATH with the my-venv/bin directory. That means that calling python3 will prefer the application from the virtualenv’s directory (e.g. my-venv/bin/python3).

Try this yourself: print $PATH before and after you activate a virtualenv.

This also explains why we should always specify /usr/bin/env python3 in the shebang instead of /usr/bin/python3. env will consult $PATH that was modified by the activation of the virtualenv.
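The effect of prepending a directory to $PATH can be simulated without touching your real environment. In the sketch below (POSIX systems assumed), a temporary directory stands in for my-venv/bin and contains a dummy python3 script; both are made up for the demonstration. shutil.which searches the given path list from left to right, much like the shell and env do.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path

# A made-up directory standing in for my-venv/bin, with a dummy "python3".
fake_bin = Path(tempfile.mkdtemp()) / "bin"
fake_bin.mkdir()
fake_python = fake_bin / "python3"
fake_python.write_text("#!/bin/sh\necho 'I am the sandboxed interpreter'\n")
fake_python.chmod(fake_python.stat().st_mode | stat.S_IXUSR)

original_path = os.environ.get("PATH", os.defpath)
# This is the essence of activation: put the sandbox directory first.
activated_path = str(fake_bin) + os.pathsep + original_path

print("before:", shutil.which("python3", path=original_path))
print("after: ", shutil.which("python3", path=activated_path))
```

The second line should point into the temporary directory, because the first match on the path wins.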

You can also view the activate script and see how this is implemented. Note that deactivate is actually a function.

Why is the activate script not executable? Hint.

The second trick is that Python searches for modules (i.e., for files implementing an imported module) relative to the path of the python3 binary. Hence, when python3 is inside my-venv/bin, Python will look for the modules inside my-venv/lib. That is the location where your locally installed files will be placed.

You can check this by executing the following one-liner that prints Python search directories (again, before and after activation):

python3 -c 'import sys; print(sys.path)'

This behaviour is actually not hard-wired in the Python interpreter. When Python starts up, it automatically imports a module called site. This module contains site-specific setup: it adjusts sys.path to include all directories where your distribution installs Python modules. It also detects virtual environments by looking for the pyvenv.cfg file in the grandparent directory of the python3 binary. In our case, this configuration file contains include-system-site-packages=false, which tells the site module to skip distribution’s module directories. You can see that the principle is very simple and the interpreter itself needs to know nothing about virtual environments.
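The result of this start-up logic is visible from inside a running program. A minimal sketch (the function name in_virtualenv is ours): in an environment created by venv, sys.prefix points to the environment directory, while sys.base_prefix keeps pointing to the underlying Python installation.

```python
import sys
from pathlib import Path


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # Outside of any environment both prefixes are identical.
    return sys.prefix != sys.base_prefix


print("interpreter:  ", sys.executable)
print("sys.prefix:   ", sys.prefix)
print("base prefix:  ", sys.base_prefix)
print("inside a venv:", in_virtualenv())
print("pyvenv.cfg:   ", (Path(sys.prefix) / "pyvenv.cfg").exists())
```

Run it once with the environment activated and once after deactivate and compare the output.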

Installing Python-specific packages with pip

pip VS. python3 -m pip?

Generally, it is recommended to use python3 -m pip rather than raw pip. The reasons behind these extra keystrokes are well described in Why you should use python3 -m pip. However, in order to make the following text more readable, we will use the shorter pip variant.

We have already seen one usage of pip in practice, but pip can do much more. The nice walkthrough over all pip capabilities can be found in Using Python’s pip to Manage Your Projects’ Dependencies.

Here we provide a brief summary of the most important concepts and commands.

By default, pip install searches the PyPI package registry for the package named on the command line. It is not far from the truth to say that all packages in this registry are just archived directories containing Python source code organized in a prescribed way.

If you would like to change this default package registry, you can use the --index-url argument.

As you are already familiar with GitLab, you could be interested in GitLab PyPI Package Registry Support.

In a later section, we will learn how to turn a directory with code into a proper Python package. Assuming that we have already done it, we can install that package directly (without archiving/packing) by running pip install /path/to/python_package.

For example, imagine a situation where you are interested in a third-party open-source package. This package is available in a remote git repository (typically on GitHub or GitLab), but it is NOT packed and published in PyPI. You can simply clone the repository and run pip install . inside the clone. However, thanks to pip VCS Support, you can skip the cloning step and install the package directly with:

pip install git+https://git.example.com/MyProject

To upgrade a specific package, run pip install --upgrade [packages].

Finally, to remove a package, run pip uninstall [packages].

Dependency versioning

You might have heard about semantic versioning. Python uses a more or less compatible versioning, which is described in PEP 440 – Version Identification and Dependency Specification.

When you install dependencies from the package registry, you can specify which versions are acceptable.

pkgname          # latest version
pkgname == 4.2   # specific version
pkgname >= 4.2   # minimal version
pkgname ~= 4.2   # equivalent to >= 4.2, == 4.*

In fact, a version specifier consists of a series of version clauses separated by commas. Therefore you can type:

pkgname >= 1.0, != 1.3.4.*, < 2.0
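To make the meaning of the clauses concrete, here is a deliberately simplified matcher written for this text. It only understands plain numeric versions such as 1.3.4 and ignores pre-releases, epochs, local versions and the other details of PEP 440, so treat it as an illustration of the rules above rather than a replacement for pip's own logic (real tools rely on the packaging library for this). All function names are ours.

```python
import re

# operator, dotted numeric version, optional ".*" suffix
CLAUSE = re.compile(r"\s*(~=|==|!=|>=|<=|<|>)\s*(\d+(?:\.\d+)*)(\.\*)?\s*")


def parse(version):
    """Turn '1.3.4' into (1, 3, 4); only plain numeric releases are handled."""
    return tuple(int(part) for part in version.strip().split("."))


def pad(a, b):
    """Pad the shorter tuple with zeros so that 4.2 and 4.2.0 compare equal."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def clause_matches(version, clause):
    """Check a single clause such as '>= 1.0' or '!= 1.3.4.*'."""
    found = CLAUSE.fullmatch(clause)
    if found is None:
        raise ValueError(f"unsupported clause: {clause!r}")
    op, wanted, wildcard = found.groups()
    want = parse(wanted)
    have, want_padded = pad(parse(version), want)
    if wildcard:                              # prefix match, e.g. == 4.*
        if op not in ("==", "!="):
            raise ValueError("'.*' is only valid with == and !=")
        is_prefix = have[: len(want)] == want
        return is_prefix if op == "==" else not is_prefix
    if op == "~=":                            # ~= 4.2 means >= 4.2, == 4.*
        if len(want) < 2:
            raise ValueError("~= needs at least two version segments")
        return have >= want_padded and have[: len(want) - 1] == want[:-1]
    return {
        "==": have == want_padded, "!=": have != want_padded,
        ">=": have >= want_padded, "<=": have <= want_padded,
        "<": have < want_padded, ">": have > want_padded,
    }[op]


def matches(version, specifier):
    """A specifier holds only if all of its comma-separated clauses hold."""
    return all(clause_matches(version, c) for c in specifier.split(","))


for candidate in ("0.9", "1.3.4.2", "1.5", "2.0"):
    print(candidate, matches(candidate, ">= 1.0, != 1.3.4.*, < 2.0"))
```

Of the four candidates, only 1.5 satisfies all three clauses at once.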

Sometimes it is helpful to save a list of all currently installed packages (including transitive dependencies). For example, you have recently noticed a new bug in your project and you would like to keep a record of the precise versions of the currently installed dependencies, so that your co-worker can reproduce the bug.

To do that, you can use pip freeze to create a list that pins specific versions, ensuring the same environment for every developer.

It is recommended to store this list in a requirements.txt file.

# Generating requirements file
pip freeze > requirements.txt

# Installing packages from it
pip install -r requirements.txt
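The information that ends up in such a file is available to any Python program through the standard importlib.metadata module. The following sketch prints a similar list of name==version lines; it is only an approximation written for illustration (the function name frozen_requirements is ours), as pip freeze applies extra rules, for example for editable installations.

```python
from importlib import metadata


def frozen_requirements():
    """Return 'name==version' lines for every distribution Python can see."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:                         # skip entries with broken metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


print("\n".join(frozen_requirements()))
```

Inside a fresh virtual environment, the list is short; it grows with every pip install.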

Packaging Python Projects

Let’s say that you come up with a super cool algorithm and you want to enrich the world by sharing it. The official Python documentation offers a step-by-step tutorial on how to do that.

In the following text, we are going to use setuptools for building Python projects. Historically, this was the only way to build a Python package. Recently, Python developers decided to open the gates to alternatives, and so you may also build a Python package with Poetry, flit or others. The description of these tools is out of the scope of this course.

Python Package Directory Structure

The very first step, before you can publish your code, is to transform it into a proper Python package. We need to create files called pyproject.toml and setup.cfg. These files contain information about the project, a list of dependencies, and also information for project installation.

Not long ago, it was usual to have a setup.py script rather than setup.cfg and pyproject.toml, so you can still find it in many repositories and tutorials. The content is more or less 1:1, but there are certain cases in which you are forced to use setup.py. Fortunately, this does not apply to our use case, and so we have decided to describe the modern variant with static configuration files.
As the setuptools Quickstart explains, since version 61.0.0 setuptools offers experimental support for having only a pyproject.toml. This approach is also used by Poetry, but in the following text, we will stay with the stable combination of setup.cfg and pyproject.toml.

In fscat, you can find a Python package with the same functionality as our previous fscat.py script.

Please study carefully the directory structure as well as the content of setup.cfg.

One may notice that the necessary dependencies are duplicated in setup.cfg and in requirements.txt. This is not a mistake. In setup.cfg, you should use the most relaxed version constraints possible, whereas in requirements.txt we need to pin every dependency to a precise version. requirements.txt also lists the transitive dependencies, which should NOT be present in setup.cfg.

For more details, see install_requires vs requirements file.

Try to install this package through the VCS support with the following command:

pip install git+http://gitlab.mff.cuni.cz/teaching/nswi177/2023/common/fscat.git

You perhaps noticed that the setup.cfg file contains the section [options.entry_points]. This section specifies what the actual scripts of your project are. Note that after running the above command, you can execute the fscat command directly. Pip created a wrapper script for you and placed it in the bin directory of the virtual environment, which is on your $PATH while the environment is active.

fscat tar://tests/test.tar.gz testdir/test.txt
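To see how an entry point ties a command name to a function, the sketch below parses a small configuration with the standard configparser module. The configuration text is hypothetical: it is written in the spirit of the fscat package, but the real setup.cfg in the repository may differ, and in particular the function name main is our assumption.

```python
import configparser

# Hypothetical configuration; consult the repository for the real file.
EXAMPLE_SETUP_CFG = """
[metadata]
name = matfyz-nswi177-fscat
version = 0.0.1

[options]
packages = fscat
install_requires =
    fs

[options.entry_points]
console_scripts =
    fscat = fscat.main:main
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_SETUP_CFG)

requirements = config["options"]["install_requires"].split()
print("dependencies:", requirements)

# Each console_scripts line reads "command = package.module:function".
scripts = config["options.entry_points"]["console_scripts"]
for line in scripts.strip().splitlines():
    command, target = (part.strip() for part in line.split("=", 1))
    module, function = target.split(":")
    print(f"{command!r} runs {function}() from module {module}")
```

The wrapper script generated by pip does essentially what the last line describes: it imports the named module and calls the named function.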

Now uninstall the package with:

pip uninstall matfyz-nswi177-fscat

Clone the repository to your local machine and change directory to it. Now run:

pip install -e .

pip install -e produces an editable installation for easy debugging. Instead of copying your code to the virtual environment, it installs only a symlink-like thing (actually, an fscat.egg-link file, which has a similar effect on Python’s mechanism for finding modules) referring to the directory with your source files.
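A quick way to convince yourself where Python actually loads a module from is to ask the imported module itself. In the sketch below, json is only a stand-in that is always importable; the helper name module_location is ours. After the editable installation, try it with fscat and compare the result with a regular installation.

```python
import importlib
from pathlib import Path


def module_location(name):
    """Import the module and return the directory its code was loaded from."""
    module = importlib.import_module(name)
    return Path(module.__file__).resolve().parent


# Replace 'json' with 'fscat' once the package is installed.
print(module_location("json"))
```

With an editable installation, the printed directory should be inside your clone of the repository rather than inside my-venv/lib.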

Building a Python package

Now that we have the proper directory structure, we are only two steps away from publishing the package to a package registry.

First, we prepare distribution packages for our code. We install the build package by invoking pip install build. Then we can run

python3 -m build

Two files are created in the dist subdirectory:

  • matfyz-nswi177-fscat-0.0.1.tar.gz – a source code archive

  • matfyz_nswi177_fscat-0.0.1-py3-none-any.whl – a wheel file, which is the built package (py3 is the Python version required, none and any tell that this is a platform-independent package).
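The components of a wheel file name are separated by dashes, which is also why the dashes in the project name were replaced by underscores. A small sketch that splits the name into its parts (the function parse_wheel_name is ours, written for illustration only):

```python
def parse_wheel_name(filename):
    """Split 'name-version-python-abi-platform.whl' into its components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:            # an optional build tag follows the version
        del parts[2]
    if len(parts) != 5:
        raise ValueError("unexpected number of components")
    keys = ("distribution", "version", "python", "abi", "platform")
    return dict(zip(keys, parts))


print(parse_wheel_name("matfyz_nswi177_fscat-0.0.1-py3-none-any.whl"))
```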

Note that the wheel file is nothing more than a simple Zip archive.

$ file dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl
dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl: Zip archive data, at least v2.0 to extract, compression method=deflate

$ unzip -l dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl
Archive:  dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
       51  2023-04-11 13:17   fscat/__init__.py
      837  2023-04-11 14:44   fscat/fscat.py
      777  2023-04-11 14:19   fscat/main.py
     1075  2023-04-11 15:20   matfyz_nswi177_fscat-0.0.1.dist-info/LICENSE
     1039  2023-04-11 15:20   matfyz_nswi177_fscat-0.0.1.dist-info/METADATA
       92  2023-04-11 15:20   matfyz_nswi177_fscat-0.0.1.dist-info/WHEEL
       42  2023-04-11 15:20   matfyz_nswi177_fscat-0.0.1.dist-info/entry_points.txt
        6  2023-04-11 15:20   matfyz_nswi177_fscat-0.0.1.dist-info/top_level.txt
      769  2023-04-11 15:20   matfyz_nswi177_fscat-0.0.1.dist-info/RECORD
---------                     -------
     4688                     9 files
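Because a wheel is a Zip archive, the standard zipfile module can read it as well. The sketch below creates a tiny stand-in archive (the demo package is made up) so that it runs on its own; point list_wheel, a helper of ours, at a real file from dist/ to get a listing similar to the one above.

```python
import tempfile
import zipfile
from pathlib import Path


def list_wheel(path):
    """Return (size, name) pairs for every file stored in the wheel."""
    with zipfile.ZipFile(path) as wheel:
        return [(item.file_size, item.filename) for item in wheel.infolist()]


# Stand-in archive with a made-up package, just to have something to list.
demo = Path(tempfile.mkdtemp()) / "demo-0.0.1-py3-none-any.whl"
with zipfile.ZipFile(demo, "w") as wheel:
    wheel.writestr("demo/__init__.py", "")
    wheel.writestr("demo-0.0.1.dist-info/METADATA", "Name: demo\nVersion: 0.0.1\n")

for size, name in list_wheel(demo):
    print(f"{size:9d}  {name}")
```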

You may wonder why there are two archives with very similar content. The answer can be found in What Are Python Wheels and Why Should You Care?.

You can now switch to a different virtualenv and install the package using pip install package.whl.

Publishing a Python package

If you think that the package could be useful to other people, you can publish it in the Python Package Index. This is usually accomplished using the twine tool. The precise steps are described in Uploading the distribution archives.

Creating distribution packages (e.g. for DNF)

While the work for creating the project files may seem to complicate things a lot, it actually saves time in the long run.

Virtually any Python developer would be now able to install your program and have a clear starting point when investigating other details.

Note that if you have installed some program via DNF system-wide and that program was written in Python, somewhere inside it, there was setup.cfg that looked very similar to the one you have just seen. Only instead of installing the script into your virtual environment, it was installed globally.

There is really no other magic behind it.

Note that for example Ranger is written in Python and this script describes its installation (it is a script for creating packages for DNF). Note that %py3_install is a macro that actually calls setup.py install.

Higher-level tools

We can think of pip and virtualenv as low-level tools. However, there are also tools that combine both of them and bring more comfort to package management. In Python, there are at least two favorite choices, namely Poetry and Pipenv.

Internally, these tools use pip and venv, so you are still able to have independent working spaces as well as the possibility to install a specific package from the Python Package Index (PyPI).

A complete introduction to these tools is out of the scope of this course. Generally, they follow the same principles, but they add some extra functions that are nice to have. Briefly, the major differences are:

  • They can freeze specific versions of dependencies, so that the project builds the same on all machines (using poetry.lock file).
  • Packages can be removed together with their dependencies.
  • It is easier to initialize a new project.

Other languages

Other languages have their own tools with similar functions.

Before-class tasks (deadline: start of your lab, week April 24 - April 28)

The following tasks must be solved and submitted before attending your lab. If you have your lab on Wednesday at 10:40, the files must be pushed to your repository (project) at GitLab on Wednesday at 10:39 at the latest.

For virtual lab the deadline is Tuesday 9:00 AM every week (regardless of vacation days).

All tasks (unless explicitly noted otherwise) must be submitted to your submission repository. For most of the tasks there are automated tests that can help you check completeness of your solution (see here how to interpret their results).

11/romandate.py (100 points, group devel)

Write a Python program that uses the packages roman and dateparser to print the date specified by the user in Roman numerals.

The program is best described by examples provided below (assuming they were executed on April 24, 2023).

./romandate.py
XXIV.IV.MMXXIII
./romandate.py 2021-01-01
I.I.MMXXI
./romandate.py 40 years ago
XXIV.IV.MCMLXXXIII

The tests assume that they are executed inside a virtual environment where the above-mentioned packages are already installed (when executed on GitLab, the tests install these two packages for you automatically).

The provided tests only check for exact dates (when evaluating it after submission deadline we would insert the current date into the tests).

Do not forget to check that your solution also works

  • when executed without parameters (time now)
  • when executed with relative dates such as 5 days ago

Post-class tasks (deadline: May 14)

We expect you will solve the following tasks after attending the labs and hearing feedback to your before-class solutions.

All tasks (unless explicitly noted otherwise) must be submitted to your submission repository. For most of the tasks there are automated tests that can help you check completeness of your solution (see here how to interpret their results).

11/project-name/ (70 points, group devel)

Prepare a Python package that provides a project-name command that tries to auto-detect project name.

Similarly to one of your very first tasks in this course, the program will look into README.md and README files for the first non-empty line (again stripping extra whitespace and leading # in *.md files).

When neither README.md nor README is present, the program will try to find the top directory of a Git project (consider using the search_parent_directories=True constructor parameter and the working_tree_dir property of Repo from GitPython) and print its basename.

If the current directory is not a part of a Git project, the program will print the basename of the current directory.

We expect that the following would work (probably best executed in a virtual environment).


project-name
# Prints 'NSWI177 Submission Repository'
cd 01
project-name
# Prints 'student-LOGIN'
cd ../../
project-name
# Prints directory name of the parent directory of your submission repository clone

We expect that you will set up a proper src subdirectory and organize your package properly using setup.cfg etc.

Feel free to reuse parts of your solution of the 01 task.

The automated tests always create a new virtual environment for each test case. That is good for final check. But it is also possible to execute the tests inside activated virtual environment where they expect that the project-name command is already installed by setting NSWI177_LAB11_NO_INSTALL=true (i.e., they skip the pip install 11/project-name part which makes them much faster):

env NSWI177_LAB11_NO_INSTALL=true ./bin/run_tests.sh 11-post/project_name

11/fat.txt (30 points, group admin)

The file linux.ms.mff.cuni.cz:~/lab11.fat.img is a disk image with a single file. Paste its (decompressed) content into 11/fat.txt (to your GitLab submission repository).

Note that we can create the source file ~/lab11.fat.img only after you log in to the remote machine for the first time.

If the file is not there, wait for the next work day for the file to appear.

Do not leave this task for the last minute and contact us if the file has not appeared as explained in the previous paragraph.

This task is not fully checked by the automated tests.

Learning outcomes

Learning outcomes provide a condensed view of fundamental concepts and skills that you should be able to explain and/or use after each lesson. They also represent the bare minimum required for understanding subsequent labs (and other courses as well).

Conceptual knowledge

Conceptual knowledge is about understanding the meaning and context of given terms and putting them into context. Therefore, you should be able to …

  • explain what is a disk image

  • explain why no special tools are required for working with disk images

  • explain difference between normal files, directories, symbolic links, device files and system-state files (e.g. from /proc filesystem)

  • list fundamental top-level directories on a typical Linux installation and describe their function

  • explain in general terms how the directory tree is formed by mounting individual (file) subsystems

  • explain what are requirements (library dependencies)

  • explain fundamentals of semantic versioning

  • explain what are pros and cons of installing dependencies system-wide vs installing them in a sandboxed environment

  • provide a high-level overview of a sandbox environment

  • explain pros and cons of specifying transitive requirements vs specification of top-level ones only

  • explain pros and cons of using exact versions vs minimal requirements

  • explain why Linux maintains separation of archiving and compression programs (e.g. tar and gzip)

Practical skills

Practical skills are usually about usage of given programs to solve various tasks. Therefore, you should be able to …

  • mount disks using the mount command (both physical disks as well as images)

  • get summary information about disk usage with df command

  • use either tar or atool to work with standard Linux archives

  • create a new virtual environment for Python using python3 -m venv

  • activate and deactivate virtual environment

  • install project dependencies in a virtual environment with pip

  • develop program inside a virtual environment (with projects using setup.cfg and pyproject.toml files)

  • install Python project from its setup.cfg

  • optional: use lsblk to view available block (storage) devices

  • optional: set up a Python project for installation

This page changelog

  • 2023-04-17: Replace occurrences of python with python3 for better clarity.

  • 2023-04-20: Automated tests for before class tasks.

  • 2023-04-28: Automated tests for post class tasks.

  • 2023-06-14: Note about mounting disks in VirtualBox.