In this lab we will see how to simplify building complex software. We will also have a look at working with file archives and basics of storage maintenance in Linux.
Preflight checklist
- You know shell control flow constructs by heart.
Running example
We will return again to our website generation example and use it as a running example for most of this lab.
We will again use the simpler version that looked like this:
#!/bin/bash
set -ueo pipefail
pandoc --template template.html index.md >index.html
pandoc --template template.html rules.md >rules.html
./table.py <score.csv | pandoc --template template.html --metadata title="Score" - >score.html
Notice that for index and rules, there are Markdown files to generate HTML from. The score page is generated from a CSV data file.
Setup
Please create a fork of the web repository so that you can try the examples yourself (we will reuse this repository in one of the following labs, so do not remove it yet).
Motivation for using build systems
In our running example, the whole website is built in several steps where HTML pages are generated from different sources. That is actually very similar to how software is built from sources (consider sources in the C language that are compiled and linked together).
While the above steps do not build an executable from sources (as is the typical case for software development), they represent a typical scenario.
Building software usually consists of many steps that can include actions as different as:
- compiling source files to some intermediate format
- linking the final executable
- creating bitmap graphics in different resolution from a single vector image
- generating source-code documentation
- preparing localization files with translation
- creating a self-extracting archive
- deploying the software on a web server
- publishing an artefact in a package repository
- …
Almost all of them are simple by themselves. What is complex is their orchestration. That is, how to run them in the correct order and with the right options (parameters).
For example, before an installer can be prepared, all other files have to be ready. Localization files often depend on precompilation of some sources but have to be prepared before the final executable is linked. And so on.
Even for small projects, the number of steps can be quite high, yet they are – in a sense – unimportant: you do not want to remember them, you want to build the whole thing!
Note that your IDE can often help you with all of this – with a single click. But not everybody uses the same IDE and you may not even have a graphical interface at all.
Furthermore, you typically want to run the build as part of each commit – the GitLab pipelines we use for tests are a typical example: they execute without a GUI, yet we want to build the software (and test it too). Codifying the build in a script simplifies this for virtually everyone.
Our build.sh script mentioned above is actually pretty nice. It is easy to understand, contains no complex logic, and a new member of the team does not need to investigate all the tiny details: they can just run the single build.sh script.
The script is nice, but it overwrites all files even if nothing has changed. In our small example, it is no big deal (you have a fast computer, after all).
But in a bigger project where we, for example, compile thousands of files (e.g., look at the source tree of the Linux kernel, Firefox, or LibreOffice), it matters.
If an input file was not changed (e.g., we modified only rules.md), we do not need to regenerate the other files (e.g., we do not need to re-create index.html).
Let’s extend our script a bit.
...
# Succeeds (returns 0) when ${barename}.html needs to be (re)generated.
should_generate() {
    local barename="$1"
    # Regenerate when the HTML file does not exist yet ...
    if ! [ -e "${barename}.html" ]; then
        return 0
    fi
    # ... or when the Markdown source is newer than the generated HTML.
    if [ "${barename}.md" -nt "${barename}.html" ]; then
        return 0
    else
        return 1
    fi
}
...
should_generate index && pandoc --template template.html index.md >index.html
should_generate rules && pandoc --template template.html rules.md >rules.html
...
We can do that for every command to speed up the web generation.
But.
That is a lot of work. And the time saved would probably all be wasted by rewriting our script. Not to mention that the result looks horrible. And it is rather expensive to maintain.
Also, we often need to build just a part of the project: e.g., regenerate the documentation only (without publishing the result, for example). Although extending the script in the following way is possible, it certainly is not viable for large projects.
if [ -z "${1:-}" ]; then
    ... # build here
elif [ "${1:-}" = "clean" ]; then
    rm -f index.html rules.html score.html
elif [ "${1:-}" = "publish" ]; then
    cp index.html rules.html score.html /var/www/web-page/
else
    ...
Luckily, there is a better way.
There are special tools, usually called build systems, that have a single purpose: to orchestrate the build process. They provide the user with a high-level language for capturing the above-mentioned steps for building software.
In this lab, we will focus on make. make is a relatively old build system, but it is still widely used. It is also one of the simplest tools available: you need to specify most of the things manually, but that is great for learning. You will have full control over the process and you will see what is happening behind the scenes.
make
Move into the root directory of (the local clone of your fork of) the web example repository first, please. The files in this directory are virtually the same as in our shell script above, but there is one extra file: Makefile.
Notice that Makefile is written with a capital M to be easily distinguishable (ls in a non-localized setup sorts uppercase letters first). This file is a control file for the build system called make that does exactly what we tried to imitate in the previous example. It contains a sequence of rules for building files.
We will get to the exact syntax of the rules soon, but let us play with them first. Execute the following command:
make
You will see the following output (if you have executed some of the commands manually, the output may differ):
pandoc --template template.html index.md >index.html
pandoc --template template.html rules.md >rules.html
make prints the commands it executes as it runs them. It has built the website for us: notice that the HTML files were generated. For now, we do not generate the version.inc.html file at all.
Execute make again.
make: Nothing to be done for 'all'.
As you can see, make was smart enough to recognize that since no file was changed, there is no need to run anything.
Update index.md (touch index.md would work too) and run make again. Notice how index.html was rebuilt while rules.html remained untouched:
pandoc --template template.html index.md >index.html
This is called an incremental build (we build only what was needed instead of building everything from scratch).
As we mentioned above: this is not very interesting in our tiny example. However, once there are thousands of input files, the difference is enormous.
It is also possible to execute make index.html to ask for rebuilding just index.html. Again, the build is incremental.
If you wish to force a rebuild, execute make with -B. Often, this is called an unconditional build.
In other words, make allows us to capture the simple individual commands needed for a project build (no matter if we are compiling and linking C programs or generating a web site) into a coherent script. It rebuilds only things that need rebuilding and, more interestingly, it takes care of dependencies. For example, if scores.html is generated from scores.md that is built from scores.csv, we only need to specify how to build scores.md from scores.csv and how to create scores.html from scores.md, and make will ensure the proper ordering.
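For illustration, such a two-step chain could be captured roughly like this (a minimal sketch: csv2md is a hypothetical converter, not part of the repository, and the command lines must be indented with a tab):

scores.md: scores.csv
	./csv2md <scores.csv >scores.md

scores.html: scores.md template.html
	pandoc --template template.html scores.md >scores.html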
Makefile explained
Makefile is a control file for the build system named make. In essence, it is a domain-specific language to simplify setting up the script with the should_generate constructs we mentioned above.
make distinguishes tabs and spaces. All indentation in the Makefile must be done using tabs, so you have to make sure that your editor does not expand tabs to spaces. This is also a common issue when copying fragments from a web browser. (Usually, your editor will recognize that Makefile is a special file name and switch to a tabs-only policy by itself.) If you use spaces instead, you will typically get an error like Makefile:LINE_NUMBER: *** missing separator. Stop.
The Makefile contains a sequence of rules. A rule looks like this:
index.html: index.md template.html
	pandoc --template template.html index.md >index.html
The name before the colon is the target of the rule. That is usually a file name that we want to build. Here, it is index.html.
The rest of the first line is the list of dependencies – files from which the target is built. In our example, the dependencies are index.md and template.html. In other words: when these files (index.md and template.html) are modified, we need to rebuild index.html.
The third part consists of the following lines, which have to be indented by a tab. They contain the commands that have to be executed for the target to be built. Here, it is the call to pandoc.
make runs the commands if the target is out of date. That is, either the target file is missing, or one or more dependencies are newer than the target.
The rest of the Makefile is similar. There are rules for other files and also several special rules.
Special rules
The special rules are all, clean, and .PHONY. They do not specify files to be built, but rather special actions.
all is a traditional name for the very first rule in the file. It is called the default rule and it is built if you run make with no arguments. It usually has no commands and it depends on all files which should be built by default.
clean is a special rule that has only commands, but no dependencies. Its purpose is to remove all generated files if you want to clean up your work space. Typically, clean removes all files that are not versioned (i.e., not under Git control). This can be considered a misuse of make, but one with a long tradition.
From the point of view of make, the targets all and clean are still treated as file names. If you create a file called clean, the special rule will stop working, because the target will be considered up to date (it exists and no dependency is newer). To avoid this trap, you should explicitly tell make that the target is not a file. This is done by listing it as a dependency of the special target .PHONY (note the leading dot).
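Put together, the special rules for our example could look roughly like this (a minimal sketch mirroring the shell script above; remember that command lines are indented with a tab):

all: index.html rules.html score.html

clean:
	rm -f index.html rules.html score.html

.PHONY: all clean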
Generally, you can see that make has plenty of idiosyncrasies. It is often so with programs which started as a simple tool and underwent 40 years of incremental development, slowly accruing features. Still, it is one of the most frequently used build systems. Also, it often serves as a back-end for more advanced tools – they generate a Makefile from a more friendly specification and let make do the actual work.
Exercise
Improving the maintainability of the Makefile
The Makefile is starting to contain too much repeated code. But make can help you with that too.
Let's remove all the rules for generating out/*.html from *.md and replace them with:
out/%.html: %.md template.html
	pandoc --template template.html -o $@ $<
That is a pattern rule which captures the idea that HTML is generated from Markdown. The percent sign in the target and dependency specification represents the so-called stem – the variable (i.e., changing) part of the pattern.
In the command part, we use make variables. make variables start with a dollar sign as in the shell, but they are not the same. $@ is the actual target and $< is the first dependency.
Run make clean && make to verify that even with pattern rules, the web is still generated.
Apart from pattern rules, make also understands (user) variables. They can improve readability as you can separate configuration from commands. For example:
PAGES = \
out/index.html \
out/rules.html \
out/score.html
all: $(PAGES) ...
...
Note that unlike in the shell, variables are expanded by the $(VAR) construct. (Except for the special variables such as $<.)
Non-portable extensions
make is a very old tool that exists in many different implementations. The features mentioned so far should work with any version of make. (At least a reasonably recent one. Old makes did not have .PHONY or pattern rules.)
The last addition will work in GNU make only (but that is the default on Linux, so there should not be any problem).
We will change the Makefile as follows:
PAGES = \
index \
rules \
score
PAGES_TMP=$(addsuffix .html, $(PAGES))
PAGES_HTML=$(addprefix out/, $(PAGES_TMP))
We keep only the basename of each page and we compute the output path. $(addsuffix ...) and $(addprefix ...) are calls to built-in functions. Formally, all function arguments are strings, but in this case, the whitespace-separated names are treated as a list.
Note that we added PAGES_TMP only to improve readability when using this feature for the first time. Normally, you would assign PAGES_HTML directly:
PAGES_HTML=$(addprefix out/, $(addsuffix .html, $(PAGES)))
This will prove even more useful when we want to generate a PDF for each page, too. We can add a pattern rule and build the list of PDFs using $(addsuffix .pdf, $(PAGES)).
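A possible sketch of that extension (assuming pandoc has a PDF engine, such as a LaTeX installation, available; the out/ prefix follows the convention used above):

PAGES_PDF = $(addprefix out/, $(addsuffix .pdf, $(PAGES)))

out/%.pdf: %.md
	pandoc -o $@ $<

all: $(PAGES_HTML) $(PAGES_PDF)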
Further exercises …
… are at the end of this page.
Storage management
Before proceeding, recall that files reside on file systems, which are structures on the actual block devices (typically, disks).
Working with file systems and block devices is necessary when installing a new system, rescuing from a broken device, or simply checking available free space.
You are already familiar with normal files and directories. But there are other types of files that you can find on a Linux system.
Symbolic links
Linux allows you to create a symbolic link to another file. This special file does not contain any content by itself and merely points to another file.
An interesting feature of a symbolic link is that it is transparent to the standard file I/O API. If you call Python's open on a symbolic link, it will transparently open the file the symbolic link points to. That is the intended behavior.
The purpose of symbolic links is to allow different perspectives on the same files without need for any copying and synchronization.
For example, imagine a movie player that is able to play only files in the directory Videos. However, you actually keep the movies elsewhere, because they are on a shared hard drive. With a symbolic link, you can make Videos point to the actual storage and make the player happy. (For the record, we do not know about any movie player with such behaviour, but there are plenty of other programs where such magic can make them work in a complex environment they were not originally designed for.)
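A hypothetical session could look like this (the paths are made up for illustration and we assume ~/Videos does not exist yet):

# Point ~/Videos at the shared storage instead of copying the files
ln -s /mnt/shared/movies ~/Videos
# ls -l shows where the link points
ls -l ~/Videos
# Reading through the link transparently reads the target file
cat ~/Videos/playlist.txt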
Note that a symbolic link is something different from what you may know as a desktop shortcut or similar. Such shortcuts are actually normal files in which you can specify which icon to use and which also contain information about the actual file. Symbolic links operate on a lower level.
Special files
There are also other special files that represent physical devices or files that serve as a spy-hole into the state of the system.
The reason is that it is much simpler for the developer that way. You do not need special utilities to work with a disk, you do not need a special program to read the amount of free memory. You simply read the contents of a well-known file and you have the data.
It is also much easier to test such programs because you can easily give them mock files by changing the file paths – a change that is unlikely to introduce a serious bug into the program.
Usually, Linux offers the files that reveal the state of the system in a textual format. For example, the file /proc/meminfo can look like this:
MemTotal: 7899128 kB
MemFree: 643052 kB
MemAvailable: 1441284 kB
Buffers: 140256 kB
Cached: 1868300 kB
SwapCached: 0 kB
Active: 509472 kB
Inactive: 5342572 kB
Active(anon): 5136 kB
Inactive(anon): 5015996 kB
Active(file): 504336 kB
Inactive(file): 326576 kB
...
This file is nowhere on the disk but when you open this path, Linux creates the contents on the fly.
Notice how the information is structured: it is a textual file, so reading it requires no special tools and the content is easily understood by a human. On the other hand, the structure is quite rigid: each line is a single record, keys and values are separated by a colon. Easy for machine parsing as well.
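Thanks to this format, extracting a single value is a one-liner (a small sketch; the field name must match exactly):

# Print the amount of available memory (in kB)
awk '/^MemAvailable:/ { print $2 }' /proc/meminfo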
File system hierarchy
We will now briefly list some of the key files you can find on virtually any Linux machine.
Do not be afraid to actually display the contents of the files we mention here. hexdump -C is really a great tool.
/boot contains the bootloader for loading the operating system. You would rarely touch this directory once the system is installed.
/dev is a very special directory where hardware devices have their file counterparts. You will probably see there a file sda or nvme0 that represents your hard drive (or SSD). Unless you are running as the superuser, you will not have access to these files, but if you were to hexdump them, you would see the bytes as they are on the actual drive.
It is important to note that these files are not physical files on your disk (after all, it would mean having a disk inside a disk). When you read from them, the kernel recognizes that and returns the right data.
This directory also contains several special but very useful files for software development.
/dev/urandom returns random bytes indefinitely. It is probably used internally by your favorite programming language to implement its random() function. Try to run hexdump on this file (and recall that <Ctrl>-C will terminate the program once you are tired of the randomness).
There are more useful special files: /dev/full emulates a full disk, /dev/null discards everything written to it, and /dev/zero supplies an infinite stream of zero bytes.
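A few quick experiments you can try as a normal user:

# 16 random bytes from the kernel random generator
head -c 16 /dev/urandom | hexdump -C
# 8 zero bytes from /dev/zero
head -c 8 /dev/zero | hexdump -C
# Writing to /dev/full fails with "No space left on device"
echo "hello" >/dev/full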
/etc/ contains system-wide configuration. Typically, most programs in UNIX systems are configured via text files. The reasoning is that an administrator needs to learn only one tool – a good text editor – for system management. Another advantage is that most configuration files support comments, so even the configuration itself can be documented in place. For an example of such a configuration file, have a look at /etc/systemd/system.conf to get the feeling.
Perhaps the most important file is /etc/passwd, which contains the list of user accounts. Note that it is a plain text file where each row represents one record and individual attributes are simply separated by a colon (:). Very simple to read, very simple to edit, and very simple to understand. In other words, the KISS principle in practice.
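Because of this format, standard text tools are all you need to inspect it; for example (a small illustration, following the passwd(5) field layout):

# Print user names (first field) and their login shells (seventh field)
cut -d: -f1,7 /etc/passwd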
/home contains home directories for normal user accounts (i.e., accounts for real – human – users).
/lib and /usr contain dynamic libraries, applications, and system-wide data files.
/var is for volatile data. If you installed a database or a web server on your machine, its files would be stored here.
/tmp is a generic location for temporary files. This directory is automatically cleaned on each reboot, so do not use it for permanent storage. Many systems also automatically remove files which were not modified in the last few days.
/proc is a virtual file system that allows controlling and reading of kernel (operating system) settings. For example, the file /proc/meminfo contains quite detailed information about RAM usage.
Again, the files under /proc are not normal files, but virtual ones. Until you read them, their contents do not exist physically anywhere. Only when you open, e.g., /proc/meminfo does the kernel read its internal data structures, prepare the content (in memory only), and give it to you. It is not that this file would be physically rewritten every 5 seconds or so to contain the most up-to-date information.
Mounts and mount-points
Each file system (that we want to access) is made accessible as a directory somewhere (compare this to a drive letter in other systems, for example). When we access /dev/sda3 under /home, we say that /dev/sda3 is mounted under /home; /home is then called the mount point and /dev/sda3 is often called a volume.
Most devices are mounted automatically during boot. This includes / (root) where the system resides as well as /home where your data reside.
File systems under /dev or /proc are actually special file systems that are mounted to these locations. Hence, the file /proc/uptime does not physically exist (i.e., there is no disk block with its content anywhere on your hard drive) at all.
The file systems that are mounted during boot are listed in /etc/fstab. You will rarely need to change this file on your laptop; it was created for you during installation. Note that it contains the volume identification (such as the path to the partition), the mount point, and some extra options.
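To give you an idea, a single entry might look like this (an illustrative line only, the UUID is made up):

# <volume>                                   <mount point>  <type>  <options>  <dump>  <pass>
UUID=1234abcd-56ef-78ab-90cd-ef1234567890    /home          ext4    defaults   0       2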
When you plug in a removable USB drive, your desktop environment will typically mount it automatically. Mounting it manually is also possible using the mount utility. However, mount has to be run as root to work (this thread explains several aspects of why mounting a volume could be a security risk). Therefore, you need to play with this on your own installation where you can become root. It will not work on any of the shared machines.
Technical note: the above text may seem contradictory, as mount requires the root password, yet your desktop environment (DE) may mount the drive automatically without asking for any password. Internally, your DE does not call mount, but talks to daemons called Udisks and Polkit which run with root privileges. Together, the daemons verify that the mounted device is actually a removable one and that the user is a local one (i.e., it will not work over SSH). If these conditions are satisfied, the disk is mounted for the given user. By the way, you can talk to Udisks from the shell using udisksctl.
To test manual mounting, plug in your USB device and unmount it in your GUI if it was mounted automatically (note that the device is usually mounted somewhere under /media). Your USB drive will probably be available as /dev/sdb1 or /dev/sda1, depending on what kind of disk you have (consult the following section about lsblk to view the list of drives).
Mounting disks is not limited to physical drives only. We will talk about disk images in the next section but there are other options, too. It is possible to mount a network drive (e.g., NFS or AFS used in MFF labs) or even create a network block device and then mount it.
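For instance, mounting an NFS share is conceptually the same as mounting a local partition (a sketch only; the server name and export path are made up and the NFS client utilities must be installed):

sudo mkdir /mnt/nfs-data
sudo mount -t nfs fileserver.example.com:/export/data /mnt/nfs-data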
Working with disk images
Linux has built-in support for working with disk images. That is, with files with content mirroring a real disk drive. As a matter of fact, you probably already worked with them when you set up Linux in a virtual machine or when you downloaded the USB disk image at the beginning of the semester.
Linux allows you to mount such an image as if it were a real physical drive and modify the files on it. That is essential for the following areas:
- Developing and debugging file systems (rare)
- Extracting files from virtual machine hard drives
- Recovering data from damaged drives (rare, but priceless)
In all cases, to mount the disk image we need to tell the system to access the file in the same way as it accesses other block devices (recall /dev/sda1 from the example above).
Mounting disks manually
sudo mkdir /mnt/flash
sudo mount /dev/sdb1 /mnt/flash
Your data should be visible under /mnt/flash.
To unmount, run the following command:
sudo umount /mnt/flash
Note that running mount without any arguments prints a list of currently active mounts. For this, root privileges are not required.
Specifying volumes
So far, we always used the name of the block device (e.g., /dev/sdb1) to specify the volume. While this is trivial on small systems, it can be incredibly confusing on larger ones – device names depend on the order in which the system discovered the disks. This order can vary between boots and it is even less stable with removable drives. You do not want to let a randomly connected USB flash disk render your machine non-bootable :-).
A more stable way is to refer to block devices using symlinks named after the physical location in the system. For example, /dev/disk/by-path/pci-0000:03:00.1-ata-6-part1 refers to partition 1 of a disk connected to port 6 of a SATA controller which resides as device 00.1 on PCI bus 0000:03.
In most cases, it is even better to describe the partition by its contents. Most filesystems have a UUID (universally unique identifier, a 128-bit number, usually randomly generated) and often also a disk label (a short textual name). You can run lsblk -f to view the UUIDs and labels of all partitions and then call mount with UUID=number or LABEL=name instead of the block device name. Your /etc/fstab will likely refer to your volumes in one of these ways.
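In practice, that could look like this (a sketch; substitute a UUID actually printed by lsblk -f on your system):

lsblk -f
sudo mkdir /mnt/data
sudo mount UUID=9b6d3a1e-7c2f-4f61-8a3d-0e5b2c9d4f10 /mnt/data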
Mounting disk images
Disk images can be mounted in almost the same way as block devices; you only have to add the -o loop option to mount. Recall that mount requires root (sudo) privileges, hence you need to execute the following example on your own machine, not on any of the shared ones.
To try that, you can download this FAT image and mount it.
sudo mkdir /mnt/photos-fat
sudo mount -o loop photos.fat.img /mnt/photos-fat
... (work with files in /mnt/photos-fat)
sudo umount /mnt/photos-fat
Alternatively, you can run udisksctl loop-setup to add the disk image as a removable drive that can be automatically mounted in your desktop environment:
# Using udisksctl and auto-mounting in GUI
udisksctl loop-setup -f fat.img
# This will probably print /dev/loop0 but it can have a different number
# Now mount it in GUI (might happen completely automatically)
... (work with files in /run/media/$(whoami)/07C5-2DF8/)
udisksctl loop-delete -b /dev/loop0
Disk space usage utilities
The basic utility for checking available disk space is df (disk free):
Filesystem 1K-blocks Used Available Use% Mounted on
devtmpfs 8174828 0 8174828 0% /dev
tmpfs 8193016 0 8193016 0% /dev/shm
tmpfs 3277208 1060 3276148 1% /run
/dev/sda3 494006272 7202800 484986880 2% /
tmpfs 8193020 4 8193016 1% /tmp
/dev/sda1 1038336 243188 795148 24% /boot
In the default execution (above), it uses one-kilobyte blocks. For a more readable output, run it with -BM or -BG (megabytes or gigabytes) or with -h to let it select the most suitable unit.
Do not confuse df with du, which can be used to estimate file space usage. Typically, you would run du as du -sh DIR to print the total space occupied by all files in DIR. You could use du -sh ~/* to print summaries for the top-level directories in your $HOME. But be careful, as it can take quite some time to scan everything.
Also, you can observe that the space usage reported by du is not equal to the sum of all file sizes. This happens because files are organized in blocks, so file sizes are typically rounded up to a multiple of the block size. Besides that, directories also consume some space.
To see how volumes (partitions) are nested and which block devices are recognized by your kernel, you can use lsblk. On the shared machine, the following will appear:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 480G 0 disk
├─sda1 8:1 0 1G 0 part /boot
├─sda2 8:2 0 7.9G 0 part [SWAP]
└─sda3 8:3 0 471.1G 0 part /
This shows that the machine has a 480G disk divided into three partitions: a tiny /boot for bootstrapping the system, an 8G swap partition, and finally about 470G left for system and user data. We are not using a separate volume for /home.
You can find many other output formats in the man page.
Inspecting and modifying volumes (partitions)
We will leave this topic to a more advanced course. If you wish to learn by yourself, you can start with the following utilities:
fdisk(8)
btrfs(8)
mdadm(8)
lvm(8)
Check that you understand it all
File archiving and compression
A somewhat related topic to the above is how Linux handles file archival and compression.
Archiving on Linux systems typically refers to merging multiple files into one (for easier transfer) and compressing this file (to save space). Sometimes, only the first step (i.e., merging) is considered archiving.
While these two actions are usually performed together, Linux keeps the distinction, as it allows combining the right tools and formats for each part of the job. Note that on other systems where the ZIP file is the preferred format, these actions are blended into one.
The most widely used program for archiving is tar. Originally, its primary purpose was archiving on tapes, hence the name: tape archiver.
It is always run with an option specifying the mode of operation:
- -c to create a new archive from existing files,
- -x to extract files from the archive,
- -t to print the table of files inside the archive.
The name of the archive is given via the -f option; if no name is specified, the archive is read from standard input or written to standard output.
As usual, the -v option increases verbosity. For example, tar -cv prints the names of files added to the archive, tar -cvv prints also file attributes (like ls -l). (Everything is printed to stderr, so that stdout can still be used for the archive.) Plain tar -t prints only file names, tar -tv prints also file attributes.
An uncompressed archive can be created this way:
tar -cf archive.tar dir_to_archive/
A compressed archive can be created by piping the output of tar to gzip:
tar -c dir_to_archive/ | gzip >archive.tar.gz
As this is very frequent, tar supports a -z switch, which automatically calls gzip, so that you can write:
tar -czf archive.tar.gz dir_to_archive/
tar has further switches for other (de)compression programs: bzip2, xz, etc. Most importantly, the -a switch chooses the (de)compression program according to the name of the archive file.
If you want to compress a single file, plain gzip without tar is often used. Some tools or APIs can even process gzip-compressed files transparently.
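For example (a small sketch; access.log is just an illustrative file name):

# Compress the file in place, producing access.log.gz
gzip access.log
# View the compressed content without unpacking it
zcat access.log.gz | less
# Decompress it again
gunzip access.log.gz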
To unpack an archive, you can again pipe gzip -d (decompress) to tar, or use -z as follows:
tar -xzf archive.tar.gz
Note that tar will overwrite existing files without any warning.
We recommend installing atool as a generic wrapper around tar, gzip, unzip, and plenty of other utilities to simplify working with archives. For example:
apack archive.tar.gz dir_to_archive/
aunpack archive.tar.gz
Note that atool will not overwrite existing files by default (which is another very good reason for using it).
To view the list of files inside an archive, you can execute als.
Tasks to check your understanding
We expect you will solve the following tasks before attending the labs so that we can discuss your solutions during the lab.
Learning outcomes
Learning outcomes provide a condensed view of fundamental concepts and skills that you should be able to explain and/or use after each lesson. They also represent the bare minimum required for understanding subsequent labs (and other courses as well).
Conceptual knowledge
Conceptual knowledge is about understanding the meaning and context of given terms and putting them into context. Therefore, you should be able to …
- name several steps that are often required to create distributable software (e.g., a package or an installer) from source code and other basic artifacts
- explain why a software build should be a reproducible process
- explain how it is possible to capture a software build
- explain concepts of languages that are used for capturing the steps needed for a software build (distribution)
- explain what a disk image is
- explain why no special tools are required for working with disk images
- explain the difference between normal files, directories, symbolic links, device files, and system-state files (e.g., from the /proc filesystem)
- list fundamental top-level directories on a typical Linux installation and describe their function
- explain in general terms how the directory tree is formed by mounting individual file systems
- explain why Linux maintains the separation of archiving and compression programs (e.g., tar and gzip)
Practical skills
Practical skills are usually about usage of given programs to solve various tasks. Therefore, you should be able to …
- build a make-based project with default settings
- create a Makefile that drives the build of a simple project
- use wildcard rules in a Makefile
- optional: use variables in a Makefile
- optional: use basic GNU extensions to simplify complex Makefiles
- mount disks using the mount command (both physical disks as well as images)
- get summary information about disk usage with the df command
- use either tar or atool to work with standard Linux archives
- optional: use lsblk to view available block (storage) devices