In this lab we will see how to simplify building complex software and how to effectively do search (and replace) in textual data.
This lab also contains a mini homework for two points.
Reading network configuration
Before diving into the main topics, we will take a small detour to a practical skill that comes in very handy: how to view the network configuration of your machine from the command line.
We have already seen nmcli, but there are other tools. Among them is ip (from the iproute2 package), which can be used to configure networking as well (though rather on servers than on workstations, where NetworkManager is usually the default).
For the following text we will assume your machine is connected to the Internet (this includes your virtualized installation of Linux).
The basic command for setting and reading network configuration is ip. Probably the most useful invocation for us at the moment is ip addr.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 54:e1:ad:9f:db:36 brd ff:ff:ff:ff:ff:ff
3: wlp58s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 44:03:2c:7f:0f:76 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.105/24 brd 192.168.0.255 scope global dynamic noprefixroute wlp58s0
valid_lft 6209sec preferred_lft 6209sec
inet6 fe80::9ba5:fc4b:96e1:f281/64 scope link noprefixroute
valid_lft forever preferred_lft forever
8: vboxnet0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 0a:00:27:00:00:00 brd ff:ff:ff:ff:ff:ff
It lists four interfaces (lo, enp0s31f6, wlp58s0 and vboxnet0) that are available on the machine. Your list will differ, as will the naming. The name signifies the interface type.
- lo is the loopback device that will always be present. With the loopback device, you can test network applications even without “real” connectivity.
- enp0s31f6 (often also eth*) is a wired Ethernet adapter.
- wlp58s0 is a wireless adapter.
- vboxnet0 is a virtual network card used by VirtualBox when you create a virtual subnet for your virtual machines (you will probably not have this one there).

If you are connected via VPN, you might also see a tun0 interface.
The state of the interface (up and running or not) is on the same line as the adapter name. The link/ line denotes the MAC address of the adapter. Lines with inet specify the IP address assigned to this interface, including the network.
In this example, lo has 127.0.0.1/8 (obviously), enp0s31f6 is without an address (state DOWN), and wlp58s0 has address 192.168.0.105/24 (i.e., 192.168.0.105 with netmask 255.255.255.0).
Your addresses will be slightly different, but typically you will also see a private address (behind a NAT), as you are probably connecting to your ISP through a router.
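If you only need a quick overview, the iproute2 tools also offer a condensed output mode (the exact output will, of course, differ on your machine):

ip -brief addr    # one line per interface: name, state, addresses
ip route          # the routing table, including the default gateway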
Regular expressions (a.k.a. regexes)
We already mentioned that systems from the Unix family are built on top of text files. The utilities we have seen so far offered basic operations, but none of them was really powerful. Use of regular expressions will change that.
We will not cover the theoretical details – see the course on Automata and grammars for that. We will view regular expressions as simple tools for matching patterns in text.
For example, we might be interested in:
- lines starting with date and containing HTTP code 404,
- files containing our login,
- or a line preceding a line with a valid filename.
The most basic tool for matching files against regular expressions is called grep. If you run grep regex file, it prints all lines of file which match the given regex (with -F, the pattern is considered a fixed string, not a regular expression).
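For example, searching for a version string behaves differently with and without -F (versions.txt is just a hypothetical input file; the dot is a regex metacharacter, as we will see below):

grep '3.14' versions.txt     # the dot matches any character: finds 3.14 but also 3514 or 3x14
grep -F '3.14' versions.txt  # matches the literal string 3.14 only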
Regex syntax
In its simplest form, a regex searches for the given string (usually in a case-sensitive manner).
system
This matches all substrings system in the text. In grep, this means that all lines containing system will be printed.
If we want to search for lines starting with this word, we need to add the ^ anchor.
^system
If the line is supposed to end with a pattern, we need to use the $ anchor.
Note that it is safer to use single quotes in the shell to prevent any variable
expansion.
system$
Moreover, we can find all lines starting with either r, s, or t using the [...] list.
^[rst]
This looks like a wildcard, but regexes are more powerful and the syntax differs a bit.
For actual searching, we obviously need to pass this regular expression to grep like this (here we search in /etc/passwd):
grep '^[rst]' /etc/passwd
Let us find all three-digit numbers:
[0-9][0-9][0-9]
We can also find lines not starting with any letter between r and z. (The first ^ is an anchor, while the second one negates the set in [].)
^[^r-z]
The quantifier * denotes that the previous part of the regex can appear multiple times or not at all. For example, this finds all lines which consist of digits only (and matches empty lines too!):
^[0-9]*$
Note that this does not require that all digits are the same.
A dot . matches any single character (except newline). So the following regex matches lines starting with super and ending with ious:
^super.*ious$
When we want to apply the * to a more complex subexpression, we can surround it with (...). The following regex matches bana, banana, bananana, and so on:
ba(na)*na
If we use + instead of *, at least one occurrence is required. So this matches all decimal numbers:
[0-9]+
The vertical bar ("|
" a.k.a. the pipe) can separate alternatives. For example,
we can match lines composed of Meow
and Quork
:
^(Meow|Quork)*$
The [abc] construct is therefore just an abbreviation for (a|b|c).
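We can quickly check such a regex by piping a few sample lines to grep (the -E switch is explained below; it lets us write | and (...) without backslashes):

printf '%s\n' MeowQuorkMeow Meowww | grep -E '^(Meow|Quork)*$'
# prints only MeowQuorkMeow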
Another useful shortcut is the {N} quantifier: it specifies that the preceding regex is to be repeated N times. We can also use {N,M} for a range.
For example, we can match lines which contain 4 to 10 lower-case letters enclosed
in quotation marks:
^"[a-z]{4,10}"$
Finally, the backslash character changes whether the next character is considered special. The \. matches a literal dot, \* a literal asterisk. Beware that many regex dialects (including grep without further options) require +, (, |, and { to be escaped to be recognized as regex operators. (You can run grep -E or egrep to activate extended regular expressions, in which all special characters are recognized as operators without backslashes.)
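The two dialects can be compared side by side; both of the following commands print the input line:

echo bananana | grep 'ba\(na\)*na'     # basic regex: grouping needs backslashes
echo bananana | grep -E 'ba(na)*na'    # extended regex: operators work unescaped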
grep also returns a zero exit code when a match was found and a non-zero one otherwise. Therefore, it can be used in shell conditions like this:
if ! echo "$input" | grep 'regex'; then
echo "Input is not in correct format." >&2
...
fi
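If we only care about the exit code, the -q switch silences the output. A minimal sketch (requiring at least one digit, using only constructs shown above):

if ! echo "$input" | grep -q '^[0-9][0-9]*$'; then
    echo "Input is not a number." >&2
    exit 1
fi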
Text substitution
The full power of regular expressions is unleashed when we use them
to substitute patterns.
We will show this on sed
(a stream editor) which can perform regular
expression-based text transformations.
In its simplest form, sed replaces one word with another.
The command reads: substitute (s), then a single-character delimiter, followed by the text to be replaced (the left-hand side of the substitution), again the same delimiter, then the replacement (the right-hand side), and one final occurrence of the delimiter. (The delimiter is typically :, /, or #, but generally it can be any character that is not used without escaping in the rest of the command.)
sed 's:magna:angam:' lorem.txt
Note that this replaces only the first occurrence on each line. Adding a g modifier (for global) at the end of the command causes it to replace all occurrences:
sed 's:magna:angam:g' lorem.txt
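The difference is easy to see on a line with several occurrences:

echo 'magna magna magna' | sed 's:magna:angam:'    # angam magna magna
echo 'magna magna magna' | sed 's:magna:angam:g'   # angam angam angam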
The text to be replaced can be any regular expression, for example:
sed 's:[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]:DATE-REDACTED-OUT:g' lorem.txt
The right-hand side can refer to the text matched by the left-hand side. We can use & for the whole left-hand side or \N for the N-th group (...) in the left-hand side.
The following example transforms the date into the Czech form (DD. MM. YYYY). We have to escape the ( and ) characters to make them act as grouping operators instead of literal ( and ).
sed 's:\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\):\3. \2. \1:g'
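Both kinds of back-references can be demonstrated on made-up input (GNU sed also accepts -E for extended regexes, which removes the backslashes before the groups):

echo 'Updated on 2024-03-15.' | sed -E 's:([0-9]{4})-([0-9]{2})-([0-9]{2}):\3. \2. \1:g'
# prints: Updated on 15. 03. 2024.
echo 'error 404' | sed 's:[0-9][0-9]*:code &:'
# prints: error code 404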
Running example for the rest of the lab
We will return again to our website generation example and use it as a running example for the rest of this lab.
We will again use the simpler version that looked like this:
#!/bin/bash
set -ueo pipefail
pandoc --template template.html index.md >index.html
pandoc --template template.html rules.md >rules.html
./table.py <score.csv | pandoc --template template.html --metadata title="Score" - >score.html
Notice that for index and rules, there are Markdown files to generate HTML from. Page score is generated from a CSV data file.
Setup
Please create a fork of the web repository so that you can try the examples yourself (we will reuse this repository in one of the next labs, so do not remove it yet).
Motivation for using build systems
In our running example, the whole website is built in several steps where HTML pages are generated from different sources. That is actually very similar to how software is built from sources (consider sources in the C language that are compiled and linked together).
While the above steps do not build an executable from sources (as is the typical case for software development), they represent a typical scenario.
Building software usually consists of many steps that can include actions as different as:
- compiling source files to some intermediate format
- linking the final executable
- creating bitmap graphics in different resolutions from a single vector image
- generating source-code documentation
- preparing localization files with translation
- creating a self-extracting archive
- deploying the software on a web server
- publishing an artefact in a package repository
- …
Almost all of them are simple by themselves. What is complex is their orchestration. That is, how to run them in the correct order and with the right options (parameters).
For example, before an installer can be prepared, all other files have to be ready. Localization files often depend on precompilation of some sources but have to be prepared before the final executable is linked. And so on.
Even for small-size projects, the number of steps can be quite high, yet they are – in a sense – unimportant: you do not want to remember them, you want to build the whole thing!
Note that your IDE can often help you with all of this – with a single click. But not everybody uses the same IDE and you may not even have a graphical interface at all.
Furthermore, you typically want to run the build as part of each commit – the GitLab pipelines we use for tests are a typical example: they execute without a GUI, yet we want to build the software (and test it too). Codifying the process in a build script simplifies this for virtually everyone.
Our build.sh script mentioned above is actually pretty nice. It is easy to understand, contains no complex logic, and a new member of the team does not need to investigate all the tiny details and can just run the single build.sh script.
The script is nice but it overwrites all files even if there was no change. In our small example, it is no big deal (you have a fast computer, after all).
But in a bigger project where we, for example, compile thousands of files (e.g., look at the source tree of the Linux kernel, Firefox, or LibreOffice), it matters. If an input file has not changed (e.g., we modified only rules.md), we do not need to regenerate the other files (e.g., we do not need to re-create index.html).
Let’s extend our script a bit.
...
should_generate() {
    local barename="$1"

    # File does not exist ... we should generate it
    if ! [ -e "${barename}.html" ]; then
        return 0
    fi

    # Markdown is newer than HTML ... we should regenerate it
    if [ "${barename}.md" -nt "${barename}.html" ]; then
        return 0
    else
        return 1
    fi
}
...
should_generate index && pandoc --template template.html index.md >index.html
should_generate rules && pandoc --template template.html rules.md >rules.html
...
We can do that for every command to speed-up the web generation.
But.
That is a lot of work. And the time saved would probably all be wasted on rewriting our script. Not to mention the fact that the result looks horrible. And it is rather expensive to maintain.
Also, we often need to build just a part of the project: e.g., regenerate the documentation only (without publishing the result, for example). Although extending the script in the following way is possible, it certainly is not viable for large projects.
if [ -z "${1:-}" ]; then
... # build here
elif [ "${1:-}" = "clean" ]; then
rm -f index.html rules.html score.html
elif [ "${1:-}" = "publish" ]; then
cp index.html rules.html score.html /var/www/web-page/
else
...
Luckily, there is a better way.
There are special tools, usually called build systems, that have a single purpose: to orchestrate the build process. They provide the user with a high-level language for capturing the above-mentioned steps for building software.
In this lab, we will focus on make. make is a relatively old build system, but it is still widely used. It is also one of the simplest tools available: you need to specify most of the things manually, but that is great for learning. You will have full control over the process and you will see what is happening behind the scenes.
make
Move into the root directory of (the local clone of your fork of) the web example repository first, please.
The files in this directory are virtually the same as in our shell script above, but there is one extra file: Makefile. Notice that Makefile is written with a capital M to be easily distinguishable (ls in a non-localized setup sorts uppercase letters first).
This file is a control file for a build system called make that does exactly what we tried to imitate in the previous example. It contains a sequence of rules for building files.
We will get to the exact syntax of the rules soon, but let us play with them first. Execute the following command:
make
You will see the following output (if you have executed some of the commands manually, the output may differ):
pandoc --template template.html index.md >index.html
pandoc --template template.html rules.md >rules.html
make prints the commands as it executes them. It has built the website for us: notice that the HTML files were generated.
For now, we do not generate the version.inc.html
file at all.
Execute make
again.
make: Nothing to be done for 'all'.
As you can see, make
was smart enough to recognize that since
no file was changed, there is no need to run anything.
Update index.md
(touch index.md
would work too) and run make
again.
Notice how index.html
was rebuilt while rules.html
remained
untouched.
pandoc --template template.html index.md >index.html
This is called an incremental build (we build only what was needed instead of building everything from scratch).
As we mentioned above: this is not much interesting in our tiny example. However, once there are thousands of input files, the difference is enormous.
It is also possible to execute make index.html
to ask for rebuilding
just index.html
. Again, the build is incremental.
If you wish to force a rebuild, execute make
with -B
.
Often, this is called an unconditional build.
Normally, make rebuilds only the things that need rebuilding and, more interestingly, it takes care of dependencies. For example, if scores.html is generated from scores.md that is built from scores.csv, we only need to specify how to build scores.md from scores.csv and how to create scores.html from scores.md, and make will ensure the proper ordering. A sketch of such rules is shown below.
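A minimal sketch of such a two-rule chain (the file names follow the prose above; we assume table.py converts the CSV to Markdown, as in our build.sh):

scores.html: scores.md template.html
	pandoc --template template.html scores.md >scores.html

scores.md: scores.csv table.py
	./table.py <scores.csv >scores.md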
Makefile explained
Makefile is a control file for the build system named make. In essence, it is a domain-specific language to simplify setting up the script with the should_generate constructs we mentioned above.
Note that a Makefile has a rather strict format: the commands inside a rule have to be indented with a tab character. (Usually, your editor will recognize that Makefile is a special file name and switch to a tabs-only policy by itself.) If you use spaces instead, you will typically get an error like Makefile:LINE_NUMBER: *** missing separator. Stop.
The Makefile contains a sequence of rules. A rule looks like this:
index.html: index.md template.html
pandoc --template template.html index.md >index.html
The name before the colon is the target of the rule.
That is usually a file name that we want to build.
Here, it is index.html
.
The rest of the first line is the list of dependencies – files from
which the target is built.
In our example, the dependencies are index.md
and template.html
.
In other words: when these files (index.md
and template.html
) are modified
we need to rebuild index.html
.
The third part consists of the following lines, which have to be indented with a tab.
They contain the commands that have to be executed for the target to be built.
Here, it is the call to pandoc
.
make
runs the commands if the target is out of date. That is, either the
target file is missing, or one or more dependencies are newer than the target.
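To see which commands make would run without actually executing them, use the dry-run mode:

make -n    # a.k.a. --dry-run: only print the commands that would be executed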
The rest of the Makefile
is similar.
There are rules for other files and also several special rules.
Special rules
The special rules are all
, clean
, and .PHONY
.
They do not specify files to be built, but rather special actions.
all
is a traditional name for the very first rule in the file.
It is called a default rule and it is built if you run make
with
no arguments. It usually has no commands and it depends on all files
which should be built by default.
clean
is a special rule that has only commands, but no dependencies.
Its purpose is to remove all generated files if you want to clean up
your work space.
Typically, clean removes all files that are not versioned (i.e., not under Git control).
This can be considered misuse of make
, but one with a long tradition.
From the point of view of make
, the targets all
and clean
are
still treated as file names. If you create a file called clean
, the
special rule will stop working, because the target will be considered
up to date (it exists and no dependency is newer).
To avoid this trap, you should explicitly tell make
that the target is not
a file. This is done by listing it as a dependency of the special target
.PHONY
(note the leading dot).
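In our example, marking the special targets as phony could look like this (a minimal sketch; the file list mirrors our website):

.PHONY: all clean

all: index.html rules.html score.html

clean:
	rm -f index.html rules.html score.html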
Generally, you can see that make has plenty of idiosyncrasies.
It is often so with programs which started as a simple tool and underwent
40 years of incremental development, slowly accruing features. Still,
it is one of the most frequently used build systems. Also, it often serves
as a back-end for more advanced tools – they generate a Makefile
from a more friendly specification and let make
do the actual work.
Exercise
Improving the maintainability of the Makefile
The Makefile starts to have too much repeated code. But make can help you with that too.
Let’s remove all the rules for generating out/*.html
from *.md
and replace them with:
out/%.html: %.md template.html
pandoc --template template.html -o $@ $<
That is a pattern rule that captures the idea that HTML is generated from Markdown. The percent sign in the dependency and target specifications represents the so-called stem – the variable (i.e., changing) part of the pattern.
In the command part, we use make variables. make variables start with a dollar sign as in the shell, but they are not the same. $@ is the actual target and $< is the first dependency.
Run make clean && make
to verify that even with pattern rules,
the web is still generated.
Apart from pattern rules, make
also understands (user) variables.
They can improve readability as you can separate configuration from
commands. For example:
PAGES = \
out/index.html \
out/rules.html \
out/score.html
all: $(PAGES) ...
...
Note that unlike in the shell, variables are expanded by the $(VAR)
construct. (Except for the special variables such as $<
.)
Non-portable extensions
make
is a very old tool that exists in many different implementations.
The features mentioned so far should work with any version of make
.
(At least a reasonably recent one. Old make
s did not have .PHONY
or pattern rules.)
The last addition will work in GNU make only (but that is the default on Linux, so there should not be any problem).
We will change the Makefile
as follows:
PAGES = \
index \
rules \
score
PAGES_TMP=$(addsuffix .html, $(PAGES))
PAGES_HTML=$(addprefix out/, $(PAGES_TMP))
We keep only the basename of each page and we compute the output
path. $(addsuffix ...)
and $(addprefix ...)
are calls to built-in
functions. Formally, all function arguments are strings, but in this case,
comma-separated names are treated as a list.
Note that we added PAGES_TMP only to improve readability when using this feature for the first time. Normally, you would assign PAGES_HTML directly like this:
PAGES_HTML=$(addprefix out/, $(addsuffix .html, $(PAGES)))
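To check what the functions computed, GNU make can print a value while parsing the Makefile via its $(info ...) function (again a GNU extension); place a line like this anywhere in the Makefile:

$(info PAGES_HTML is $(PAGES_HTML))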
This will prove even more useful when we want to generate a PDF for each page, too.
We can add a pattern rule and build the list of PDFs using $(addsuffix .pdf, $(PAGES)), as sketched below.
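A hedged sketch of that extension (assuming pandoc on your machine can produce PDFs, which requires a PDF engine such as LaTeX to be installed):

PAGES_PDF = $(addprefix out/, $(addsuffix .pdf, $(PAGES)))

out/%.pdf: %.md
	pandoc -o $@ $<

all: $(PAGES_HTML) $(PAGES_PDF)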
Tasks to check your understanding
We expect you will solve the following tasks before attending the labs so that we can discuss your solutions during the lab.
Learning outcomes and after class checklist
This section offers a condensed view of fundamental concepts and skills that you should be able to explain and/or use after each lesson. They also represent the bare minimum required for understanding subsequent labs (and other courses as well).
Conceptual knowledge
Conceptual knowledge is about understanding the meaning and context of given terms and putting them into context. Therefore, you should be able to …
- name several steps that are often required to create distributable software (e.g. a package or an installer) from source code and other basic artifacts
- explain why a software build should be a reproducible process
- explain how it is possible to capture a software build
- explain concepts of languages that are used for capturing the steps needed for a software build (distribution)
- explain what a regular expression (regex) is
Practical skills
Practical skills are usually about usage of given programs to solve various tasks. Therefore, you should be able to …
- build a make-based project with default settings
- create a Makefile that drives the build of a simple project
- use wildcard rules in a Makefile
- optional: use variables in a Makefile
- optional: use basic GNU extensions to simplify complex Makefiles
- create and use simple regular expressions to filter text with grep
- perform pattern substitution using sed