The goal of this lab is to define and understand in depth what standard input (stdin), standard output (stdout), and standard error output (stderr) are. That will allow us to understand how input and output redirection (I/O redirection) works and how programs can be connected using pipes (the literal Czech equivalent of “pipe” is “roura”, but that term is almost never used in practice). We will also customize the behavior of our shell: we will explore how aliases and .bashrc work.
Redirection in practice
Prepare files one.txt and two.txt containing the words ONE and TWO, respectively, using echo and stdout redirection.
Answer.
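One possible solution (a minimal sketch):
echo ONE >one.txt
echo TWO >two.txt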
Merge (concatenate) these two files into merged.txt.
Answer.
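For example, cat together with output redirection can do the job (a sketch):
cat one.txt two.txt >merged.txt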
Appending to the end of a file
The shell also offers an option to append the output to an existing file using the >> operator. Thus, the following command would add UNO as another line into one.txt.
echo UNO >>one.txt
If the file does not exist, it will be created.
For the following example, we will need the program tac, which reverses the order of individual lines but otherwise works like cat. Try this first.
tac one.txt two.txt
If you have executed the commands above, you should see the following:
UNO
ONE
TWO
Try the following and explain what happens (and why) if you execute
tac one.txt two.txt >two.txt
Answer.
Input redirection
Copy the rev program from above and run it like this:
./rev.py <one.txt
./rev.py one.txt
./rev.py one.txt two.txt
./rev.py one.txt <two.txt
Has it behaved as you expected?
Trace which paths (i.e. through which lines) the program has taken with the above invocations.
Redirecting standard error output
To redirect the standard error output, you can use > again, but this time preceded by the number 2 (which denotes the stderr file descriptor). Hence, our cat example can be transformed into the following form, where err.txt would contain the error message and nothing would be printed on the screen.
cat one.txt nonexistent.txt two.txt >merged.txt 2>err.txt
Important special files
We already mentioned several important files under /dev/. With output redirection, we can actually use some of them right away.
Run cat one.txt and redirect the output to /dev/full and then to /dev/null. What happened?
Especially /dev/null is a very useful file, as it can be used in any situation where we are not interested in the output of a program.
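For example, when we only care whether a command succeeds and not about what it prints, we can discard its output (a small sketch):
ls /etc >/dev/null
ls /etc >/dev/null 2>/dev/null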
For many programs, you can specify the use of stdin explicitly by using - (dash) as the input filename.
Another option is to use /dev/stdin explicitly: with this name, we can make the example with rev work:
./rev.py /dev/stdin one.txt <two.txt
Python then opens /dev/stdin as a regular file, and the operating system (together with the shell) actually connects it with two.txt.
/dev/stdout can be used if we want to specify the standard output explicitly (this is mostly useful for programs coming from other environments where the emphasis is not on using stdout that much).
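For example, dd normally expects an output file given via of=, but we can point it at /dev/stdout and keep it in a pipeline (a sketch; dd prints its statistics to stderr, hence the redirection):
dd if=one.txt of=/dev/stdout 2>/dev/null | tac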
General redirection
Shell allows us to redirect outputs quite freely using file descriptor numbers before and after the greater-than sign.
For example, >&2 specifies that the standard output is redirected to the standard error output.
That may sound weird but consider the following mini-script.
Here, wget is used to fetch a file from the given URL.
echo "Downloading tarball for lab 02..." >&2
wget https://d3s.mff.cuni.cz/f/teaching/nswi177/202122/labs/nswi177-lab02.tar.gz 2>/dev/null
We actually want to hide the progress messages of wget and print ours instead.
Take this as an illustration of the concept, as wget can be silenced via command-line arguments (--quiet) as well.
Sometimes, we want to redirect stdout and stderr to one single file.
In these situations, a simple >output.txt 2>output.txt would not work and we have to use >output.txt 2>&1 or &>output.txt (to redirect both at once).
However, what about 2>&1 >output.txt? Can we use it as well?
Try it yourself!
Hint.
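A small experiment you can try (a sketch: compare what ends up in out.txt and what appears on the terminal in each case):
cat one.txt nonexistent.txt >out.txt 2>&1
cat one.txt nonexistent.txt 2>&1 >out.txt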
Pipes (data streaming composition)
We finally move to the area where Linux excels: program composition. In essence, the whole idea behind the Unix family of operating systems is to allow easy composition of various small programs together.
Mostly, the programs that are composed together are filters, and they operate on text inputs. These programs do not make any assumptions about the text format and are very generic. Special tools (that are nevertheless part of Linux software repositories) are needed if the input is more structured, such as XML or JSON.
The advantage is that composing the programs is very easy and it is very easy to compose them incrementally too (i.e., add another filter only when the output from the previous ones looks reasonable). This kind of incremental composition is more difficult in normal languages where printing data requires extra commands (here it is printed to the stdout without any extra work).
The disadvantage is that complex compositions can become difficult to read. It is up to the developer to decide when it is time to switch to a better language and process the data there. A typical division of labour is that shell scripts are used to preprocess the data: they are best when you need to combine data from multiple files (such as hundreds of various reports, etc.) or when the data needs to be converted to a reasonable format (e.g. non-structured logs from your web server into a CSV loadable into your favorite spreadsheet software or R). Computing statistics and similar tasks are best left to specialized tools.
Needless to say, Linux offers plenty of tools for statistical computations or plot-drawing utilities that can be controlled from the CLI. Mastering these tools is, unfortunately, out of scope for this course.
Motivating example
As a somewhat artificial example, we will consider the following CSV that can be downloaded from here.
These are actual data representing how long it took to copy the USB disk image to the USB drives in the library. The first column represents the device, the second the duration of the copying.
As a matter of fact, the first column also indirectly represents the port of the USB hub (this is more by accident, but it stems from the way we organized the copying). As a side note: it is interesting to see that some ports that are supposed to be the same are actually systematically slower.
disk,duration
/dev/sdb,1008
/dev/sdb,1676
/dev/sdc,1505
/dev/sdc,4115
...
We want to know what was the longest duration of the copying: in other words, the maximum of column two.
Well, we could use spreadsheet software for that, but we prefer to stay in the terminal. Among other reasons, we want a solution which is easily repeatable with other input files.
Recall that you have already seen the cut command that is able to extract specific columns from a file. There is also the command sort that sorts lines.
Thus our little script could look like this:
#!/bin/bash
cut -d, -f 2 <disk-speeds-data.csv >/tmp/disk_numbers.txt
sort </tmp/disk_numbers.txt
Prepare this script and run it.
The output is far from perfect: sort has sorted the lines alphabetically, not by numeric values. However, a quick glance at man sort later, we add -n (a.k.a. --numeric-sort) and re-execute the script.
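The updated script could then look like this (still using the temporary file; a sketch):
#!/bin/bash

cut -d, -f 2 <disk-speeds-data.csv >/tmp/disk_numbers.txt
sort -n </tmp/disk_numbers.txt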
This time, the last line of the output shows the maximum duration of 5769 seconds. Of course, all the other lines are useless, but we will fix that in a minute.
Let us focus on the temporary file first. There are two issues with it:
First of all, it requires disk space for another copy of the (possibly huge) data. A more subtle but much more dangerous problem is that the path to the temporary file is fixed. Imagine what happens if you execute the script in two terminals concurrently. Do not be fooled by the feeling that the script is so short that the probability of concurrent execution is negligible. It is a trap that is waiting to spring. We will talk about the proper use of mktemp(1) later, but in this example no temporary file is needed at all. We can write:
cut -d, -f 2 <disk-speeds-data.csv | sort
The | symbol stands for a pipe, which connects the standard output of cut to the standard input of sort. The pipe passes data between the two processes without writing them to the disk at all. (Technically, the data are passed using memory buffers, but that is a technical detail.)
The result is the same, but we escaped the pitfalls of using temporary files and the result is actually even more readable. You can even move the first < before cut, so that the script can be read left-to-right like “take disk-speeds-data.csv, extract the second column, and then sort it”:
<disk-speeds-data.csv cut -d, -f 2 | sort
In essence, the family of Unix systems is built on top of the ability to create pipelines, which chain a sequence of programs using pipes. Each program in the pipeline denotes a type of transformation. These transformations are composed together to produce the final result.
Finally, let us recall that we wanted to print only the biggest number. We can use the tail utility, which prints only the last few lines of a file: by default 10, but you can ask for just one by adding -n 1. As pipelines are not limited to two programs, we can simply write:
cut '-d,' -f 2 | sort -n | tail -n 1
Note that we have removed the path to the input file from the script. Now, the user is supposed to run it like:
get-slowest.sh <disk-speeds-data.csv
This actually makes the script more flexible: it is easy to test such a script with different inputs and the script can be again used as a part of a bigger pipeline.
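The whole get-slowest.sh script could therefore look like this (a sketch):
#!/bin/bash

cut -d, -f 2 | sort -n | tail -n 1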
Using && and || (logical program composition)
Execute the following commands:
ls / && echo "ls okay"
ls /nonexistent-filename || echo "ls failed"
This is an example of how return codes can be used in practice. We can chain commands so that the next one is executed only when the previous one failed or, conversely, only when it terminated with a zero exit code.
Understanding the following is essential, because together with pipes and standard I/O redirection, it forms the basic building blocks of shell scripts.
First of all, we will introduce a syntax for conditional chaining of program calls.
If we want to execute one command only if the previous one succeeded, we separate them with && (i.e., it is a logical and). On the other hand, if we want to execute the second command only if the first one fails (in other words, execute the first or the second), we separate them with ||.
The example with ls is quite artificial, as ls is quite noisy when an error occurs. However, there is also a program called test that is silent and can be used to compare numbers or check file properties. For example, test -d ~/Desktop checks that ~/Desktop is a directory. If you run it, nothing will be printed. However, in combination with && or ||, we can check its result.
test -d .git && echo "We are in a root of a Git project"
test -f README.md || echo "README.md missing"
This could be used as a very primitive branching in our scripts. In the next lab, we will introduce proper conditional statements, such as if and while.
Note that test is actually a very powerful command: it does not print anything but can be used to control other programs.
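A few more examples of what test can check (a sketch; see man test for the full list of supported checks):
test 10 -lt 20 && echo "10 is less than 20"
test -x rev.py || echo "rev.py is not executable"
test -s err.txt && echo "err.txt is not empty"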
It is possible to chain more commands: && and || are left-associative and they have the same priority.
Compare the following commands and how they behave in a directory where the file README.md is or is not present:
test -f README.md || echo "README.md missing" && echo "We have README.md"
test -f README.md && echo "We have README.md" || echo "README.md missing"
Failing fast
There is a caveat regarding pipes and the success of commands: the success of a pipeline is determined by its last command. Thus, sort /nonexistent | head is a successful command. To make a failure of any command fail the (whole) pipeline, you need to run set -o pipefail in your script (or shell) before the pipeline.
Compare the behavior of the following two snippets.
sort /nonexistent | head && echo "All is well"
set -o pipefail
sort /nonexistent | head && echo "All is well"
In most cases, you want the second behavior.
Actually, you typically want the whole script to terminate if there is an unexpected failure. This means a failure that was not tested by the && or || operator (or by one of the conditional statements we will meet in the next lab), just like an uncaught exception in Python.
For example, the following compound command is successful even though one of its components failed:
cat /nonexistent || echo "Oh well"
To enable terminate-on-failure, you need to call set -e. In case of failure, the shell will stop executing the script and exit with the same exit code as the failed command.
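A minimal illustration (a sketch):
#!/bin/bash
set -e
cat /nonexistent
echo "This line is never executed"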
Furthermore, you usually want to terminate the script when an uninitialized variable is used: that is enabled by set -u. (We will talk about variables later.)
Therefore, typically, you want to start your script with the following trio:
set -o pipefail
set -e
set -u
Many commands allow short options (such as -l or -h you know from ls) to be merged like this (note that -o pipefail has to be last):
set -ueo pipefail
Get into a habit where each of your scripts starts with this command.
Actually, from now on, the GitLab pipeline will check that this command is a part of your scripts.
Shell customization
We already mentioned that you should customize your terminal emulator to be comfortable to use. After all, you will spend at least this semester with it and it should be fun to use.
In this lab, we will show some other options how to make your shell more comfortable to use.
Command aliases
You probably noticed that you execute some commands with the same options a lot. One such example could be ls -l -h that prints a detailed file listing, using human-readable sizes. Or perhaps ls -F to append a slash to the directories. And probably ls --color too.
The shell offers so-called aliases, with which you can easily add new commands without creating full-fledged scripts somewhere.
Try executing the following commands to see how a new command l could be defined.
alias l='ls -l -h'
l
We can even override the original command; the shell will ensure that the rewriting is not recursive.
alias ls='ls -F --color=auto'
Note that these two aliases together also ensure that l will display filenames in colors.
There are no spaces around the equal sign.
Some typical aliases that you will probably want to try are the following ones. Use a manual page if you are unsure what the alias does. Note that curl is used to retrieve contents from a URL and wttr.in is really a URL. By the way, try that command even if you do not plan to use this alias :-).
alias ls='ls -F --color=auto'
alias ll='ls -l'
alias l='ls -l -h'
alias cp='cp -i'
alias mv='mv -i'
alias rm='rm -i'
alias man='man -a'
alias weather='curl wttr.in'
~/.bashrc
The aliases above are nice, but you probably do not want to define them each time you launch the shell. However, most shells in Linux have some kind of file that they execute before they enter interactive mode. Typically, the file resides directly in your home directory and it is named after the shell, ending with rc (you can remember it as runtime configuration).
For Bash, which we are using now (if you are using a different shell, you probably already know where to find its configuration files), that file is called ~/.bashrc.
You have already used it when setting EDITOR for Git, but you can also add aliases there. Depending on your distribution, you may already see some aliases or some other commands there.
Add aliases you like there, save the file and launch a new terminal. Check that the aliases work.
The .bashrc file behaves as a shell script and you are not limited to having only aliases there. It can contain virtually any commands that you want to execute in every terminal that you launch.
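For example, a fragment of ~/.bashrc could look like this (a sketch; the particular editor and aliases are only illustrative):
# Preferred editor for Git and other tools
export EDITOR=mcedit
# Shorthands for everyday commands
alias ls='ls -F --color=auto'
alias ll='ls -l'
alias weather='curl wttr.in'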
More examples
The following examples can be solved either by executing multiple commands or by piping basic shell commands together. To help you find the right program, you can use manual pages. You can also use our manual as a starting point.
Create a directory a and inside it create a text file --help containing Lorem Ipsum. Print the content of this file and then delete it.
Solution.
Create a directory called b and inside it create files called alpha.txt and *. Then delete the file called * and check what happened to the file alpha.txt.
Solution.
Print the content of the file /etc/passwd with its lines sorted.
Solution.
The command getent passwd USERNAME prints the information about the user account USERNAME (e.g., intro) on your machine. Write a command that prints information about user intro or the message This is not NSWI177 disk if the user does not exist.
Solution.
Print the first and third column of the file /etc/group.
Solution.
Count the lines of the file /etc/services.
Solution.
Print the last two lines of the files /etc/passwd and /etc/group using a single command.
Solution.
Recall the file disk-speeds-data.csv with the disk copying durations.
Compute the sum of all durations.
Solution.
Assume a file in the following format.
Alpha 8 4 5 0
Bravo 12 5 3 2
Charlie 1 0 11 4
Append to each row the sum of its line. You do not need to keep the original alignment (i.e., feel free to squeeze the spaces). Hint. Solution.
Print information about the last commit. When the script is executed in a directory that is not part of any Git project, the script shall print only Not inside a Git repository.
Hint. Solution.
Print the contents of /etc/passwd and /etc/group separated by the text Ha ha ha (i.e., the contents of /etc/passwd, a line with Ha ha ha, and the contents of /etc/group).
Solution.
Graded tasks (deadline: March 20)
Do not forget to set the executable bit correctly and to include the shebang.
IMPORTANT: all of these tasks must be solved using only pipes and && / || composition. Use standard shell programs; do not use shell if or while constructs (the goal of these tasks is to test your knowledge of Linux filters).
04/override.sh (30 points)
The script prints to stdout the contents of the file HEADER (in the working directory).
However, if the directory contains the file .NO_HEADER, nothing is printed (even if HEADER exists).
If neither of the files exists, the program prints Error: HEADER not found. to the standard error output and terminates with exit code 1.
Otherwise, the script terminates successfully.
UPDATE: You may check for the existence of the files multiple times, and you may assume that the files do not change while your script is running. We also found a minor bug in our tests; please double-check that your solution still passes.
04/second_highest_uid.sh (30 points)
Write a script that reads from its standard input a file formatted like passwd and prints the second highest numerical user ID.
The format of the file is described in the fifth section of the passwd manual pages.
For testing, you can feed your /etc/passwd to the script. Our tests will use artificially created data to test your solution thoroughly.
You may assume that the IDs are unique and that the file will always contain at least two entries.
04/row_sum.sh (40 points)
Assume a matrix written in this “pretty” format. You can rely on the format being fixed (with respect to the spaces, numbers having at most three digits, and the pipe symbol), but the number of rows and columns may differ.
Write a script that sums the numbers in each row.
For the following matrix, we expect this output.
| 106 179 |
| 188 50 |
| 5 125 |
285
238
130
The script reads its input from stdin; the number of columns and rows is not limited in any way (apart from the overall format).
Learning outcomes
Conceptual knowledge
Conceptual knowledge means that you understand the meaning and context of the given topic and are able to place the topics into a bigger picture. So, you are able to …
- explain what standard output and standard input are
- explain why redirecting the standard input/output is not (directly) visible inside the program
- explain why the standard error output is separate from the standard output
- explain the difference between cat foo.txt and cat <foo.txt
- explain how multiple programs using stdio can be composed together
- explain what a program exit code is and how it can be used
- explain the differences and typical uses of the five main interfaces available to a CLI program: arguments, stdin, stdout, stderr, and the exit code
- explain what a file descriptor is (from the point of view of the application, not of the OS/kernel) (optional)
Practical skills
Practical skills usually concern the use of particular programs to solve various tasks. So, you are able to …
- redirect the standard output and input of CLI programs
- use the special file /dev/null
- use standard input and output in Python
- use pipes to compose programs
- compose programs using && and || in shell scripts
- use basic filters such as cut
- set the exit code of a Python script
- customize the shell using aliases (optional)
- customize the shell configuration via the .bashrc and .profile scripts (optional)