Lab #4 | Labs | NSWI177

Information below is not for the current semester. The current semester can be found here.

The goal of this lab is to define and thoroughly understand the concepts of standard input, standard output, and standard error output. This will allow us to understand program I/O redirection and the composition of different programs via pipes. We will also customize our shell environment a little by investigating command aliases and the .bashrc file.

Running example

We will build this lab around a single example that we will incrementally develop, so that you learn the basic concepts on a practical example (obviously, there are specific tools that could be used instead, but we hope that this is better than a completely artificial example).

Data for our example can be downloaded (i.e., git cloned) from this repository where they reside in the 04/ subdirectory.

They simulate simplified logs from a web server, where the web server records which files (URLs) were accessed at which time.

Practically, each file represents traffic for one day in a simplified CSV format.

Fields are separated by a comma, there is no header, and for each record we remember the date, the client’s IP address, the URL that was requested, and the amount of transferred bytes.

In reality, the data would be also compressed and would probably contain more details about the client (e.g., the browser used), but otherwise the data recorded represent a fairly typical web server log format.

Our task is to write a program that prints a brief summary of the data:

Print 3 most accessed URLs.
Print 3 days with the highest volume of traffic (i.e., the sum of transferred bytes).
Print total amount of data transferred.

Before we build the solution we need to lay some groundwork.

Standard input and outputs

We will start the lab with a few definitions of concepts that you probably already know (but maybe not under exactly these names).

Standard output

Standard output (often shortened to stdout) is the default output that you can use by calling print("Hello") if you are in Python, for example. Stdout is used by the basic output routines in almost every programming language.

Generally, this output has the same API as if you were writing to a file, be it print in Python, System.out.print in Java, or printf in C (where the limitations of the language necessitate the existence of a pair of printf and fprintf).

This output is usually prepared by the language runtime together with the shell and the operating system (the technical details are not that important for this course anyway). Practically, the standard output is printed to the terminal or its equivalent (and when the application is launched graphically, stdout is typically lost).

Note that in Python you can access it explicitly via sys.stdout that acts as an opened file handle (i.e., result of open).

Standard input

Similarly to stdout, almost all languages have access to stdin that represents the default input. By default, this input comes from the keyboard, although usually through the terminal (i.e., stdin is not used in graphical applications for reading keyboard input).

Note that the function input() that you may have used in your Python programs is an upgrade on top of stdin because it offers basic editing functions. Plain standard input does not support any form of editing (though typically you could use backspace to erase characters at the end of the line).

If you want to access the standard input in Python, you need to use sys.stdin explicitly. As one could expect, it uses a file API, hence it is possible to read a line from it calling .readline() on it or to iterate through all lines.

In fact, the iteration of the following form is a quite common pattern for many Linux utilities (they are usually written in C but the pattern remains the same).

for line in sys.stdin:
    ...

Note that the above pattern actually works for any opened text file in Python and it is the preferred way to read a textual file.
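As a minimal sketch of a filter built around this loop, the following program keeps only the text before the first colon on each line. The function names first_column and run_filter are our own (hypothetical) choice, and the demonstration feeds the filter from an in-memory file instead of the real sys.stdin so that it can be run without typing anything.

```python
import io
import sys


def first_column(lines, delimiter=":"):
    """Yield the first delimiter-separated field of each input line."""
    for line in lines:
        # Drop the trailing newline, then keep only the text before
        # the first delimiter (the whole line if there is none).
        yield line.rstrip("\n").split(delimiter, 1)[0]


def run_filter(inp=sys.stdin, out=sys.stdout):
    """Copy the first column of every line from inp to out."""
    for field in first_column(inp):
        print(field, file=out)


# Demonstration on an in-memory "file"; a real filter would
# simply call run_filter() and read from the standard input.
demo_input = io.StringIO("one:two\nalpha:bravo\nuno:dos\n")
demo_output = io.StringIO()
run_filter(demo_input, demo_output)
print(demo_output.getvalue(), end="")
```

Because the loop only needs something that can be iterated line by line, the same function works for sys.stdin, for an opened file, and for the in-memory object used here.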

Many of the utilities actually read from stdin by default. For example, cut -d : -f 1 prints only the first column of data of each line (and expects the columns to be delimited by :).

Run it and type the following on the keyboard, terminating each line with <Enter>.

cut -d : -f 1

one:two
alpha:bravo
uno:dos

You should see the first column echoed underneath your input.

What to do when you are done? Typing exit will not help here but <Ctrl>-D works.

Pressing <Ctrl>-D on an empty line will close the standard input. The program cut will realize that there is no more input to process and will terminate gracefully. Note that this is different from <Ctrl>-C, which forcefully kills the running process. From the user’s perspective, the two look similar in the context of cut, but their semantics differ in important ways (which can be observed when using other tools).

Standard I/O redirection

As a technical detail, we mentioned earlier that the standard input and output are prepared (partially) by the operating system. This also means that it can be changed (i.e., initialized differently) without changing the program. And the program may not even “know” about it.

This is called redirection and it allows the user to specify that the standard output would not go to the screen (terminal), but rather to a file. From the point of view of the program, the API is still the same.

This redirection has to be done before the program is started and it has to be done by the caller. For us, it means we have to do it in the shell.

It is very simple: at the end of the command we can specify > output.txt and everything that would be normally printed on a screen goes to output.txt.

Before you start experimenting: the output redirection is a low-level operation and has no form of undo. Therefore, if the file you redirect to already exists, it will be overwritten without questions. And without any easy option to restore the original file content (and for small files, the restoration is technically impossible for most file systems used in Linux).

As a precaution, get into a habit to hit <Tab> after you specify the filename. If the file does not exist, the cursor will not move. If the file already exists, the tab completion routine will insert a space.

As the simplest example, the following two commands will create files one.txt and two.txt with the words ONE and TWO inside (including the new line character at the end).

echo ONE > one.txt
echo TWO >two.txt

Note that the shell is quite flexible in the use of spaces and both options are valid (i.e., one.txt does not have a space as the first character in the filename).

From the implementation point of view, echo received a single argument; the part with > filename is not passed to the program at all (i.e., do not expect to find > filename in your sys.argv).

If you know Python’s popen or a similar call, they also offer the option to specify which file to use for stdout if you want to do a redirection in your program (but only for a new program launched, not inside a running program).
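A minimal sketch of such a redirection done from Python, assuming the standard subprocess module is what is meant by popen here. The child program is just another Python interpreter that prints ONE, and the output path is a hypothetical file in a fresh temporary directory.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical output path inside a fresh temporary directory.
workdir = tempfile.mkdtemp()
output_path = os.path.join(workdir, "one.txt")

# The child is another Python interpreter that simply prints ONE;
# it does not know that its stdout is not a terminal.
child = [sys.executable, "-c", "print('ONE')"]

with open(output_path, "w") as out:
    # Rough equivalent of `... > one.txt` in the shell: the caller
    # opens the file and hands it over as the stdout of the child.
    subprocess.run(child, stdout=out, check=True)

with open(output_path) as inp:
    content = inp.read()
print(repr(content))
```

As with the shell, the redirection is set up by the caller before the child starts; the child only ever writes to its standard output.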

If you recall Lab 02, we mentioned that the program cat is used to concatenate files. With the knowledge of output redirection, it suddenly starts to make more sense as the (merged) output can be easily stored in a file.

cat one.txt two.txt >merged.txt

Appending in output redirection

The shell also offers an option to append the output to an existing file using the >> operator. Thus, the following command would add UNO as another line into one.txt.

echo UNO >>one.txt

If the file does not exist, it will be created.
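The same distinction exists when opening files in a program. The following sketch (the path is a hypothetical file in a temporary directory) uses the Python open modes "w" and "a", which roughly correspond to > and >> respectively.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "one.txt")

# Mode "w" truncates the file first, like `>` in the shell.
with open(path, "w") as out:
    print("ONE", file=out)

# Mode "a" keeps the existing content and adds to the end, like `>>`.
with open(path, "a") as out:
    print("UNO", file=out)

with open(path) as inp:
    lines = inp.read().splitlines()
print(lines)
```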

For the following example, we will need the program tac that reverses the order of individual lines but otherwise works like cat (note that tac is cat but backwards, what a cool name). Try this first.

tac one.txt two.txt

If you have executed the commands above, you should see the following:

UNO
ONE
TWO

Try the following and explain what happens (and why) if you execute

tac one.txt two.txt >two.txt

Input redirection

Similarly, the shell offers < for redirecting stdin. Then, instead of reading input typed by the user on the keyboard, the program reads the input from a file.

Note that programs using Pythonic input() do not work that well with redirected input. Practically, input() is suitable for interactive programs only. You might want to use sys.stdin.readline() or for line in sys.stdin instead.

When input is redirected, we do not need to issue <Ctrl>-D to close the input as the input is closed automatically when reaching the end of the file.

Filters

Many utilities in Linux work as so-called filters. They accept the input from stdin and print their output to stdout.

One such example is cut that can be used to print only certain columns from the input. For example, running it as cut -d : -f 1 with /etc/passwd as its input will display a list of accounts (usernames) on the current machine.

Try to explain the difference between the following two calls:

cut -d : -f 1 </etc/passwd
cut -d : -f 1 /etc/passwd

The above behavior is quite common for most filters: you can specify the input file explicitly, but when it is missing, the program reads from the stdin.

To return to the question above: the difference is that in the first case (with input redirection), the input file is opened by the shell and the opened file is passed to cut. Problems in opening the file are reported by the shell and cut might not be launched at all. In the second case, the file is opened by cut (i.e., cut executes the open() call and also needs to handle errors).

Advancing the running example

Armed with this knowledge, we can actually solve the first part of our running example. Recall that we have files that logged traffic each day and we want to find URLs that are most common in all the files together.

That means we need to join all files together, keep only the URL and find the three most frequent lines.

And we can do that. Recall that cat can be used to concatenate files and cut can be used to keep only certain columns. We will get to finding the most frequent URLs in a while.

So, how about this?

#!/bin/bash

cat logs/20[0-9][0-9]-[01][0-9]-[0-3][0-9].csv >_logs_merged.csv
cut -d , -f 3 <_logs_merged.csv

We have used a quite explicit wildcard to ensure we do not print some random CSVs even though cat logs/*.csv could work as well.

Consider how much time this would take to write in Python.

The script has one big flaw (we will solve it soon but it needs to be mentioned anyway).

The script writes to a file called _logs_merged.csv. We have prefixed the filename with underscore to mark it as somewhat special but still: what if the user created such file manually?

We would overwrite that file, no questions asked. No option to recover.

Never do that in your scripts again.

You may also encounter variant where cut is called as cut -d, -f3. Most programs are smart enough to recognize both variants but it is important to remember that this is something that must be handled by each program.

That is, the program must be able to work with sys.argv[1] == '-d,' and with (sys.argv[1] == '-d') and (sys.argv[2] == ',').
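A minimal sketch of handling both variants by hand. The function parse_delimiter is a hypothetical helper written only for illustration; real programs would normally leave this to an option-parsing library.

```python
def parse_delimiter(args, default="\t"):
    """Return the value of -d from args, accepting '-d,' and '-d ,'."""
    delimiter = default
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-d":
            # Separate form: the value is the next argument.
            if i + 1 >= len(args):
                raise ValueError("option -d requires an argument")
            delimiter = args[i + 1]
            i += 2
        elif arg.startswith("-d"):
            # Joined form: the value follows the option letter directly.
            delimiter = arg[2:]
            i += 1
        else:
            # Some other argument that this sketch does not interpret.
            i += 1
    return delimiter


print(parse_delimiter(["-d,", "-f3"]))
print(parse_delimiter(["-d", ",", "-f", "3"]))
```

Both calls print a single comma. The function would be called with sys.argv[1:] in an actual program.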

Pipes (data streaming composition)

We finally move to the area where Linux excels: program composition. In essence, the whole idea behind Unix-family of operating systems is to allow easy composition of various small programs together.

Mostly, the programs that are composed together are filters and they operate on text inputs. These programs do not make any assumptions on the text format and are very generic. Special tools (that are nevertheless part of Linux software repositories) are needed if the input is more structured, such as XML or JSON.

The advantage is that composing the programs is very easy and it is very easy to compose them incrementally too (i.e., add another filter only when the output from the previous ones looks reasonable). This kind of incremental composition is more difficult in normal languages where printing data requires extra commands (here it is printed to the stdout without any extra work).

The disadvantage is that complex compositions can become difficult to read. It is up to the developer to decide when it is time to switch to a better language and process the data there. A typical division of labour is that shell scripts are used to preprocess the data: they are best when you need to combine data from multiple files (such as hundreds of various reports, etc.) or when the data needs to be converted to a reasonable format (e.g. non-structured logs from your web server into a CSV loadable into your favorite spreadsheet software or R). Computing statistics and similar tasks are best left to specialized tools.

Needless to add, Linux offers plenty of tools for statistical computations and plot-drawing utilities that can be controlled from the command line. Mastering these tools is, unfortunately, outside the scope of this course.

Let us return to the running example again.

We already mentioned that the temporary file we used is bad because we might have overwritten someone else’s data.

But it also requires disk space for another copy of the (possibly huge) data.

A bit more subtle but much more dangerous problem is that the path to the temporary file is fixed. Imagine what happens if you execute the script in two terminals concurrently. Do not be fooled by the feeling that the script is so short that the probability of concurrent execution is negligible. It is a trap that is waiting to spring. We will talk about proper use of mktemp(1) later, but in this example no temporary file is needed at all.

We learned about program composition, right? And we can use it here.

cat logs/20[0-9][0-9]-[01][0-9]-[0-3][0-9].csv | cut -d , -f 3

The | symbol stands for a pipe, which connects the standard output of cat to the standard input of cut. The pipe passes data between the two processes without writing them to the disk at all. (Technically, the data are passed using memory buffers, but that is a technical detail.)

The result is the same, but we escaped the pitfalls of using temporary files and the result is actually even more readable.

For cases when the first command also reads from standard input another syntax is available. For example, this prints a sorted list of local user accounts (usernames).

cut -d : -f 1 </etc/passwd | sort

We can even move the first < before cut, so that the script can be read left-to-right like “take /etc/passwd, extract the first column, and then sort it”:

</etc/passwd cut -d : -f 1 | sort

In essence, the family of Unix systems is built on top of the ability to create pipelines, which chain a sequence of programs using pipes. Each program in the pipeline denotes a type of transformation. These transformations are composed together to produce the final result.

Advancing the running example a bit more

We wanted to print the three most visited URLs first.

Using the pipe above we can print all the URLs in a single list.

To find the most frequently visited ones we will use a typical trick: we first sort the lines alphabetically and then use the program uniq with -c to count unique lines (in effect counting how many times each URL was visited). We then sort this output numerically and print the last 3 lines (the numeric sort is ascending, so the most visited URLs end up at the bottom).

Hence our program will evolve like this (lines starting with # are obviously comments).

# Get all URLs
cat logs/20[0-9][0-9]-[01][0-9]-[0-3][0-9].csv | cut -d , -f 3

# We will make the wildcard shorter to save space
cat logs/*.csv | cut -d , -f 3

# Sort URLs, have same URLs on adjoining lines
cat logs/*.csv | cut -d , -f 3 | sort

# Count number of occurrences (uniq does not sort the file)
cat logs/*.csv | cut -d , -f 3 | sort | uniq -c

# Sort output of uniq numerically
cat logs/*.csv | cut -d , -f 3 | sort | uniq -c | sort -n

# Print last three lines only
cat logs/*.csv | cut -d , -f 3 | sort | uniq -c | sort -n | tail -n 3

Do not be scared. We advanced by little steps on each line. Run the individual commands yourself and watch how the output is transformed.
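For comparison, here is a rough Python sketch of the same computation. It assumes the URL is the third comma-separated field, as in the format description at the beginning of this lab; the function name top_urls and the sample records are made up for illustration.

```python
from collections import Counter


def top_urls(lines, count=3):
    """Return the `count` most frequent URLs as (url, hits) pairs."""
    # The URL is assumed to be the third comma-separated field.
    urls = (line.rstrip("\n").split(",")[2] for line in lines)
    return Counter(urls).most_common(count)


# Made-up records in the date,IP,URL,bytes format.
sample = [
    "2021-03-01,192.0.2.1,/index.html,100\n",
    "2021-03-01,192.0.2.2,/about.html,50\n",
    "2021-03-02,192.0.2.1,/index.html,100\n",
    "2021-03-02,192.0.2.3,/contact.html,20\n",
    "2021-03-02,192.0.2.3,/about.html,50\n",
    "2021-03-03,192.0.2.4,/index.html,100\n",
]
print(top_urls(sample))
```

Note that most_common lists the most frequent URL first, whereas the shell pipeline prints it last. The Python version is not hard to write, but it cannot be grown and inspected one stage at a time the way the pipeline can.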

Exercise

Print the total amount of transferred bytes using the logs from our running example (i.e., the last part of the task).

Hint: you will need cat, cut, paste and bc.

First part should be easy: we are interested only in the last column.

cat logs/*.csv | cut -d , -f 4

To sum lines of numbers we will use paste that is able to merge lines from multiple files or join lines into a single file. We will give it separator of + to create a huge expression SIZE1+SIZE2+SIZE3+....

cat logs/*.csv | cut -d , -f 4 | paste -s -d +

Finally, we will use bc to evaluate the expression, i.e., to compute the sum.

cat logs/*.csv | cut -d , -f 4 | paste -s -d + | bc

bc alone is quite a powerful calculator that can be used interactively too (recall that <Ctrl>-D will terminate the input in interactive mode).
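To see what the last two stages of the pipeline do, here is a small Python sketch. The function names and the sample records are hypothetical; the size is assumed to be the fourth comma-separated field, as in the commands above.

```python
def size_expression(lines):
    """Join the last CSV column with '+', as `paste -s -d +` would."""
    sizes = (line.rstrip("\n").split(",")[3] for line in lines)
    return "+".join(sizes)


def total_bytes(lines):
    """Sum the last CSV column directly, without building an expression."""
    return sum(int(line.rstrip("\n").split(",")[3]) for line in lines)


# Made-up records in the date,IP,URL,bytes format.
sample = [
    "2021-03-01,192.0.2.1,/index.html,100\n",
    "2021-03-01,192.0.2.2,/about.html,50\n",
    "2021-03-02,192.0.2.3,/contact.html,20\n",
]
print(size_expression(sample))
print(total_bytes(sample))
```

The first function shows the text that bc receives on its standard input; the second one computes the value that bc prints for it.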

More examples are provided at the end of this lab.

Writing your own filters

Let us finish another part of the running example. We want to compute traffic for each day and print days with the most traffic.

Knowing how we composed things so far, we lack only the middle part of the pipeline: summing the sizes for each day.

There is no ready-made solution for this (advanced users might consider installing termsql) but we will create our own in Python and plug it into our pipeline.

We will try to make it simple yet versatile enough.

Recall we want to group the traffic by dates, hence our program should be able to do the following transformation.

# Input
day1 1
day1 2
day2 4
day1 3
day2 1
# Output
day1 6
day2 5

Here is our version of the program. Notice that we have (for now) ignored error handling but allowed the program to be used as a filter in the middle of the pipeline (i.e., read from stdin when no arguments are provided) but also easily usable for multiple files.

In your own filters, you should also follow this approach: the amount of source code you need to write is negligible, but it gives the user flexibility in use.

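#!/usr/bin/env python3

import sys

def sum_file(inp, results):
    """Add the numbers from inp to results, grouped by the first column."""
    for line in inp:
        # Each line is "KEY NUMBER"; int() tolerates the trailing newline.
        (key, number) = line.split(maxsplit=1)
        results[key] = results.get(key, 0) + int(number)

def main():
    sums = {}
    if len(sys.argv) == 1:
        # No arguments: behave as a filter and read from stdin.
        sum_file(sys.stdin, sums)
    else:
        for filename in sys.argv[1:]:
            with open(filename, "r") as inp:
                sum_file(inp, sums)
    # The name "total" avoids shadowing the built-in sum().
    for key, total in sums.items():
        print(f"{key} {total}")

if __name__ == "__main__":
    main()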

With such program in place, we can extend our web statistics script in the following manner.

cat logs/*.csv | cut -d , -f 1,4 | tr ',' ' ' | ./group_sum.py

Use man to find out what tr does.

On your own, extend the solution to print only the top 3 days (sort can also order the lines by a specific column rather than by the whole line).

Standard error output

While it often makes sense to redirect the output, you often want to see error messages still on the screen.

Imagine files one.txt and two.txt exist while nonexistent.txt is not in the directory. We will now execute the following command.

No, do not imagine it. Create the files one.txt and two.txt to contain words ONE and TWO yourself on the command line.

cat one.txt nonexistent.txt two.txt >merged.txt

Obviously, cat prints an error message when the file does not exist. However, if the error message were printed to stdout, it would be redirected to merged.txt together with the actual output. This would not be practical.

Therefore, every Linux program also has a standard error output (often just stderr) that also goes to the screen but is logically different from stdout and is not subject to > redirection.

In Python, it is available as sys.stderr and (like sys.stdout) it is an opened file.

We can extend our implementation to handle I/O errors like this:

try:
    with open(filename, "r") as inp:
        sum_file(inp, sums)
except IOError as e:
    print(f"Error reading file {filename}: {e}", file=sys.stderr)

Under the hood (about file descriptors)

The following text provides an overview of file descriptors, which are abstractions used by the OS and the application when working with opened files. Understanding this concept is not essential for this course but it is a general principle that (to some extent) is present in most operating systems and applications (or programming languages).

Technically, opened files have so-called file descriptors that are used when an application communicates with the operating system (recall that file operations have to be done by the operating system). The file descriptor is an integer that serves as an index in a table of opened files that is kept for each process (i.e., a running instance of a program).

This number — the file descriptor — is then passed to system calls which operate on the opened file. For example, write gets two arguments: an opened file descriptor and a byte buffer to write (in our examples, we will pass the string directly for simplicity). Therefore, when your application calls print("Message", file=some_file), eventually your program would call the operating system as write(3, "Message\n") where 3 denotes the file descriptor for the opened file represented by the some_file handle.

While the above may look like a technical detail, it will help you understand why the standard error redirection looks the way it does, or why file operations in most programming languages require opening the file first before writing to it (i.e., why write_to_file(filename, contents) is never a primitive operation).

In any unix-style environment, the file descriptors 0, 1, and 2 are always used for standard input, standard output, and standard error output, respectively. That is, the call print("Message") in Python eventually ends up in calling write(1, "Message\n") and a call to print("Error", file=sys.stderr) calls write(2, "Error\n").
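The following sketch makes this visible from Python using os.write, which passes a file descriptor number and a byte buffer to the operating system. The helper write_message is a hypothetical name; the pipe is used only to obtain a pair of fresh descriptors that we can safely experiment with.

```python
import os


def write_message(fd, text):
    """Write text plus a newline to the file descriptor fd."""
    data = (text + "\n").encode("utf-8")
    # os.write may write fewer bytes than requested,
    # so we loop until everything has been written.
    while data:
        written = os.write(fd, data)
        data = data[written:]


# A pipe gives us two fresh descriptors: one for reading, one for writing.
read_fd, write_fd = os.pipe()
write_message(write_fd, "Message")
os.close(write_fd)
received = os.read(read_fd, 1024)
os.close(read_fd)

# Descriptors 1 and 2 are stdout and stderr of this very process.
write_message(1, "Message")
write_message(2, "Error")
```

Try running the script with >out.txt and then with 2>err.txt to see which of the last two messages ends up in the file.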

When a new process is started, it obtains these three file descriptors from its caller (e.g., the shell). By default, they point to the terminal, but the caller can simply open them to point to a different file. This is how redirection works.

The fact that stdout and stderr are logically different streams (files) also explains the word probably in one of the examples above. Even though they both end in the same physical device (the terminal), they may use a different configuration: typically, the standard output is buffered, i.e., output of your application goes to the screen only when there is enough of it, while the standard error is not buffered – it is printed immediately. The reason is probably obvious – error messages should be visible as soon as possible, while normal output might be delayed to improve performance.

Note that the buffering policy can be more sophisticated, but the essential takeaway is that any output to stderr is displayed immediately while output to stdout might be delayed.

Advanced I/O redirection

Ensure you have the group_sum.py script available.

Prepare files one.txt and two.txt:

echo ONE 1 > one.txt
echo ONE 1 > two.txt
echo TWO 2 >> two.txt

Now execute the following commands.

./group_sum.py <one.txt
./group_sum.py one.txt
./group_sum.py one.txt two.txt
./group_sum.py one.txt <two.txt

Has it behaved as you expected?

Trace which paths (i.e. through which lines) the program has taken with the above invocations.

Redirecting standard error output

To redirect the standard error output, you can use > again, but this time preceded by the number 2 (that denotes the stderr file descriptor).

Hence, our cat example can be transformed to the following form where err.txt would contain the error message and nothing would be printed on the screen.

cat one.txt nonexistent.txt two.txt >merged.txt 2>err.txt

Redirecting into and inside a script

Consider the following mini-script (first-column.sh) that extracts and sorts the first column (for colon-delimited data such as in /etc/passwd).

#!/bin/bash

cut -d : -f 1 | sort

Then the user can use the script like this, and the standard input of cut will be properly wired to the standard input of the shell running the script, be it a redirected file or a pipe.

cat /etc/passwd | ./first-column.sh
./first-column.sh </etc/passwd
head /etc/passwd | ./first-column.sh | tail -n 3

While the above example is somewhat artificial, it demonstrates the important principle that stdin is naturally available even inside scripts when redirected from the “outside”.

Generic redirection

The shell allows us to redirect outputs quite freely using file descriptor numbers before and after the greater-than sign.

For example, >&2 specifies that the standard output is redirected to a standard error output. That may sound weird but consider the following mini-script.

Here, wget is used to fetch a file from the given URL.

echo "Downloading tarball for lab 02..." >&2
wget https://d3s.mff.cuni.cz/f/teaching/nswi177/202122/labs/nswi177-lab02.tar.gz 2>/dev/null

We actually want to hide the progress messages of wget and print ours instead.

Take this as an illustration of the concept as wget can be silenced via command-line arguments (--quiet) as well.

Sometimes, we want to redirect stdout and stderr to one single file. In these situations, a simple >output.txt 2>output.txt would not work and we have to use >output.txt 2>&1 or &>output.txt (to redirect both at once). However, what about 2>&1 >output.txt, can we use it as well? Try it yourself!
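When launching a program from Python, the same merging can be requested through the subprocess module. This is a minimal sketch in which the child is a short Python snippet of our own that writes one line to each stream.

```python
import subprocess
import sys

# The child writes one line to stdout and one line to stderr.
child_code = (
    "import sys; "
    "print('to stdout'); "
    "print('to stderr', file=sys.stderr)"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # same idea as 2>&1 in the shell
    text=True,
    check=True,
)

# Both lines arrive through the single captured stream. Their relative
# order is not guaranteed because the two streams are buffered differently.
lines = sorted(result.stdout.splitlines())
print(lines)
```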

Notable special files

We already mentioned that virtually everything in Linux is a file. Many special files representing devices are in /dev/ subdirectory.

Some of them are very useful for output redirection.

Run cat one.txt and redirect the output to /dev/full and then to /dev/null. What happened?

Especially /dev/null is a very useful file as it can be used in any situation when we are not interested in the output of a program.

For many programs you can specify the use of stdin explicitly by using - (dash) as the input filename.

Another option is to use /dev/stdin explicitly: with this name, we can make the example with group_sum.py work:

./group_sum.py /dev/stdin one.txt <two.txt

Then Python opens /dev/stdin as a regular file and the operating system (together with the shell) actually connects it with two.txt.

/dev/stdout can be used if we want to specify standard output explicitly (this is mostly useful for programs coming from other environments where the emphasis is not on using stdout that much).
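Supporting the dash convention in your own filter takes only a few lines. The following is a sketch, not a part of group_sum.py; the names open_input and count_lines are hypothetical.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_input(name):
    """Yield sys.stdin for '-', otherwise an opened text file."""
    if name == "-":
        # Do not close stdin here: it belongs to the whole process.
        yield sys.stdin
    else:
        with open(name, "r") as inp:
            yield inp


def count_lines(names):
    """Count lines in all named inputs, where '-' means standard input."""
    total = 0
    for name in names:
        with open_input(name) as inp:
            total += sum(1 for _ in inp)
    return total
```

A program built this way could be called with arguments such as - one.txt to process the standard input first and one.txt second.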

Program return (exit) code

So far, the programs we have used announced errors as messages. That is quite useful for interactive programs as the user wants to know what went wrong.

However, for non-interactive use, checking for error messages is actually very error-prone. Error messages change, the users can have their system localized etc. etc. Therefore, Linux offers a different way of checking whether a program terminated correctly or not.

Whether a program terminates successfully or with a failure, is signalled by its so-called return (or exit) code. This code is an integer and unlike in other programming languages, zero denotes success and any non-zero value denotes an error.

Why do you think that the authors decided that zero (that is traditionally reserved for false) means success and nonzero (traditionally converted to true) means failure? Hint: in how many ways can a program succeed?

Unless specified otherwise, when your program terminates normally (i.e., main reaches the end and no exception is raised), the exit code is zero.

If you want to change this behavior, you need to specify this exit code as a parameter to the exit function. In Python, it is sys.exit.

For C programs, the main function actually returns an int, whose value is the exit code. Use it properly.

The full signature is actually int main(int argc, char *argv[]) so that you can access command-line options as function arguments (most environments will actually allow you to use plain void main(void) but it is not recommended).

As an example, the following is a modification of the group_sum.py above, this time with proper exit code handling.

def main():
    sums = {}
    exit_code = 0
    if len(sys.argv) == 1:
        sum_file(sys.stdin, sums)
    else:
        for filename in sys.argv[1:]:
            try:
                with open(filename, "r") as inp:
                    sum_file(inp, sums)
            except IOError as e:
                print(f"Error reading file {filename}: {e}", file=sys.stderr)
                exit_code = 1
    for key, total in sums.items():
        print(f"{key} {total}")
    sys.exit(exit_code)

We will later see that shell control flow (e.g., conditions and loops) is actually controlled by program exit codes.
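The exit code can also be inspected by any program that launches another one. A minimal sketch using the subprocess module follows; the child programs are one-line Python snippets chosen for illustration and run_and_report is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_report(code):
    """Run a child Python snippet and return its exit code."""
    completed = subprocess.run([sys.executable, "-c", code])
    return completed.returncode


# A program that ends normally and one that calls sys.exit(3).
ok = run_and_report("pass")
failed = run_and_report("import sys; sys.exit(3)")
print(ok, failed)
```

The caller does not need to parse any messages: the first child reports 0 and the second one reports 3.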

Failing fast

So far, we expected that our shell scripts will never fail. We have not prepared them for any kind of failure.

We will eventually see how exit codes can be tested and used to control our shell scripts more, but for now we want to stop whenever any failure occurs.

That is actually quite sane behavior: you typically want the whole program to terminate if there is an unexpected failure (rather than continuing with inconsistent data). Like an uncaught exception in Python.

To enable terminate-on-failure, you need to call set -e. In case of failure, the shell will stop executing the script and exit with the same exit code as the failed command.

Furthermore, you usually want to terminate the script when an uninitialized variable is used: that is enabled by set -u. We will talk about variables later but -e and -u are usually set together.

And there is also a caveat regarding pipes and success of commands: the success of a pipeline is determined by its last command. Thus, sort /nonexistent | head is a successful command. To make a failure of any command fail the (whole) pipeline, you need to run set -o pipefail in your script (or shell) before the pipeline.

Therefore, typically, you want to start your script with the following trio:

set -o pipefail
set -e
set -u

Many commands allow short options (such as -l or -h you know from ls) to be merged like this (note that -o pipefail has to be last):

set -ueo pipefail

Get into a habit where each of your scripts starts with this command.

Actually, from now on, the GitLab pipeline will check that this command is a part of your scripts.

Pitfalls of pipes (a.k.a. SIGPIPE)

set -ueo pipefail can sometimes cause unwanted and quite unexpected behavior.

The following script terminates with a hard-to-explain error, i.e., we never reach the final echo. Note that the final hexdump is there only to ensure we do not print garbage from /dev/urandom directly on the terminal.

#!/bin/bash

set -ueo pipefail

cat /dev/urandom | head -n 1 | hexdump

echo OKAY NOT PRINTED

Despite the fact that everything looks fine.

The reason comes from the head command. head has a very smart implementation that terminates as soon as the number of lines requested by -n has been printed. Reasonable, right? But that means that cat is suddenly writing to a pipe that no one reads. It is like writing to a file that was already closed. That generates an exception (well, kind of) and cat terminates with an error. Because of set -o pipefail, the whole pipeline fails.

The truth is that distinguishing whether the closed pipe is a valid situation that shall be handled gracefully or if it indicates an issue is impossible. Therefore cat terminates with an error (after all, someone just closed its output without letting it know first) and thus the shell has to mark the whole pipeline as failed.

Solving this is not always easy and several options are available. Each has its pros and cons.

When you know why this can occur, adding || true marks the pipeline as fine (we will learn about || later on, though).
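In your own Python filters, a closed pipe typically surfaces as a BrokenPipeError raised while writing. The sketch below shows one possible way to treat it as a normal end of work; whether that is appropriate depends on the program, and the function name emit is hypothetical.

```python
import sys


def emit(lines, out=sys.stdout):
    """Print lines to out; return False if the reader went away."""
    try:
        for line in lines:
            print(line, file=out)
        out.flush()
    except BrokenPipeError:
        # The consumer (e.g., head) closed its end of the pipe.
        # Here we treat it as a signal to stop producing output.
        return False
    return True
```

The caller can then decide what exit code to use when emit returns False.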

Shell customization

We already mentioned that you should customize your terminal emulator to make it comfortable to use. After all, you will spend at least this semester with it and it should be fun to use.

In this lab, we will show some other ways to make your shell more comfortable to use.

Command aliases

You probably noticed that you execute some commands with the same options a lot. One such example could be ls -l -h that prints a detailed file listing, using human-readable sizes. Or perhaps ls -F to append a slash to the directories. And probably ls --color, too.

The shell lets you create so-called aliases, which add new commands without the need to create full-fledged scripts somewhere.

Try executing the following commands to see how a new command l could be defined.

alias l='ls -l -h'
l

We can even override the original command; the shell ensures that the rewriting is not recursive.

alias ls='ls -F --color=auto'

Note that these two aliases together also ensure that l will display filenames in colors.

There are no spaces around the equal sign.
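A few related commands help when experimenting with aliases. The following sketch uses only the alias and unalias built-ins. The alias hello is a made-up name for illustration.

```shell
# Define an alias (note: no spaces around the equal sign).
alias l='ls -l -h'

# Print the definition of a single alias.
alias l

# Print all aliases that are currently defined.
alias

# Define a throw-away alias and remove it again.
alias hello='echo Hello'
unalias hello
```

To run the original command when an alias of the same name exists, prefix it with a backslash (\ls) or run it as command ls.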

Some typical aliases that you will probably want to try are the following ones. Use a manual page if you are unsure what the alias does. Note that curl is used to retrieve contents from a URL and wttr.in is really a URL. By the way, try that command even if you do not plan to use this alias :-).

alias ls='ls -F --color=auto'
alias ll='ls -l'
alias l='ls -l -h'

alias cp='cp -i'
alias mv='mv -i'
alias rm='rm -i'

alias man='man -a'

alias weather='curl wttr.in'

`~/.bashrc`

Aliases above are nice, but you probably do not want to define them each time you launch the shell. However, most shells in Linux have some kind of file that they execute before they enter interactive mode. Typically, the file resides directly in your home directory and it is named after the shell, ending with rc (you can remember it as runtime configuration).

For Bash which we are using now (if you are using a different shell, you probably already know where to find its configuration files), that file is called ~/.bashrc.

You have already used it when setting EDITOR for Git, but you can also add aliases there. Depending on your distribution, you may already see some aliases or some other commands there.

Add aliases you like there, save the file and launch a new terminal. Check that the aliases work.

The .bashrc file behaves as a shell script and you are not limited to aliases there. It can contain virtually any command that you want to execute in every terminal that you launch.
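For illustration, a fragment of ~/.bashrc might look like the following sketch. The choice of nano as the editor and the selection of aliases are assumptions made for this example only; use whatever suits you.

```shell
# Fragment of ~/.bashrc (illustrative only)

# Editor used by Git and other programs (nano is just an example choice).
export EDITOR=nano

# Some of the aliases from the previous section.
alias ls='ls -F --color=auto'
alias ll='ls -l'
alias rm='rm -i'
```

To apply the changes in an already running shell without opening a new terminal, run source ~/.bashrc.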

Changing your prompt (`$PS1`)

You can also modify what your prompt looks like. The default is usually reasonable but some people prefer more information in there. If you are one of those, here are the details (take it as an overview as prompt customization is a topic for a whole book).

The prompt is modified through the PS1 variable. We will talk about variables in more detail later on; for now we will learn the syntax only.

When setting the variable, we can directly modify it in the shell and immediately observe the result.

Try executing the following command.

PS1=''

The prompt is gone. We have set it to an empty string.

PS1='Enter your commands: '

This is much better, right?

And try the following:

PS1='\w '

Here we set it to print current directory and a space. The special sequence \w will be automatically replaced by the name of the working directory.

Many users prefer to know which user they are logged in as.

PS1='\u: \w '

The usual tradition is to end the prompt with a dollar sign.

PS1='\u \w\$ '

Using a special sequence of \[\033[01;32m\] and \[\033[0m\] we can change the prompt color too.

PS1='\[\033[01;32m\]\u \w\[\033[0m\]\$ '

Use different numbers in place of the 32 to modify the color yourself. Special value of 0m switches back to terminal default.
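To preview the colors before putting them into PS1, the same escape sequences can be printed with printf. This is only a small sketch with made-up sample texts. The \[ and \] markers are specific to the prompt (they tell Bash that the enclosed characters take up no space on the screen), so they are left out here.

```shell
# Each line switches to a bold color, prints a sample and resets to the default.
printf '\033[01;31m%s\033[0m\n' 'bold red (31)'
printf '\033[01;32m%s\033[0m\n' 'bold green (32)'
printf '\033[01;34m%s\033[0m\n' 'bold blue (34)'
printf '\033[01;35m%s\033[0m\n' 'bold magenta (35)'
```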

It is also possible to add your own commands to be executed or even make the prompt multi-line.

PS1='$( date ) \u \w\$ '

Here, the special part $( date ) denotes that the output of the program date will become part of the prompt (we will talk about the $( ) construct later on; take it as a teaser only).

Using \n allows us to split the prompt into multiple lines.

PS1='\n$( date )\n\u \w\$ '

And of course, everything can be combined.

PS1='\n\[\033[01;32m\]$( date )\[\033[0m\]\n\[\033[01;34m\]\u\[\033[0m\] \[\033[01;35m\]\w\[\033[0m\]\$ '

More examples

The following examples can be solved either by executing multiple commands or by piping basic shell commands together. To help you find the right program, you can use manual pages. You can also use our manual as a starting point.

Note that none of the solutions requires anything more than a few pipelines. For advanced users: you definitely do not need if or while or read, let alone Perl or AWK.

Use the following CSV with data on how long it took to copy the USB disk image to the USB drives in the library. The first column represents the device, the second duration of the copying.

As a matter of fact, the first column also indirectly represents port of the USB hub (this is more by accident but it stems from the way we organized the copying). As a sidenote: it is interesting to see that some ports that are supposed to be the same are actually systematically slower.

We want to know what was the longest duration of the copying: in other words, the maximum of column two.

Solution.

Create a directory a and inside it create a text file --help containing Lorem Ipsum. Print the content of this file and then delete it. Solution.

Create a directory called b and inside it create files called alpha.txt and *. Then delete the file called * and check what happened to the file alpha.txt. Solution.

Print the content of the file /etc/passwd sorted by the rows. Solution.

Print the first and third column of the file /etc/group. Solution.

Count the lines of the file /etc/services. Solution.

Print last two lines of the files /etc/passwd and /etc/group using a single command. Solution.

Recall the file disk-speeds-data.csv with the disk copying durations. Compute the sum of all durations. Solution.

Consider the following file format.

Alpha     8  4  5  0
Bravo    12  5  3  2
Charlie   1  0 11  4

Append to each row the sum of its numbers. You do not need to keep the original alignment (i.e., feel free to squeeze the spaces). Hint. Solution.

Print the contents of /etc/passwd and /etc/group separated by text Ha ha ha (i.e., contents of /etc/passwd, line with Ha ha ha and contents of /etc/group). Solution.

Print vendors of your CPU. Use the file /proc/cpuinfo as the starting point.

Solution.

Before-class tasks (deadline: start of your lab, week March 6 - March 10)

The following tasks must be solved and submitted before attending your lab. If you have lab on Wednesday at 10:40, the files must be pushed to your repository (project) at GitLab on Wednesday at 10:39 latest.

For virtual lab the deadline is Tuesday 9:00 AM every week (regardless of vacation days).

All tasks (unless explicitly noted otherwise) must be submitted to your submission repository. For most of the tasks there are automated tests that can help you check completeness of your solution (see here how to interpret their results).

Tests are available.

This lab is about pipes. The shell tasks here must be solved using pipes, not using shell loops (even if you know them) or by off-loading to another programming language.

`04/line_count.sh` (30 points, group `shell`)

Count the total number of lines of all text files (i.e., *.txt) in the current directory. The script will output only a single number.

You can assume that there will always be at least one such file present.

`04/users.sh` (40 points, group `admin`)

Print real names of users containing system anywhere in their record (i.e. the word system appears anywhere on the line).

The list of users is available either in /etc/passwd or via getent passwd. Your script will assume that the list of users will come on standard input.

Hence test it as getent passwd | 04/users.sh.

`04/fastest.sh` (30 points, group `shell`)

Assume the following input format (durations are integers) containing program execution durations together with their authors.

name1,duration_in_seconds_1
name2,duration_in_seconds_2

Print the author of the fastest solution (you can safely assume that the durations are distinct).

Post-class tasks (deadline: March 26)

We expect you will solve the following tasks after attending the labs and hearing feedback to your before-class solutions.

Tests are available.

This lab is about pipes. The shell tasks here must be solved using pipes, not using shell loops (even if you know them) or by off-loading to another programming language.

`04/row_sum.sh` (50 points, group `shell`)

Assume that you have a matrix written in a “fancy” notation. You can rely on the format being fixed (with regard to spacing, 3 digits maximum, position of the pipe symbol etc.) but the number of columns or rows can differ.

Write a script that prints the sum of each row.

We expect that for the following matrix we would get this output.

| 106 179 |
| 188  50 |
|   5 125 |

285
238
130

The script will read input from stdin. There is no limit on the number of columns or rows but you can rely on the fixed format as explained above.

`04/day_of_week.py` (50 points, group `devel`)

Write a Python filter that converts dates to days of the week.

The program will convert dates in the first column only (using whitespace for splitting). Invalid dates will be ignored (and the line will be kept as-is). The rest of the line will be copied to the output.

2023-02-20 Rest of the line
Some other line
2023-02-21 Line  contents

Monday Rest of the line
Some other line
Tuesday Line  contents

The program must be launchable as:

04/day_of_week.py <input.txt
04/day_of_week.py input.txt
cat one.txt two.txt | 04/day_of_week.py

If the file cannot be opened, the program will print an error message to stderr (exact wording is defined by the tests) and will terminate with exit code 1.

You can expect that the program will not be invoked as 04/day_of_week.py one.txt two.txt.

We expect you will use functions from the datetime module.

Learning outcomes

Learning outcomes provide a condensed view of fundamental concepts and skills that you should be able to explain and/or use after each lesson. They also represent the bare minimum required for understanding subsequent labs (and other courses as well).

Conceptual knowledge

Conceptual knowledge is about understanding the meaning and context of given terms and putting them into context. Therefore, you should be able to …

explain what is standard input and output
explain why standard input or output redirection is not (directly) observable from within the program
explain why there are two output streams: stdout and stderr
explain how execution of cat foo.txt and cat <foo.txt differs
explain how standard inputs/outputs of several programs can be chained together
explain what is program exit code
explain differences and typical uses for the main five interfaces of a command-line program: command-line arguments, stdin, stdout, stderr, and exit code
optional: explain what is a file descriptor (from the perspective of a userland developer)

Practical skills

Practical skills are usually about usage of given programs to solve various tasks. Therefore, you should be able to …

redirect standard input and standard (error) output of a program in shell
set exit code of a Python script
use the special file /dev/null
use standard input and output in Python
use the pipe | to chain multiple programs together
use basic text filtering tools: cut, sort, …
use grep -F to filter lines matching provided pattern
optional: customize shell script with aliases
optional: store custom shell configuration in .bashrc (or .profile) scripts
optional: customize prompt with the PS1 variable

This page changelog

2023-02-25: Move task 04/users.sh to the admin group.
2023-03-03: Emphasize how stdin can be redirected into a script.

Note that this solution requires reading the input twice, hence we assume that the input is in score.txt.

<score.txt tr -s ' ' | cut -d ' ' -f 2- | tr ' ' '+' | bc | paste score.txt - | tr '\t' ' '
