A script in the Linux environment is any program that is interpreted when run (i.e., the program is distributed as source code). In this sense, there are shell scripts (the language being the shell as you saw it last time) as well as Python, Ruby, or PHP scripts.
The advantage of so-called scripting languages is that they require only a text editor for development and that they are easily portable. The disadvantage is that you need to install the interpreter first. Fortunately, Linux typically comes with many interpreters preinstalled, so starting with a scripting language is very easy.
We will build this lab around a shell script that we will incrementally develop, so that you learn the basic concepts on a practical example (obviously, there are specific tools that could be used instead, but we hope that this is better than a completely artificial example).
Preflight checklist
- You have selected a nice terminal emulator for yourself, both on the school machine and in your private Linux installation.
- You have selected a nice TUI text editor that you know how to control. Ensure it is available both in the lab and on your machine, and that it is configured correctly.
- You can use `mc` or `ranger` for basic file operations.
- You can use `cd`, `pwd`, `ls`, `cat` (and `hexdump`) to navigate through the file system and inspect files.
Running example
Data for our example can be downloaded from this project (they are inside the `03` directory).
Feel free to grab the whole repository as a tarball/ZIP file (links are under the blue Code button) and unpack it in `mc`.
They simulate simplified logs from a web server, where the web server records which files (URLs) were accessed at which time.
Practically, each file represents traffic for one day in a simplified CSV format.
Fields are separated by a comma, there is no header, and for each record we remember the date, the client’s IP address, the URL that was requested, and the amount of transferred bytes.
Our task is to write a program that prints a brief summary of the data:
- Print the 3 most accessed URLs.
- Print the total amount of data transferred.
- Print the 3 days with the highest volume of traffic (i.e., the sum of transferred bytes).
But before we build the solution we need to lay some groundwork. And because there will be a lot of that, we will finish the third subtask during the next lab.
Shell scripts
To write a shell script, we simply write the commands into a file (instead of typing them in a terminal).
Therefore, a simple script that prints some information about your system could be as simple as the following.
cat /proc/cpuinfo
cat /proc/meminfo
If you store this into a file `first.sh`, then you can execute it with the following command.
bash first.sh
Notice that we have executed `bash`, as that is the shell program (interpreter) that we are using, followed by the name of the input file.
It will `cat` those two files (note that we could have executed a single `cat` with two arguments as well).
Recall that your `01/dayname.py` script can be executed with the following command (again, we run the right interpreter).
python3 dayname.py
Shebang and executable bit
Running scripts by specifying the interpreter to use (i.e., the command to run the script file with) is not very elegant. There is an easier way: we mark the file as executable and Linux handles the rest.
Actually, when we execute the `cat` command or `mc`, there is a file (usually in the `/bin` or `/usr/bin` directory) that is named `cat` or `mc` and that is marked as executable.
(For now, imagine the special executable mark as a special file attribute.)
Notice that there is no file extension.
However, marking the file as executable is only the first half of the solution.
Imagine that we create the following content and store it into a file `hello.py` marked as executable.
print("Hello")
And then we want to run it.
But wait! How will the system know which interpreter to use? For binary executables (e.g., originally from C sources), it is easy, as the binary is (almost) directly machine code. But here we need an interpreter first.
In Linux, the interpreter is specified via so-called shebang or hashbang.
As a matter of fact, you have already encountered it several times:
When the first line of the script starts with `#!` (hence the name: hash and bang), Linux expects a path to the interpreter after it, and it will run this interpreter and ask it to execute the script.
For shell scripts, we will be using `#!/bin/bash`; for Python, we need to use `#!/usr/bin/env python3`.
We will explain the `env` part later on; for now, please just remember to use this version.
Always use `#!/usr/bin/env python3` for your Python scripts. `#!/usr/bin/env python` or `#!/usr/bin/python3` are wrong and can cause various surprises.
Note that most scripting languages use `#` to denote a comment, which means that no extra handling is needed to skip the first line (as the shebang really is not needed by the interpreter itself).
You may also encounter `#!/bin/sh` for shell scripts.
For most scripts it actually does not matter: simple constructs work the same, but `/bin/bash` offers some nice extensions.
We will be using `/bin/bash` in this course, as the extensions are rather useful.
You may need to use `/bin/sh` if you are working on older systems or if you need your script to be portable to different flavours of Unix systems.
To complicate things a bit more, on some systems `/bin/sh` is the same file as `/bin/bash` (which is really a superset of `sh`).
The bottom line is: unless you know what you are doing, stick with the `#!/bin/bash` shebang for now.
Now back to the original question: how is the script executed?
The system takes the command from the shebang, appends the actual filename of the script as a parameter, and runs that.
When the user specifies more arguments (such as `--version`), they are appended as well.
For example, if `hexdump` were actually a shell script, it would start with the following:
#!/bin/bash
...
code-to-loop-over-bytes-and-print-them-goes-here
...
Executing `hexdump -C file.gif` would then actually execute the following command:
/bin/bash hexdump -C file.gif
Notice that the only magic thing behind shebang and executable files is that the system assembles a longer command line.
The user does not need to care about the implementation language.
Let us try it practically.
We know about the shebang, so we will update our example and also mark the file as an executable one.
Store the following into `first.sh`.
#!/bin/bash
cat /proc/cpuinfo
cat /proc/meminfo
To mark it as executable, we run the following command. For now, please remember it as magic that must be done; more details on why it looks like this will come later.
chmod +x first.sh
`chmod` will not work on file systems that are not Unix/Linux-friendly.
That unfortunately includes even NTFS.
Now we can easily execute the script with the following command:
./first.sh
The obvious question is: why is the redundant `./` needed instead of just calling `first.sh`?
After all, `./` refers to the current directory (recall the previous lab), so it refers to the same file!
When you type a command (e.g., `cat`) without any path (i.e., only a bare filename of the program), the shell looks into the so-called `$PATH` to actually find the file with the program (usually, `$PATH` contains the directory `/usr/bin`, where most of the executable binaries are stored).
Unlike in some other operating systems, the shell does not look into the working directory when the program cannot be found in `$PATH`.
To run a program in the current directory, we need to specify its path (when any explicit path is provided, the shell ignores `$PATH` and simply looks for the file).
Luckily, it does not have to be an absolute path; a relative one is sufficient. Hence the magic spell of `./`.
If you move to another directory, you can execute the script by providing a relative path too, such as `../first.sh`.
Run `ls` in the directory now.
You should see `first.sh` printed in green.
If not, try `ls --color`, or check that you have run `chmod` correctly.
If you do not have a colorful terminal (unusual, but still possible), you can use `ls -F` to distinguish file types: directories will have a slash appended, and executable files will have an asterisk next to their filename.
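For illustration, in a directory containing our script and a `logs` subdirectory (a made-up layout), the output of `ls -F` might look like this:

$ ls -F
first.sh*  logs/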
Mini-exercise
Changing working directory
Let us modify our first script a little bit.
cd /proc
cat cpuinfo
cat meminfo
Run the script again.
Despite the fact that the script changed its directory to `/proc`, when it terminates, we are still in the original directory.
Try inserting `pwd` into the script to verify that the script really is inside `/proc`.
This also means that `cd` cannot be a normal binary: if it were a normal program (e.g., in Python), any directory change inside it would be lost after its termination.
Hence, `cd` is a so-called builtin that is implemented inside the shell itself.
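A quick way to convince yourself (a minimal sketch; the directory shown is just a made-up example):

$ pwd
/home/intro/labs
$ ./first.sh
...output of the script...
$ pwd
/home/intro/labs

The script changes to `/proc` internally, yet both `pwd` invocations in your shell print the same directory.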
Debugging the scripts
If you want to see what is happening, run the script as `bash -x first.sh`.
Try it now.
For longer scripts, it is better to print your own messages, as `-x` tends to become too verbose.
To print a message to the terminal, you can use the `echo` command.
With a few exceptions (more about these later), all arguments are simply echoed to the terminal.
Create a script `echos.sh` with the following content and explain the differences:
#!/bin/bash
echo alpha bravo charlie
echo alpha    bravo        charlie
echo "alpha bravo" charlie
Advancing our running example
We will now start working on our running example to prepare it.
For starters, create a version that simply echoes the list of files we will work with.
Assume that the program will read files from the `logs` subdirectory.
Do not forget to make your script executable and add the right shebang.
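A minimal sketch of what such a script might look like (assuming the log files use the `.csv` extension; we will refine the wildcard later in this lab):

#!/bin/bash

echo "Will look into the following files:" logs/*.csv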
Command-line arguments
Command-line arguments (such as `-l` for `ls` or `-C` for `hexdump`) are the usual way to control the behaviour of CLI tools in Linux.
For us, as developers, it is important to learn how to work with them inside
our programs.
We will talk about using these arguments in shell scripts later on; today we will handle them in Python.
Accessing these arguments in Python is very easy.
We need to add `import sys` to our program, and then we can access the arguments in the `sys.argv` list.
Therefore, the following program prints its arguments.
#!/usr/bin/env python3

import sys

def main():
    for arg in sys.argv:
        print("'{}'".format(arg))

if __name__ == '__main__':
    main()
When we execute it (of course, we first `chmod +x` it), we will see the following (lines prefixed with `$` denote the commands; the rest is their output).
$ ./args.py
'./args.py'
$ ./args.py one two
'./args.py'
'one'
'two'
$ ./args.py "one two"
'./args.py'
'one two'
Note that the zeroth index is occupied by the command itself (we will not use it now, but it can be used for some clever tricks), and notice how the second and third commands differ when seen from inside Python.
It should not be surprising, though: recall the previous lab and the handling of filenames with spaces in them.
Run the above command and give it a wildcard as a parameter.
Assuming you already have some shell scripts with the `.sh` extension, look at the behavior of the following invocations.
./args.py *.py
./args.py *.sh
./args.py *.shhhhhh
Recall the previous lab if you are unsure what has happened.
Standard input and outputs
You probably know the following concepts already but maybe not under exactly these names, hence we will try to refresh your knowledge about them.
Standard output
Standard output (often shortened to stdout) is the default output that you use, for example, by calling `print("Hello")` in Python.
Stdout is used by the basic output routines of almost every programming language.
Quick check: how do you print to stdout in shell?
Generally, this output has the same API as if you were writing to a file.
Be it `print` in Python, `System.out.print` in Java, or `printf` in C (where the limitations of the language necessitate the existence of the pair `printf` and `fprintf`).
This output is usually prepared by the language runtime together with the shell and the operating system (the technical details are not that important for this course anyway). Practically, the standard output is printed to the terminal or its equivalent (and when the application is launched graphically, stdout is typically lost).
Note that in Python you can access stdout explicitly via `sys.stdout`, which acts as an opened file handle (i.e., like the result of `open`).
Standard input
Similarly to stdout, almost all languages have access to stdin that represents the default input. By default, this input comes from the keyboard, although usually through the terminal (i.e., stdin is not used in graphical applications for reading keyboard input).
Note that the function `input()` that you may have used in your Python programs is an upgrade on top of stdin, as it offers basic editing functions.
Plain standard input does not support any form of editing (though typically you can use backspace to erase characters at the end of the line).
If you want to access the standard input in Python, you need to use `sys.stdin` explicitly.
As one would expect, it uses the file API; hence it is possible to read a line from it by calling `.readline()` on it, or to iterate through all lines.
In fact, iteration of the following form is quite a common pattern in many Linux utilities (they are usually written in C, but the pattern remains the same).
import sys

for line in sys.stdin:
    ...
Many of the utilities actually read from stdin by default.
For example, `cut -d : -f 1` prints only the first column of each input line (and expects the columns to be delimited by `:`).
Run it and type the following on the keyboard, terminating each line with `<Enter>`.
cut -d : -f 1
one:two
alpha:bravo
uno:dos
You should see the first column echoed underneath your input.
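The session might look like this, with your typed lines interleaved with `cut`'s output (illustrative; the program answers after each `<Enter>`):

one:two
one
alpha:bravo
alpha
uno:dos
uno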
What should you do when you are done? Typing `exit` will not help here, but `<Ctrl>-D` works.
`<Ctrl>-D` on an empty line closes the standard input. The program `cut` will realize that there is no more input to process and will gracefully terminate.
Note that this is something else than `<Ctrl>-C`, which forcefully kills the running process.
From the user's perspective, the two look similar in the context of the utility `cut`, but the behavior is totally different, with an important semantic difference (observable when using other tools).
Standard I/O redirection
As a technical detail, we mentioned earlier that the standard input and output are prepared (partially) by the operating system. This also means that it can be changed (i.e., initialized differently) without changing the program. And the program may not even “know” about it.
This is called redirection and it allows the user to specify that the standard output would not go to the screen (terminal), but rather to a file. From the point of view of the program, the API is still the same.
This redirection has to be done before the program is started and it has to be done by the caller. For us, it means we have to do it in the shell.
It is very simple: at the end of the command, we can specify `> output.txt`, and everything that would normally be printed on the screen goes to `output.txt` instead.
Before you start experimenting: the output redirection is a low-level operation and has no form of undo. Therefore, if the file you redirect to already exists, it will be overwritten without questions. And without any easy option to restore the original file content (and for small files, the restoration is technically impossible for most file systems used in Linux).
As a precaution, get into the habit of hitting `<Tab>` after you specify the filename.
If the file does not exist, the cursor will not move.
If the file already exists, the tab-completion routine will insert a space.
As the simplest example, the following two commands will create files `one.txt` and `two.txt` with the words `ONE` and `TWO` inside (including the newline character at the end).
echo ONE > one.txt
echo TWO >two.txt
Note that the shell is quite flexible about spaces, and both forms are valid (i.e., `one.txt` does not have a space as the first character of the filename).
From an implementation point of view, `echo` received a single argument; the `> filename` part is not passed to the program at all (i.e., do not expect to find `> filename` in your `sys.argv`).
Note that when you start programs from your own code via `popen` or a similar call, such calls also offer the option to specify which file to use for stdout if you want to do a redirection in your program (but only for the newly launched program, not inside a running one).
If you recall Lab 02, we mentioned that the program `cat` is used to concatenate files.
With the knowledge of output redirection, it suddenly starts to make more sense, as the (merged) output can be easily stored in a file.
cat one.txt two.txt >merged.txt
Appending in output redirection
The shell also offers an option to append the output to an existing file, using the `>>` operator.
Thus, the following command adds `UNO` as another line to `one.txt`.
echo UNO >>one.txt
If the file does not exist, it will be created.
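You can check the result with `cat`:

$ cat one.txt
ONE
UNO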
For the following example, we will need the program `tac`, which reverses the order of individual lines but otherwise works like `cat` (note that `tac` is `cat` spelled backwards, what a cool name). Try this first.
tac one.txt two.txt
If you have executed the commands above, you should see the following:
UNO
ONE
TWO
Try the following and explain what happens (and why) if you execute
tac one.txt two.txt >two.txt
Input redirection
Similarly, the shell offers `<` for redirecting stdin.
Then, instead of reading input typed by the user on the keyboard, the program
reads the input from a file.
Note that programs using the Pythonic `input()` do not work that well with redirected input.
Practically, `input()` is suitable for interactive programs only; you might want to use `sys.stdin.readline()` or `for line in sys.stdin` instead.
When input is redirected, we do not need to issue `<Ctrl>-D` to close the input, as it is closed automatically upon reaching the end of the file.
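For example, the interactive `cut` invocation from above can read its input from a file instead of the keyboard (assuming you stored the `one:two`-style lines into a file named `input.txt`, a name made up for this illustration):

cut -d : -f 1 <input.txt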
Filters
Many utilities in Linux work as so-called filters. They accept the input from stdin and print their output to stdout.
One such example is `cut`, which can be used to print only certain columns of the input.
For example, running `cut -d : -f 1` with `/etc/passwd` as its input will display the list of accounts (usernames) on the current machine.
Try to run the following two commands (and notice the difference).
cut -d : -f 1 </etc/passwd
cut -d : -f 1 /etc/passwd
The above behavior is quite common for most filters: you can specify the input file explicitly, but when it is missing, the program reads from the stdin.
What is the difference between the two invocations above? They will print the same result, after all.
In the first case (with input redirection), the input file is opened by the shell, and the opened file is passed to `cut`.
Problems with opening the file are reported by the shell, and `cut` might not be launched at all.
In the second case, the file is opened by `cut` itself (i.e., `cut` executes the `open()` call and also needs to handle any errors).
Advancing the running example
Armed with this knowledge, we can actually solve the first part of our running example. Recall that we have files that logged traffic each day and we want to find URLs that are most common in all the files together.
That means we need to join all files together, keep only the URL and find the three most frequent lines.
And we can do that. Recall that `cat` can be used to concatenate files and `cut` can be used to keep only certain columns.
We will get to finding the most frequent URL in a while.
So, how about this?
#!/bin/bash
echo "Will look into the following files:" logs/20[0-9][0-9]-[01][0-9]-[0-3][0-9].csv
cat logs/20[0-9][0-9]-[01][0-9]-[0-3][0-9].csv >_logs_merged.csv
cut -d , -f 5 <_logs_merged.csv
We have used a quite explicit wildcard to ensure we do not pick up some random CSVs, even though `cat logs/*.csv` would work as well.
Consider how much time this would take to write in Python.
The script has one big flaw (we will solve it soon but it needs to be mentioned anyway).
The script writes to a file called `_logs_merged.csv`. We have prefixed the filename with an underscore to mark it as somewhat special, but still: what if the user had created such a file manually?
We would overwrite that file, no questions asked, with no option to recover it.
Never do that in your scripts.
You may also encounter a variant where `cut` is called as `cut -d, -f3`.
Most programs are smart enough to recognize both variants, but it is important to remember that this is something that must be handled by each program.
That is, the program must be able to work both with `sys.argv[1] == '-d,'` and with `(sys.argv[1] == '-d') and (sys.argv[2] == ',')`.
Pipes (data streaming composition)
We finally move to the area where Linux excels: program composition. In essence, the whole idea behind Unix-family of operating systems is to allow easy composition of various small programs together.
Mostly, the programs that are composed together are filters and they operate on text inputs. These programs do not make any assumptions on the text format and are very generic. Special tools (that are nevertheless part of Linux software repositories) are needed if the input is more structured, such as XML or JSON.
The advantage is that composing the programs is very easy and it is very easy to compose them incrementally too (i.e., add another filter only when the output from the previous ones looks reasonable). This kind of incremental composition is more difficult in normal languages where printing data requires extra commands (here it is printed to the stdout without any extra work).
The disadvantage is that complex compositions can become difficult to read. It is up to the developer to decide when it is time to switch to a better language and process the data there. A typical division of labour is that shell scripts are used to preprocess the data: they are best when you need to combine data from multiple files (such as hundreds of various reports, etc.) or when the data needs to be converted to a reasonable format (e.g. non-structured logs from your web server into a CSV loadable into your favorite spreadsheet software or R). Computing statistics and similar tasks are best left to specialized tools.
Let us return to the running example again.
We already mentioned that the temporary file we used is bad because we might have overwritten someone else's data.
But it also requires disk space for another copy of the (possibly huge) data.
A bit more subtle but much more dangerous problem is that the path to the
temporary file is fixed.
Imagine what happens if you execute the script in two terminals concurrently.
Do not be fooled by the feeling that the script is so short that the probability of concurrent execution is negligible.
It is a trap that is waiting to spring.
We will talk about the proper use of `mktemp(1)` later, but in this example, no temporary file is needed at all.
We learned about program composition, right? And we can use it here.
cat logs/20[0-9][0-9]-[01][0-9]-[0-3][0-9].csv | cut -d , -f 5
The `|` symbol stands for a pipe, which connects the standard output of `cat` to the standard input of `cut`. The pipe passes data between the two processes without writing them to the disk at all. (The data are passed using memory buffers, but that is a technical detail.)
The result is the same, but we escaped the pitfalls of using temporary files and the result is actually even more readable.
The pipe `|` connects the standard output of the left-side program with the standard input of the right-side program, and the shell/OS ensure that the data flow between the two programs.
The programs typically do not know that they are part of a pipe: stdout and stdin are prepared transparently by the system and the programs (or their developers) do not need to care about this.
For cases when the first command also reads from standard input, another syntax is available. For example, the following prints a sorted list of local user accounts (usernames).
cut -d : -f 1 </etc/passwd | sort
We can even move the first `<` before `cut`, so that the pipeline can be read left-to-right like "take `/etc/passwd`, extract the first column, and then sort it":
</etc/passwd cut -d : -f 1 | sort
In essence, the Unix family of operating systems is built on top of the ability to create pipelines, which chain a sequence of programs using pipes. Each program in the pipeline denotes a type of transformation. These transformations are composed together to produce the final result.
Advancing the running example a bit more
We wanted to print the three most visited URLs first.
Using the pipe above, we can print all the URLs in a single list.
To find the most often visited ones, we will use a typical trick: first sort the lines alphabetically, and then use the program `uniq` with `-c` to count unique lines (in effect counting how many times each URL was visited).
We then sort this output by the counts and print the first 3 lines only.
In a Pythonic solution, you would probably create a dictionary with the URL as the key and the counter (how many times the URL was accessed) as the value, and then print the keys with the highest values. An ugly solution one might hack together to make things work could look like this (it expects that all files are already concatenated):
import sys

urls = {}
for line in map(lambda x: x.rstrip().split(',')[4], sys.stdin):
    urls[line] = urls.get(line, 0) + 1

how_many = 3
for url, count in sorted(urls.items(), key=lambda item: item[1], reverse=True):
    print("{:7} {}".format(count, url))
    how_many = how_many - 1
    if how_many <= 0:
        break
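Even this hack still leans on the shell for the concatenation; assuming you stored it as `top_urls.py` (a name made up for this example), you would run it as follows.

cat logs/*.csv | python3 top_urls.py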
In shell, our program will evolve like this (lines starting with `#` are obviously comments).
# Get all URLs
cat logs/20[0-9][0-9]-[01][0-9]-[0-3][0-9].csv | cut -d , -f 5
# We will make the wildcard shorter to save space
cat logs/*.csv | cut -d , -f 5
# Sort URLs, have same URLs on adjoining lines
cat logs/*.csv | cut -d , -f 5 | sort
# Count number of occurrences (uniq does not sort the file)
cat logs/*.csv | cut -d , -f 5 | sort | uniq -c
# Sort output of uniq numerically (and in reverse)
cat logs/*.csv | cut -d , -f 5 | sort | uniq -c | sort -n -r
# Print the first three lines only
cat logs/*.csv | cut -d , -f 5 | sort | uniq -c | sort -n -r | head -n 3
Do not be scared. We advanced by little steps on each line. Run the individual commands yourself and watch how the output is transformed.
Note how the shell solution is easier to debug (once you know the language): you build it little by little, while the Python script requires extra prints (that you then need to remove), and the solution is much more tightly knotted than the shell one.
Exercise
Print the total amount of transferred bytes using the logs from our running example (i.e., the last part of the task).
Hint: you will need `cat`, `cut`, `paste`, and `bc`.
The first part should be easy: we are interested only in the last column.
cat logs/*.csv | cut -d , -f 4
To sum the lines of numbers, we will use `paste`, which is able to merge lines from multiple files or to join all lines of one input into a single line.
We will give it the separator `+` to create one huge expression `SIZE1+SIZE2+SIZE3+...`.
cat logs/*.csv | cut -d , -f 4 | paste -s -d +
Finally, we will use `bc` to compute the sum.
cat logs/*.csv | cut -d , -f 4 | paste -s -d + | bc
`bc` alone is quite a powerful calculator that can be used interactively too (recall that `<Ctrl>-D` terminates the input in interactive mode).
More examples are provided at the end of this lab.
You now know basically everything about pipes. The rest of the magic is the knowledge of available filters (and a few corner cases).
It is like the API in Python: the more of it you know, the easier it is to build new programs.
Redirecting into and inside a script
Consider the following mini-script (`first-column.sh`) that extracts and sorts the first column (for colon-delimited data such as in `/etc/passwd`).
Notice that there is no input file specified.
#!/bin/bash
cut -d : -f 1 | sort
The user can then use the script as follows, and the standard input of `cut` will be properly wired to the script's standard input or to the pipe.
cat /etc/passwd | ./first-column.sh
./first-column.sh </etc/passwd
head /etc/passwd | ./first-column.sh | tail -n 3
More examples
The following examples can be solved either by executing multiple commands or by piping basic shell commands together. To help you find the right program, you can use manual pages. You can also use our manual as a starting point.
Note that none of the solutions requires anything more than a few pipelines.
For advanced users: you definitely do not need `if`, `while`, or `read`, nor Perl or AWK.
The first batch of examples also contains our solution so that you can compare it with yours.
The second batch does not contain solutions but automated tests are available.
Examples with complete solutions
Examples with automated tests
Learning outcomes
Learning outcomes provide a condensed view of fundamental concepts and skills that you should be able to explain and/or use after each lesson. They also represent the bare minimum required for understanding subsequent labs (and other courses as well).
Conceptual knowledge
Conceptual knowledge is about understanding the meaning and context of given terms and putting them into context. Therefore, you should be able to …
- explain what a script is in a Linux environment
- explain what a shebang (hashbang) is and how it influences script execution
- understand the difference between a script with and without the executable bit set
- explain what a working directory is
- explain why the working directory is private to each running program
- explain how parameters (arguments) are passed to a script with a shebang
- explain what standard input and output are
- explain why standard input or output redirection is not (directly) observable from within the program
- explain how the execution of `cat foo.txt` and `cat <foo.txt` differs
- explain how standard inputs/outputs of several programs can be chained together
- optional: explain why `cd` cannot be a normal executable file like `/usr/bin/ls`
Practical skills
Practical skills are usually about usage of given programs to solve various tasks. Therefore, you should be able to …
- create a Linux script with a correct shebang
- set the executable bit on a script using the `chmod` utility
- access command-line arguments in a Python program
- redirect standard input and standard output of a program in shell
- use standard input and output in Python
- use the pipe `|` to chain multiple programs together
- use basic text filtering tools: `cut`, `sort`, …
- use `grep -F` to filter lines matching a provided pattern