In this lab we will have a look at two useful utilities, namely xargs and find. We will also extend our knowledge about SSH by learning about port forwarding. But in the majority of the lab we will explore how to develop Python projects in a sandboxed environment that is easily distributed among individual developers in a big software team. We will also see how Python programs can be prepared for further distribution.

The parts about find and xargs are somewhat connected; otherwise the three topics (find/xargs, SSH port forwarding and sandboxed development) are independent and can be read in any order. Note that the sandboxed Python development knowledge will be useful for the last homework task.
Preflight checklist
- You know what SSH is.
- You know what TCP ports are.
- You know what a disk image is.
- You remember shell wildcards.
- You know that C strings are terminated with a zero byte.
- You know how Python modules are created and organized.
xargs (and parallel) utilities

xargs in its simplest form reads standard input and converts it to program arguments for a user-specified program.
Assume we have the following files in a directory:
2024-04-16.txt 2024-04-24.txt 2024-05-02.txt 2024-05-10.txt
2024-04-17.txt 2024-04-25.txt 2024-05-03.txt 2024-05-11.txt
2024-04-18.txt 2024-04-26.txt 2024-05-04.txt 2024-05-12.txt
2024-04-19.txt 2024-04-27.txt 2024-05-05.txt 2024-05-13.txt
2024-04-20.txt 2024-04-28.txt 2024-05-06.txt 2024-05-14.txt
2024-04-21.txt 2024-04-29.txt 2024-05-07.txt 2024-05-15.txt
2024-04-22.txt 2024-04-30.txt 2024-05-08.txt
2024-04-23.txt 2024-05-01.txt 2024-05-09.txt
As a mini-task, write a shell one-liner to create these files.
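One possible solution (a sketch assuming GNU date; any equivalent loop works):

for i in $(seq 0 29); do
    touch "$( date -d "2024-04-16 +$i days" '+%Y-%m-%d' ).txt"
done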
Our task is to remove files that are older than 20 days. In this version, we only echo the command so that we do not need to recreate the files when debugging our solution.
cutoff_date="$( date -d "20 days ago" '+%Y%m%d' )"
for filename in 202[0-9]-[01][0-9]-[0-3][0-9].txt; do
    date_num="$( basename "$filename" .txt | tr -d '-' )"
    if [ "$date_num" -lt "$cutoff_date" ]; then
        echo rm "$filename"
    fi
done
This means that the program rm would be called several times, always removing just one file. The overhead of starting a new process could become a serious bottleneck for larger scripts (think about thousands of files, for example). It would be much better if we called rm just once, giving it a list of files to remove (i.e., as multiple arguments).
xargs is the solution here. Let’s modify the program a little bit:

cutoff_date="$( date -d "20 days ago" '+%Y%m%d' )"
for filename in 202[0-9]-[01][0-9]-[0-3][0-9].txt; do
    date_num="$( basename "$filename" .txt | tr -d '-' )"
    if [ "$date_num" -lt "$cutoff_date" ]; then
        echo "$filename"
    fi
done | xargs echo rm
Instead of removing the file right away, we just print its name and pipe the whole loop to xargs, whose arguments specify the program to be launched. Instead of many lines with rm ... we will see just one long line with a single invocation of rm.
Another situation where xargs can come in handy is when you are building a complex command line or when using command substitution ($( ... )) would make the script unreadable.

Of course, tricky filenames can still cause issues as xargs assumes that arguments are delimited by whitespace. (Note that above, we were safe as the filenames were reasonable.) That can be changed with --delimiter.
If you are piping input to xargs from your program, consider delimiting items with the zero byte (i.e., the C string terminator, \0). Recall what you have heard about C strings – and how they are terminated – in your Arduino course. That is the safest option as this character cannot appear anywhere inside any argument. Tell xargs about it via -0 or --null.
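For example, a dry run of a null-delimited removal could look like this (a sketch; printf repeats its format for every argument, emitting each filename terminated by \0):

printf '%s\0' 2024-*.txt | xargs -0 echo rm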
Note that xargs is smart enough to realize when the command line would be too long and split it automatically (see the manual for details).

It is also good to remember that xargs can execute the command in parallel (i.e., split the stdin into multiple chunks and call the program multiple times with different chunks) via -P. If your shell scripts are getting slow but you have plenty of CPU power, this may speed things up quite a lot for you.
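As a sketch (this really compresses the files, so try it only on scratch data): the following compresses the text files from the example above, running up to four gzip processes at once, each receiving at most eight filenames.

printf '%s\0' 2024-*.txt | xargs -0 -n 8 -P 4 gzip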
parallel
This program can be used to execute multiple commands in parallel, hence speeding up the execution. parallel behaves almost exactly like xargs but has much better support for concurrent execution of individual jobs (not mixing their output, execution on a remote machine, etc.). The differences are rather well described in the parallel documentation. Please also refer to parallel_tutorial(1) (yes, that is a man page) and to parallel(1) for more details.
find
While ls(1) and wildcard expansion are powerful, sometimes we need to select files using more sophisticated criteria. That is where the find(1) program comes in useful. Without any arguments, it lists all files in the current directory, including files in nested directories. Do not run it on the root directory (/) unless you know what you are doing (and definitely not on the shared linux.ms.mff.cuni.cz machine).
With the -name parameter you can limit the search to files matching a given wildcard pattern. The following command finds all alpha.txt files in the current directory and in any subdirectory (regardless of depth).

find -name alpha.txt
Why would the following command for finding all *.txt files not work?
find -name *.txt
find has many options – we will not duplicate its manpage here but mention those that are worth remembering.

-delete immediately deletes the found files. Very useful and very dangerous.

-exec runs a given program on every found file. You have to use {} to specify the found filename and terminate the command with ; (since ; terminates commands in shell too, you need to escape it).

find -name '*.md' -exec wc -l {} \;
Note that for each found file, a new invocation of wc happens. This can be altered by changing the command terminator (\;) to +. See the difference between the invocations of the following two commands (the former runs echo once per file, the latter passes as many filenames as possible to a single echo):
find -name '*.md' -exec echo {} \;
find -name '*.md' -exec echo {} +
Caveats
By default, find prints one filename per line. However, a filename can even contain the newline character (!) and thus the following idiom is not 100% safe.

find -options-for-find | while read filename; do
    do_some_complicated_things_with "$filename"
done
If you want to be really safe, use -print0 and IFS= read -r -d $'\0' filename, as that uses the only safe delimiter – \0. Alternatively, you can pipe the output of find -print0 to xargs --null.
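Put together, the safe variant of the loop above would look like this (same placeholder names as before):

find -options-for-find -print0 | while IFS= read -r -d $'\0' filename; do
    do_some_complicated_things_with "$filename"
done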
However, if you are working with your own files or the pattern is safe, the above loop is fine (just do not forget that directories are files too and they can contain \n in their names as well).

Shell also allows you to export a function and call back to it from inside xargs. The invocation pattern looks awful, but it is a safe approach if you want to execute a complex operation on top of the found files.
my_callback_function() {
    echo ""
    echo "\$0 = $0"
    echo "\$@ =" "$@"
}
export -f my_callback_function

find . -print0 | xargs -0 -n 1 bash -c 'my_callback_function "$@"' arg_zero arg_one
Recall that you can define functions directly in shell; the above can actually be typed interactively without storing it as a script.
SSH port forwarding
Generally, services provided by a machine should not be exposed over the network for random “security researchers” to play with. Therefore, a firewall is usually configured to control access to your machine from the network.
If a service should be provided only locally, it is even easier to let it listen on the loopback device only. This way, only local users (including users connected to the machine via SSH) can access it.
As an example, you will find that there is a web server listening on port 8080 of linux.ms.mff.cuni.cz. This web server is not available when you try to access it remotely (as linux.ms.mff.cuni.cz:8080), but accessing it locally (when logged in to linux.ms.mff.cuni.cz) works.
you@laptop$ curl http://linux.ms.mff.cuni.cz:8080 # Fails
you@laptop$ ssh linux.ms.mff.cuni.cz curl --silent http://localhost:8080 # Works
While using cURL to access this web server is possible, it is not the most user-friendly way to browse a web page.
Local Port Forwarding
SSH can be used to create a secure tunnel, through which a local port is forwarded to a port accessible from the remote machine. In essence, you will connect to a loopback device on your machine and SSH will forward that communication to the remote server, effectively making the remote port accessible.
The following command will make local port 8888 behave as port 8080 on the remote machine.
The 127.0.0.1 part refers to the loopback on the remote server (you can write localhost there, too).
ssh -L 8888:127.0.0.1:8080 -N linux.ms.mff.cuni.cz
You always first specify which local port to forward (8888) and then the destination as if you were connecting from the remote machine (127.0.0.1:8080).
The -N makes this connection usable only for forwarding – use Ctrl-C to terminate it (without it, you will also log in to the remote machine).

Open http://localhost:8888 in your browser to check that you can see the same content as with the ssh linux.ms.mff.cuni.cz curl http://localhost:8080 command above.
You will often forward (local) port N to the same (remote) port N, hence it is very easy to forget the proper order. However, the ordering of the -L parameters is important and switching the numbers (e.g. 8888:127.0.0.1:9090 instead of 9090:127.0.0.1:8888) will forward different ports (usually, you will learn about it pretty quickly, though).
But do not worry if you are unable to remember it. That is why you have manual pages, and even everyday users of Linux use them. It is not something to be ashamed of or afraid of :-).
Remote/Reverse Port Forwarding
SSH also allows creating a so-called remote port forward. It basically allows you to open a connection from the remote server to your local machine (in the reverse direction of the SSH connection).
Practically, you can set up a remote port forwarding by connecting from your desktop you have at home to a machine in IMPAKT/Rotunda, for example, and then use it to connect from IMPAKT/Rotunda back to your desktop.
This feature will work even if your machine is behind NAT, which makes direct connections from the outside impossible.
The following command sets up remote port forwarding such that connecting to port 2222 on the remote machine will be translated to a connection to port 22 (SSH) on the local machine:
ssh -N -R 2222:127.0.0.1:22 u-plN.ms.mff.cuni.cz
You first specify the remote port to forward (2222) and then the destination as if you were connecting from the local machine (127.0.0.1:22).
When trying this, ensure that your sshd daemon is running (recall lab 10 and the systemctl command) and use a different port than 2222 to prevent collisions.

In order to connect to your desktop via this port forward, you have to do so from the IMPAKT/Rotunda lab via the following command.
ssh -p 2222 your-desktop-login@localhost
We use localhost as the connection is only bound to the loopback interface, not to the actual network adapter available on the lab computers. (Actually, ssh allows binding the port forward to the public IP address, but this is often disabled by the administrator for security reasons.)
Sandboxed software development
In one of the previous labs, we showed that the preferred way of installing applications (and libraries and data files) on Linux is via the package manager. It installs the application for all users, it allows system-wide upgrades, and it generally keeps your system in a much cleaner state.
However, a system-wide installation may not always be suitable. A typical example is project-specific dependencies. These are often not installed system-wide, mainly for the following reasons:
- You need different versions of dependencies for different projects.
- You do not want to remember to uninstall them when you stop working on the project.
- You want to control when you upgrade them: an upgrade of the OS should not affect your project.
- The versions you need are different from those available through the package manager.
- Or they may not be packaged at all.
For the above reasons, it is much better to create a project-specific installation that is better isolated from the system. Note that installing the dependency per-user (i.e., somewhere into $HOME) may not provide the isolation you wish to achieve.
Such an approach is supported by most reasonable programming languages and can usually be found under names such as virtual environment, local repository, sandbox or similar (note that the concepts do not map 1:1 across languages and tools, but the general idea remains the same).
With a virtual environment, your dependencies are usually installed into a specific directory inside your project, kept outside version control. The compiler/interpreter is then told to use this location.
The directory-local installation then keeps your system clean. It also allows working on multiple projects with incompatible dependencies, because they are completely isolated.
Each developer can then recreate the environment without polluting the main repository with distribution-specific or even OS-dependent files. Yet the configuration file ensures that all developers will be working in the same environment (i.e., same versions of all the dependencies).
It also means that new members of software teams can easily set up their environment using the provided configuration file.
Dependency installation
Inside the virtual environment, the project usually does not use generic package managers (such as DNF). Instead, dependencies are installed using language-specific package managers. These are usually cross-platform and use their own software repository. Such a repository then hosts only libraries for that particular language. Again, there can be multiple such repositories and it is up to the developers how they configure their projects.
In our scenario, the language-specific managers would install only into the virtual environment directory without ever touching the system itself.
Installation directories
On a typical Linux system, there are multiple places where software can be installed:
- /usr – system packages handled by the distribution’s package manager
- /usr/local – software installed locally by the administrator; language-specific managers usually install system-wide packages there
- /opt/$PACKAGE – large packages installed outside the distribution’s package manager often live in their own sub-directory inside /opt
- $HOME (usually /home/$USER/) – language-specific managers run by non-root users can install packages locally to their home directory (to language-specific sub-directories). $HOME/.local is a favourite place for local installations that generally mirrors /usr/local but for a single user only (executables are then placed inside $HOME/.local/bin)
- per-project virtual environments
Python Package Index (PyPI)
The rest of the text will focus mostly on Python tools supporting the above-mentioned principles. Similar tools are available for other languages, but we believe that demonstrating them on Python is sufficient to understand the principles in practice.
Python has a repository called the Python Package Index (PyPI) where anyone can publish their Python programs and/or libraries.
The repository can be used through a web browser, but also through a command-line client called pip. pip behaves rather similarly to DNF. You can use it to install, upgrade, or uninstall Python modules.
Issues of trust
In your distribution’s upstream package repository, all packages typically have to be reviewed by someone from the distribution’s security team. This is sadly not true for PyPI or similar repositories. That said, you as a developer must be more cautious when installing from such sources.
Not all packages do what they claim to. Some are just innocently buggy, but some are outright malicious. Re-using other people’s code is generally a good practice, but you should give a thought to the trustworthiness of the author. After all, the code will be executed under your account either when you run your program or as a part of the installation process.
In particular, criminals like to publish malicious packages, whose name differs from a well-known package by a single typo. This is called typosquatting. You might read more for example in this blogpost, but searching the web will yield more results.
On the other hand, many PyPI packages are also available as packages for your distribution (feel free to try dnf search python3- on your Fedora box). Hence they probably were reviewed by distribution maintainers and are probably safe to use.

For packages not available for your distribution natively, always look for tell-tale signs distinguishing a normal project from a malicious one: popularity of the source code repository, user activity, reactions to bug reports, documentation quality, and so on.
Recall that modern software is rarely built from scratch. Do not be afraid to explore what is available. Check it. And use it :-).
Typical workflow practically
While the actual tools differ across programming languages, the steps for developing a project in some kind of a sandbox are generally the same.
- The developer clones the project (e.g., from a Git repository).
- The sandbox (virtual environment) is initialized. Usually this means that a new directory with a fresh language environment is created.
- The virtual environment must be activated. Often the virtual environment needs to modify $PATH (or rather some language-specific variant of such a path that is used to search for libraries or modules), so the developer must source (or .) some activation script that modifies the path.
- Then the developer can install the dependencies of the project. They are usually stored in a file that can be passed to the package manager (of the given programming language).
- Only now the developer can actually work on the project. The project is fully isolated, removing the virtual environment directory removes all traces of the installed packages.
Everyday work then often involves only step 3 (some kind of activation) and step 5 (actual development).
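In Python terms (covered in detail below), such a workflow might look like this sketch (the repository URL is hypothetical):

git clone https://git.example.com/team/project.git
cd project
python3 -m venv my-venv
source my-venv/bin/activate
pip install -r requirements.txt
# ... actual development ...
deactivate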
Note that activation of the virtual environment typically removes access to libraries installed globally. That is, inside the virtual environment, the developer starts with a fresh and clean environment with a bare compiler. That is actually a very sane decision as it ensures that system-wide installation does not affect the project-specific environment.
In other words, it improves the reproducibility of the whole setup. It also means that the developer needs to specify every dependency in the configuration file, even dependencies that are usually present everywhere.
Virtual environment for Python (a.k.a. virtualenv or venv)

To try installing Python packages safely, we will first set up a virtual environment for our project. Fortunately, Python has built-in support for creating virtual environments.
We will demonstrate this on the following example:
#!/usr/bin/env python3

import argparse
import shutil
import sys

import fs


class FsCatException(Exception):
    pass


def fs_cat(filesystem, filename, target):
    try:
        with fs.open_fs(filesystem) as my_fs:
            try:
                with my_fs.open(filename, 'rb') as my_file:
                    shutil.copyfileobj(my_file, target)
            except fs.errors.FileExpected as e:
                raise FsCatException(f"{filename} on {filesystem} is not a regular file") from e
            except fs.errors.ResourceNotFound as e:
                raise FsCatException(f"{filename} does not exist on {filesystem}") from e
    except Exception as e:
        if isinstance(e, FsCatException):
            raise e
        raise FsCatException(f"unable to read {filesystem}, perhaps misspelled path or protocol ({e})?") from e


def main():
    args = argparse.ArgumentParser(description='Filesystem cat')
    args.add_argument(
        'filesystem',
        nargs=1,
        metavar='FILESYSTEM',
        help='Filesystem specification, e.g. tar://path/to/file.tar'
    )
    args.add_argument(
        'filename',
        nargs=1,
        metavar='FILENAME',
        help='File path on FILESYSTEM, e.g. /README.md'
    )
    config = args.parse_args()

    try:
        fs_cat(config.filesystem[0], config.filename[0], sys.stdout.buffer)
    except FsCatException as e:
        print(f"Fatal: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()
Save this snippet into fscat.py and set the executable bit.
Note that fs.open_fs is able to open various filesystems and access files on them as if you used the built-in Pythonic open. In our program, we provide a path to a filesystem and to a file (residing on this filesystem) to print to the screen (hence the name fscat, as it simulates cat inside a different filesystem).
Try running the fscat.py program. Unless you have already installed the python3-fs package system-wide, it should fail with ModuleNotFoundError: No module named 'fs'. The chances are that you do not have that module installed. If you have python3-fs installed, uninstall it now and try again (just for this demo). But double-check that you would not remove some other program that may require it.
We could now install python3-fs with DNF, but we have already described why that is a bad idea. We could also install it with pip globally, but that is not the best course of action either. Instead, we will create a new virtual environment for it.
python3 -m venv my-venv
The above command creates a new directory my-venv that contains a bare installation of Python. Feel free to investigate the contents of this directory.
We now need to activate the environment.
source my-venv/bin/activate
Your prompt should have changed: it is now prefixed by (my-venv). Running fscat.py will still terminate with ModuleNotFoundError.
We will now install the dependency:
pip install fs
This will take some time as Python will also download transitive dependencies of this library (and their dependencies, etc.). Once the installation finishes, run fscat.py again. This time, it should work.
./fscat.py
Okay, it printed an error message about required arguments. Download this tarball and run the script as follows:
./fscat.py tar://test.tar.gz testdir/test.txt
It should print Test string as it is able to handle even tarballs as filesystems and print files on them (verify that the file is really there using either atool, MC, or tar directly).
Once we are finished with the development, we can deactivate the environment by calling deactivate (this time, without sourcing anything). Running fscat.py outside the environment shall again terminate with ModuleNotFoundError.
How does it work?
The Python virtual environment uses two tricks in its implementation.
First, the activate script extends $PATH with the my-venv/bin directory. That means that calling python3 will prefer the application from the virtualenv’s directory (e.g. my-venv/bin/python3). Try this yourself: print $PATH before and after you activate a virtualenv.
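For example, the effect might look like this (paths will differ on your machine):

$ which python3
/usr/bin/python3
$ source my-venv/bin/activate
(my-venv) $ which python3
/home/you/project/my-venv/bin/python3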
This also explains why we should always specify /usr/bin/env python3 in the shebang instead of /usr/bin/python3: env consults the $PATH that was modified by the activation of the virtualenv.
You can also view the activate script and see how this is implemented. Note that deactivate is actually a function.

Why is the activate script not executable?
The second trick is that Python searches for modules (i.e., for files implementing an imported module) relative to the path of the python3 binary. Hence, when python3 is inside my-venv/bin, Python will look for the modules inside my-venv/lib. That is the location where your locally installed files will be placed.
You can check this by executing the following one-liner that prints Python search directories (again, before and after activation):
python3 -c 'import sys; print(sys.path)'
This behaviour is actually not hard-wired in the Python interpreter. When Python starts up, it automatically imports a module called site. This module contains site-specific setup: it adjusts sys.path to include all directories where your distribution installs Python modules. It also detects virtual environments by looking for the pyvenv.cfg file in the grandparent directory of the python3 binary. In our case, this configuration file contains include-system-site-packages=false, which tells the site module to skip the distribution’s module directories. You can see that the principle is very simple and the interpreter itself needs to know nothing about virtual environments.
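For illustration, a pyvenv.cfg might look something like this (a sketch; the exact fields and values depend on your Python version):

home = /usr/bin
include-system-site-packages = false
version = 3.12.2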
Installing Python-specific packages with pip
pip vs. python3 -m pip?

Generally, it is recommended to use python3 -m pip rather than raw pip. The reasons behind these additional 10 keystrokes are well described in Why you should use python3 -m pip. However, in order to make the following text more readable, we will use the shorter pip variant.
We have already seen one usage of pip in practice, but pip can do much more. A nice walkthrough of all pip capabilities can be found in Using Python’s pip to Manage Your Projects’ Dependencies. Here we provide a brief summary of the most important concepts and commands.
By default, pip install searches the PyPI package registry in order to install the package specified on the command line. We would not be far from the truth by saying that all packages inside this registry are just archived directories, which contain Python source code organized in a prescribed way. If you would like to change this default package registry, you can use the --index-url argument.
In a later section, we will learn how to turn a directory with code into a proper Python package. Assuming that we have already done so, we can install that package directly (without archiving/packing) by running pip install /path/to/python_package.
For example, imagine a situation where you are interested in a third-party open-source package. This package is available in a remote Git repository (typically on GitHub or GitLab), but it is NOT packed and published in PyPI. You can simply clone the repository and run pip install . inside it. However, thanks to pip VCS Support, you can avoid the cloning phase and install the package directly with:
pip install git+https://git.example.com/MyProject
In order to upgrade a specific package, you run pip install --upgrade [packages]. Finally, to remove a package you run pip uninstall [packages].
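For example, with the fs library from the earlier demo:

# Upgrade to the newest allowed version
pip install --upgrade fs
# Remove the package
pip uninstall fs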
Dependency versioning
You might have heard about semantic versioning. Python uses a more or less compatible versioning, which is described in PEP 440 – Version Identification and Dependency Specification.
When you install dependencies from the package registry, you can specify this version.
pkgname # latest version
pkgname == 4.2 # specific version
pkgname >= 4.2 # minimal version
pkgname ~= 4.2 # equivalent to >= 4.2, == 4.*
In fact, a version specifier consists of a series of version clauses, separated by commas. Therefore you can write:
pkgname >= 1.0, != 1.3.4.*, < 2.0
Sometimes it is helpful to save a list of all currently installed packages (including transitive dependencies). For example, you may have recently noticed a new bug in your project and would like to keep a record of the precise versions of the currently installed dependencies, so that your co-worker can reproduce the bug.
To do that, you can use pip freeze to create a list that pins specific versions, ensuring the same environment for every developer. It is recommended to store this list in a requirements.txt file.
# Generating requirements file
pip freeze > requirements.txt
# Installing package from it
pip install -r requirements.txt
Packaging Python Projects
Let’s say that you have come up with a super cool algorithm and you want to enrich the world by sharing it. The official Python documentation offers a step-by-step tutorial on how to achieve that.
Python Package Directory Structure
The very first step, before you can publish it, is to transform it into a proper Python package. We need to create files called pyproject.toml and setup.cfg. These files contain information about the project, a list of dependencies, and also information for project installation.
Historically, this information was stored in an executable setup.py script, rather than in setup.cfg and pyproject.toml. Therefore, in many repositories/tutorials you can still find usage of it. The content is more or less 1:1, but there are certain cases in which you are forced to use setup.py. Fortunately, this is not applicable to our use case, so we have decided to describe the modern variant with static configuration files.
setuptools offers experimental support for using only a pyproject.toml. This approach is also used by Poetry, but in the following text we will stay with the stable combination of setup.cfg and pyproject.toml.
In fscat, you can find a Python package with the same functionality as our previous fscat.py script. Have a closer look at its setup.cfg.
One may notice that the necessary dependencies are duplicated in setup.cfg and in requirements.txt. Actually, this is not a mistake. In setup.cfg, you should use the most relaxed version of the dependency possible, whereas in requirements.txt we need to specify all dependencies with precise versions. Transitive dependencies should NOT be present in setup.cfg at all. For more details, see install_requires vs requirements file.
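As an illustration (a hypothetical sketch with made-up version numbers, not the actual project files), the difference might look like this:

# setup.cfg: direct dependencies only, versions as relaxed as possible
[options]
install_requires =
    fs >= 2.4

# requirements.txt: all dependencies (including transitive ones), pinned
fs==2.4.16
appdirs==1.4.4
six==1.16.0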
Try to install this package via VCS Support with the following command:
pip install git+http://gitlab.mff.cuni.cz/teaching/nswi177/2024/common/fscat.git
You perhaps noticed that the setup.cfg file contains the section [options.entry_points]. This section specifies what the actual scripts of your project are. Note that after running the above command, you can execute the fscat command directly. pip created a wrapper script for you and added it to the sandbox $PATH.
fscat tar://tests/test.tar.gz testdir/test.txt
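For reference, such an entry-point section typically looks like this (a sketch; the exact module path in the real fscat repository may differ):

[options.entry_points]
console_scripts =
    fscat = fscat.main:main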
Now uninstall the package with:
pip uninstall matfyz-nswi177-fscat
Clone the repository to your local machine and change directory to it. Now run:
pip install -e .
pip install -e produces an editable installation for easy debugging. Instead of copying your code to the virtual environment, it installs only a symlink-like thing (actually, an fscat.egg-link file, which has a similar effect on Python’s mechanism for finding modules) referring to the directory with your source files.
Building a Python package
Now that we have the proper directory structure, we are only two steps away from publishing the package to a package registry. First, we prepare distribution packages for our code: we install the build package by invoking pip install build. Then we can run
python3 -m build
Two files are created in the dist subdirectory:

- matfyz-nswi177-fscat-0.0.1.tar.gz – a source code archive
- matfyz_nswi177_fscat-0.0.1-py3-none-any.whl – a wheel file, which is the built package (py3 is the Python version required, none and any tell that this is a platform-independent package)
Note that the wheel file is nothing more than a simple Zip archive.
$ file dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl
dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl: Zip archive data, at least v2.0 to extract, compression method=deflate
$ unzip -l dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl
Archive: dist/matfyz_nswi177_fscat-0.0.1-py3-none-any.whl
Length Date Time Name
--------- ---------- ----- ----
51 2024-04-24 10:48 fscat/__init__.py
837 2024-04-24 10:48 fscat/fscat.py
777 2024-04-24 10:48 fscat/main.py
1075 2024-04-24 10:53 matfyz_nswi177_fscat-0.0.1.dist-info/LICENSE
1173 2024-04-24 10:53 matfyz_nswi177_fscat-0.0.1.dist-info/METADATA
92 2024-04-24 10:53 matfyz_nswi177_fscat-0.0.1.dist-info/WHEEL
42 2024-04-24 10:53 matfyz_nswi177_fscat-0.0.1.dist-info/entry_points.txt
6 2024-04-24 10:53 matfyz_nswi177_fscat-0.0.1.dist-info/top_level.txt
769 2024-04-24 10:53 matfyz_nswi177_fscat-0.0.1.dist-info/RECORD
--------- -------
4822 9 files
You may wonder why there are two archives with very similar content. The answer can be found in What Are Python Wheels and Why Should You Care?.
You can now switch to a different virtualenv and install the package using pip install package.whl.
Publishing a Python package
If you think that the package could be useful to other people, you can publish it in the Python Package Index. This is usually accomplished using the twine tool. The precise steps are described in Uploading the distribution archives.
Creating distribution packages (e.g. for DNF)
While the work for creating the project files may seem to complicate things a lot, it actually saves time in the long run.
Virtually any Python developer would be now able to install your program and have a clear starting point when investigating other details.
Note that if you have installed some program via DNF system-wide and that program was written in Python, somewhere inside it there was a setup.cfg that looked very similar to the one you have just seen. Only instead of installing the script into your virtual environment, it was installed globally. There is really no other magic behind it.
Note that, for example, Ranger is written in Python and this script describes its installation (it is a script for creating packages for DNF). Note that %py3_install is a macro that actually calls setup.py install.
Higher-level tools
We can think of pip and virtualenv as low-level tools. However, there are also tools that combine both of them and bring more comfort to package management. In Python, there are at least two favorite choices, namely Poetry and Pipenv.

Internally, these tools use pip and venv, so you are still able to have independent working spaces as well as the possibility to install a specific package from the Python Package Index (PyPI).
The complete introduction of these tools is out of the scope for this course. Generally, they follow the same principles, but they add some extra functions that are nice to have. Briefly, the major differences are:
- They can freeze specific versions of dependencies, so that the project builds the same on all machines (using a poetry.lock file).
- Packages can be removed together with their dependencies.
- It is easier to initialize a new project.
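Just for a taste, a Poetry session might look like this (a sketch; consult the Poetry documentation before relying on it):

poetry new myproject       # initialize a new project skeleton
cd myproject
poetry add fs              # install a dependency and record it in the project files
poetry install             # recreate the environment from the lock file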
Other languages

Other languages have their own tools with similar functions.
Tasks to check your understanding
We expect you will solve the following tasks before attending the labs so that we can discuss your solutions during the lab.
Learning outcomes
Learning outcomes provide a condensed view of fundamental concepts and skills that you should be able to explain and/or use after each lesson. They also represent the bare minimum required for understanding subsequent labs (and other courses as well).
Conceptual knowledge
Conceptual knowledge is about understanding the meaning and context of given terms and putting them into context. Therefore, you should be able to …
- explain the difference between a normal SSH port forward and a reverse port forward
- explain what requirements (library dependencies) are
- explain the fundamentals of semantic versioning
- explain the pros and cons of installing dependencies system-wide vs. installing them in a sandboxed environment
- provide a high-level overview of a sandbox environment
- explain the pros and cons of specifying transitive requirements vs. specifying only top-level ones
- explain the pros and cons of using exact versions vs. minimal requirements
Practical skills
Practical skills are usually about usage of given programs to solve various tasks. Therefore, you should be able to …
- use the xargs program
- use find with basic predicates (-name, -type) and actions (-exec, -delete)
- use an SSH port forward to access a service available on the loopback device
- use a reverse SSH port forward to connect to a machine behind a NAT
- create a new virtual environment for Python using python3 -m venv
- activate and deactivate a virtual environment
- install project dependencies in a virtual environment with pip
- develop a program inside a virtual environment (with projects using setup.cfg and pyproject.toml files)
- install a Python project from its setup.cfg
- optional: set up a Python project for installation