Managing dependencies for reproducible (scientific) software
Fear the dependencies of your dependencies
Using code written by others can save a lot of time, but we are generally better at managing our own code. We diligently commit our code into version control systems like git and write automated tests, but do not at all control (or even list) the code that is written by third-parties. Focusing on personal-code management vs external-code makes sense when the external code changes much less frequently than your internal code. When using a stable commercial product like Matlab, your personal code will likely change more than the external code, so it makes sense to devote more attention to it by learning git. However, an exclusive focus on managing your own code does not work in faster moving worlds like scientific Python.
Like many newer computing platforms (e.g. Web 2.0), the scientific Python ecosystem is powerful, but there is constant churn in the underlying code ecosystem. Marquis packages like pandas and xarray are constantly being improved—and sometimes broken. For example, pandas finally issued a major release (v1.0.0) in July, a crowning achievement…which happened to break xarray, a bug which was fixed in v1.0.1. It’s not just python. For example, I first set up this blog a couple of years ago with hugo and forgot what version I used and didn’t take any notes. When I started writing this post, I installed Hugo on my new desktop, and…promptly hit obscure error messages. Turns out my blog is compatible with v0.55 of Hugo, but not v0.68…go figure. These tedious errors which break working code are frustrating and challenging to debug.
In this way, the open-source ecosystem presents some challenges to scientific reproducibility that did not exist in more stable proprietary environments. To be honest, I expect the typical scientist using MATLAB will be able to reproduce their computational experiments in 5 years without much work. They likely don’t use any external packages, and stick to the APIs available in their platform. On the other hand, even experienced python users might not know how to insulate their projects from the maelstrom of open source development—and may not even understand the need. I know I didn’t until a few months ago. Also, experienced python users are often more likely to seek out bleeding-edge software. Hopefully, this motivates you to develop a strategy for managing dependencies in your software projects.
Dependency Management Strategies
Managing external dependencies is harder than managing personal code for a
few reasons. First, there is much more external code than personal
code. Second, there is no single universally agreed-upon dependency
management tool like
git for personal code management. However, the same
strategies used to build robust personal code can also be used for
other people’s code.
Any software project can be made more robust by
- being simpler (e.g. reducing the number of things it does/number of dependencies)
- widely reusing only dependable/stable code (e.g. avoid v0.1.0 libraries)
- loose coupling (e.g. avoid call
np.xxxxin 1000 places).
Overall, you can manage dependencies using the same strategy you use to avoid getting COVID-19: interact only with your household (your code), keep a social distance with friends (especially helpful dependencies), and avoid large groups of strangers.
It’s harder to manage dependencies spanning multiple ecosystems (e.g. the Linux shell, python, or other languages).
A common way this happens is when a python or other script calls a command line tool like
tar. This can be a really
For example, familiarity with
tar may drive you to write something like this:
import subprocess subprocess.check_call(['tar', 'xf', "--exclude", "unwanted/data", "my_data])
but now your program depends on a system tool which differs on
Mac, so this script may not work if you upgrade your laptop or share it with a colleague.
It’s much better to learn how to use python’s builtin
Applications and libraries manage their dependencies differently. Applications are pieces of software that are intended to be executed by a user to produce a specific result. For example, the web-servers behind amazon.com are applications. So is the data analysis code for my last paper and the Community Earth System Model. On the other hand, libraries are software artifacts that are shared and imported into other pieces of code, such as applications. For example, I used the python library PyTorch in my last paper, and CESM uses the netCDF library to save data.
Library developers cannot control the environments where their code is executed,and libraries are useless if they don’t work with other libraries. To take well known scientific python libraries as an example, if pandas only worked with numpy v1.15.0, but scikit-learn required numpy v1.14.2, then scikit-learn and pandas could not be installed together. Because of this, libraries usually only loosely constrain their dependencies (e.g. use versions more recent than numpy 1.15). These constraints are NOT a promise that the library will work in your software environment since no library developer can test their library against all conceivable combinations of dependencies satisfying these loose constraints. Users will inevitably have to step through this mine-field themselves.
Application developers, on the other hand, often have complete control where their application runs. In the scientific context, often the “application developer” writing the data analysis code and the scientific “user” are the same person. In the broader context, the modern Software as a Service (SaaS) provider usually uses a client-server architecture where the client is on a restricted platform (e.g. the browser, iOS, etc) and the server is completely controlled by the service provider. This, and the fact that applications are the end-users of libraries, gives most application developers the power to simply freeze the churning ocean of their dependencies by pinning or locking the versions of those libraries used. 3
I now turn to the tools that developers can use to manage their dependencies.
Features for Managing Dependencies
You have likely installed software with a package manager.
These generally fall into two categories.
Some are python-specific like
pip, and others can also manage non-python dependencies e.g. Linux’s
apt-get or Anaconda’s
pip install numpy
will install NumPy and all its dependencies.
In many cases, a single package manager might not be enough, so one will need to use the system package manager to install tools like compilers, and
pip to install python packages. This is the problem
conda is trying to solve, but it is not immune itself since there are still packages on PyPi that do not have anaconda packages.
Therefore, interoperability with
pip is important.
Dependency resolution involves installing all the dependencies required by a given package. An even more complicated problem occurs when two packages depend on the same library. What version of this library should they choose? Some package managers put great care into solving the dependency graph such that all version constraints advertised by any package in the installed set are satisfied. For instance, if package A requires package Z of version >= 1.1 and package B requires Z of version <=2, then the tool will install any version between 1.1 and 2. This feature captures some bugs and can help ensure that the APIs used are consistent, which is why package managers like new versions of pip4 and conda include version constraint resolvers.
Of course, sometimes packages lie, and fail to work with every version of a
dependency that they claim to. The process of satisfying all the
version constraints is only a heuristic. Someone needs
to actually test that this set of installed versions works. You can do this via
your own tests, but we are lucky that numerous organizations do this hard work
for us by releasing and maintaining distributions of packages that are tested
to work together. For example, the Debian Linux
Distribution has thousands of python packages
available and only issues stable
releases every few years that only
have small security patches. Therefore, running
apt-get install python3-xarray python3-sklearn on a Debian 10.7 computer will install a nearly identical set
packages until the Debian project stops hosting the packages in 20 years or
so5. Similarly, you can install specific versions of the Anaconda Python
Most tools have the ability to print out a list of installed packages and their
versions. For instance, one can call
conda env export, or
apt list --installed depending on your system. Saving the output of these commands
to a text file, checked into version control, ensures that your version
control system contains a comprehensive description of our installed environment.
This makes it easy to rollback any changes.
Abstract vs concrete dependencies
Modern python packaging specify concrete and abstract dependencies in
separate files. Abstract dependencies are loose requirements like
numpy >=0.5, which don’t include specific version numbers or dependencies. These
are easy to update, but cannot be used to deterministically recreate a
software environment. New tools like poetry
resolve abstract dependencies into comprehensive lists of dependencies with
specific versions, known as a “lock” file. The lock file gives
reproducibility, and the abstract dependencies give updateability.
No matter the tool, the workflow is the same. For instance, if you started with a clean python installation you could install the abstract requirements “numpy” and “matplotlib” with pip like this:
pip install numpy matplotlib
and then save the concrete requirements to disk like this:
pip freeze > requirements.txt
On my machine this requirements file looks like this:
cycler==0.10.0 kiwisolver==1.3.1 matplotlib==3.3.4 numpy==1.20.0 Pillow==8.1.0 pyparsing==2.4.7 python-dateutil==2.8.1 six==1.15.0
You can see that this file includes transitive dependencies of matplotlib and numpy and pinned versions numbers. Luckily numpy and matplotlib are nice libraries, and don’t have many dependencies. In this example the requirements file is the “lock” file, and the command installation the “abstract” dependencies. More advanced tools like poetry simply do a bit more automation and handholding around this two-file structure for managing python dependencies. If you learn nothing else from this blog post, please learn to save both your abstract and concrete dependencies into version control.
Dependencies are even harder to manage because changes to a computer’s installed environment cannot be easily rolled-back if something breaks. In other words, the software environment of the computer has “state” that cannot easily reproduced from a short text file. Without care your computer will become a snowflake—beautiful but unique and impossible to recreate.
The state typically takes a few forms:
- The contents of working memory (RAM) (e.g. environmental variables like PATH)
- Contents of the hard drive (e.g. installed libraries and applications)
- Services over TCP ports (e.g. running a database or webserver).
- Connections with the operating system kernel
Restarting a computer fixes so many problems because it removes a large
amount of state stored in the RAM of your computer. However, it is not easy
to rollback modifications to persistent state such as the contents of
/usr/bin. For this reason, tools that prevent the accumulation of this
state are increasingly bundled with package managers.
Generally, these take two approaches. The more common approach isolates the installed environment of an application to a sub-environment which can easily be deleted and re-built if something goes wrong. For Python projects, virtualenvs are the standard way to create such isolated environments. conda extends the virtualenv concept to include non-python dependencies while docker containers6 and virtual machines create entirely isolated computers-within-a computer. Taken to extreme, you might need to buy a new super-computer and rehire the now-retired sysadmin to replicate your collaborator’s environment.
“Functional” package managers like Nix prevent the accumulation of system
state, by making it “immutable”. For instance, instead of installing
/usr/bin/, Nix will install it to a read-only directory
/nix/store/syapqz1bwxgicwbd6dkcdlxfqn8g6din-rsync-3.1.3/bin. The random
string of characters is a hash of all the inputs required to build
which guarantees that this location will be unique. For example, a new
version of rsync will result in a different hash and new installation
Environment management tools are only useful if they can be readily and easily recreated. When did you last time you installed all dependencies from scratch? If it was a long time ago, your development is probably not very reproducible. Ideally, this should happen automatically on a regular basis. You could use continuous integration or simply delete your local conda environment on a weekly basis. For this to be convenient, you will need to automate the installation procedure by writing some text files that describe and build your software environment.
It is easier to do this when you use tools like docker that couple environment creating process with the application build/testing cycle. For pure Python projects, I recommend testing your project regularly with tox, a tool which creates virtual environments and runs tests within them.
You should strive to capture the state of all the code you use in your personal-code management system. External dependencies are code too, but are harder to manage than personal code. Git manages text files, but it doesn’t manage all the files on your computer’s hard drive or the internet. Therefore you need a tool to reliably build your program’s execution environment from a text file. Tools that bundle package and environment management features make this easier to do, and automation (e.g. continuous integration) ensures that your tool actually works.
Dependency management, software development, and science are tightly coupled. In particular, you should minimize the number of dependencies and decouple your code from those dependencies. Library developers need to be particularly disciplined since they cannot know exactly what external code will be combined with their code. While application developers (e.g. someone performing a specific scientific analysis) can choose what external code they use, they do need to worry about security, licensing, and reproducibility.
This applies to the dependencies of your dependencies as well. It is not uncommon for an open-source project to “lie” about its license; for example, by declaring a permissive license like MIT, but having a dependency containing GPL code. ↩︎
I was not really aware of this until recently. I think part of the reason is that many of us learn software development best practices by emulating what popular libraries like xarray and scikit-learn do, and these libraries don’t pin their dependencies for the reasons discussed above. ↩︎
Included as of v40.3 ↩︎
As of Jan 14, 2021, binaries from Debian Bo (1.3) released in 1997 are still being archived. ↩︎
This is true on Linux. On Mac’s docker runs inside of virtual machine. ↩︎