Dependency hell is a problem in bioinformatics. I’ve written before about trouble installing packages and how to maintain a computational environment. There are many legitimate reasons for this complexity. However, as bioinformatics matures as a field and its software becomes more ubiquitous, I believe it is important for software developers and researchers to rein in this complexity, lest it grind the field to a halt.
## How does dependency hell happen?
Let’s say you’re working on a research problem and need two different tools, named A and B. Both of them rely on code from some dependency, package C. If A depends on version 1.0 of package C and is incompatible with version 2.0, while B depends on version 2.0 and is incompatible with version 1.0, you’ve hit a problem. If you can’t have multiple versions of package C installed, there is no way to build a single software environment with both A and B installed.
The satisfiability of statements like “I have a single software environment with both A and B installed” is at the heart of an interesting branch of computer science: Boolean satisfiability (SAT). The theory in this branch underlies modern package managers like pacman, conda, and apt.
You can see how even two packages can make a software environment unsatisfiable. Many packages have dozens or hundreds of dependencies, and managing them all is no easy task. For the example above, there are two ways around the problem: keep separate software environments (one with A installed, another with B) or allow multiple versions of package C to be installed side by side. Extra software is required to do either of these things, and that extra software adds a layer of complexity on top of your initial problem, which is probably something like “I want to use packages A and B to solve problem X”. But as you can see, to solve problem X, you first need to solve the meta-problem of “how can I make A and B work together?”.
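The conflict can be made concrete with a brute-force check. This is a toy sketch, not how a real package manager works (real solvers handle thousands of packages and ranges of versions): the tool names and version constraints below are invented for illustration, and we simply enumerate every candidate version of C to show that no single version satisfies both A’s and B’s requirements.

```python
# Toy model of the A/B/C conflict: each tool constrains which version of
# package C it can work with. A single environment must pick ONE version
# of C that satisfies every installed tool.

requirements = {
    "A": lambda c_version: c_version == "1.0",  # A needs C 1.0, breaks on 2.0
    "B": lambda c_version: c_version == "2.0",  # B needs C 2.0, breaks on 1.0
}

candidate_versions = ["1.0", "2.0"]

def satisfiable(tools):
    """Return every version of C that satisfies all of the given tools."""
    return [v for v in candidate_versions
            if all(requirements[t](v) for t in tools)]

print(satisfiable(["A"]))       # ['1.0']  -> an A-only environment works
print(satisfiable(["B"]))       # ['2.0']  -> a B-only environment works
print(satisfiable(["A", "B"]))  # []       -> no single environment works
```

The empty list on the last line is the whole problem: the constraints are mutually unsatisfiable, so no amount of reinstalling will produce one environment with both tools.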
## What problems does dependency hell cause?
For many computational biologists, this meta-problem is outside the scope of their work. This is completely reasonable. You shouldn’t have to know the internals of dozens of packages to attempt an analysis. Imagine if you needed to know the precise internal representation of a git commit just to create a repository.
Needing to solve this meta-problem takes time away from solving the original problem and, in my experience, creates additional problems:
- Creating and managing delicate software environments that are hard to adapt to new problems and may not work on other computers
- Reinventing the wheel for functions that already exist elsewhere
- Never answering the original question
If you are someone making software for computational biologists, what can you do to improve this situation?
To answer this question, I’d like to start by looking at two modes that computational biologists often work in: “research” and “development”.
## Research is not development
Some computational biology work is accomplished by using existing tools to solve your current problem. This is what I call “research”. You’re trying to answer a question using some computational method, like differential analysis between multiple conditions or seeing if method A is better than method B for your application.
Once a research problem and a way to solve it have been identified, computational biologists switch into “development” mode: crafting a solution to a research problem that others can use in their own research, too. A research question could be “what part of which chromosome does this DNA sequence come from?” and the software tool to answer that question could be BLAST.
To see how to help computational biologists, let’s identify what “research” looks like and see if decisions in “development” help or hinder that work.
## Software in biology research
Computational biologists often do the following:
- Install and use lots of software
- Work with large data files of many formats
- Iteratively design and run computational experiments
- Calculate statistics on experiments
- Iteratively create many versions of figures
From this short list, we can identify a few key criteria to make a computational biologist’s life easier:
- Make it easy to install
- Make it easy to use
- Make it fast
- Use standardized formats when possible for both input and output
- Make output formats easily ingestible for statistical calculations[^1]
- Reduce complexity where possible to make data visualization simple
Points 4-6 will depend on the context, but points 1-3 will be useful in all contexts. So as a first pass, software developers can make their computational biology tools good for all use cases by focusing on these three points. The rest of this post will focus on how to improve these points through two aspects of software development: linking and language.
## Make it easy to install: use static linking
Many tools bundled in operating systems overlap in their dependencies. All the tools written in the C programming language, for example, make use of the `libc` standard library. It would be overkill for every single tool that uses `libc` to package its own copy. That would take up valuable space that computers, especially in the early days of computing, just didn’t have. To better share resources, tools can connect to libraries and packages, like `libc`, through a mechanism called dynamic linking. You still need to manage the dependencies of all your tools, but you reduce the burden of having a lot of them.

The opposite of dynamic linking is static linking. Static linking, in essence, copies the dependency’s code into the tool itself, avoiding the conflicting-dependency problem outlined above.
| Scenario | Better linking style | Why |
| --- | --- | --- |
| Resource allocation when multiple processes use the same library | Dynamic | Resources are shared |
| Compilation time | Dynamic | Less code to compile |
| Compiled application size | Dynamic | Less code bundled into each binary |
| Security patches | Dynamic | Only the library needs patching, not every program that uses it |
| License violations | Dynamic | No accidental packaging of software with a license that conflicts with yours |
| Dependency hell | Static | The correct version is guaranteed |
| Development glue code | Static | No special code for checking or accessing shared resources |
| Code locality | Static | Only the required functions and variables are referenced, not the entire shared resource |
| Installation | Static | Compile once per architecture, then distribute; dynamic linking more often has constraints on local libraries |
How do these scenarios relate to bioinformatics? If dependency hell and installation are major problems, then clearly static linking is the winner. Moreover, in the computational biology “research” context, many of the scenarios where dynamic linking is better simply don’t apply.
Many bioinformatics tools are run as a single process; you won’t have thousands of alignment processes running at the same time, so shared resource allocation is not an issue. Research involves using tools, not compiling them, so compilation time is not an issue. Computational biologists routinely handle multi-GB files, so growing a program’s size by 20-30 MB has a negligible effect. Security concerns in bioinformatics primarily come from data access, not from the processes themselves, so security patches are typically not an issue. And finally, most research happens in an open setting where code is shared and freely licensed. There is rarely proprietary code in a “research” context, so license violations are not a major issue.
In summary, to make your software easy to install, use static linking by default. Are there exceptions to this guideline? Of course. But easy-to-install software as the default is better for computational biologists.
## Make it easy to use: create binaries
This decision is closely related to static linking. Compiling a program into a single binary executable makes it more portable and easier to use on the command line. You don’t have to guess how much memory to allocate with a `java -Xmx` command. You don’t have to specify the exact path in a `python /path/to/random/script.py` command. You don’t have to ensure that all the related libraries are on their correct paths for an `Rscript /path/to/other/script.R` command. Providing a single entrypoint via a binary executable on the command line makes using the tool easier[^2]. Installing the binary to a folder on your `PATH` also helps, as does adding a command line interface and some documentation.
## Make it fast: use low-level languages
In a “research” context, there are a lot of exploratory analyses, test cases, and back-of-the-envelope calculations. Being able to get an answer quickly and easily is paramount. If you use a low-level language you’ll spend more time thinking about the engineering of your problem than the problem itself. This is the critical point to consider when picking a language and can explain the dominance of high-level languages like Python and R in “research” contexts.
This decision is a trade-off. What high-level languages give with one hand in simplicity, they take away with the other in slower, less efficient computation. For computational biologists, that speed and efficiency in the tools they run is paramount for answering their research questions quickly.
In a “development” context, your focus is less on exploring the space of a research problem and more on solving a particular problem with identified solutions. Your main task is to engineer a good solution. Choosing a low-level language, then, makes much more sense in this context.
As you can see, the application scope is the main determining factor here. Research? It’s probably easier to build a prototype and explore your problem in a high-level language. Development? You know what you’re making and want to make it fast when others use it, so focus on a low-level language. If it needs to be a library, and not a binary, you can develop in your low-level language and provide a foreign function interface for your language of choice.
This is what many scientific computing libraries, like NumPy and DESeq2, do. The computationally-heavy tasks are mostly written in C, then interfaces are written in Python and R, respectively, to make those functions available in the higher-level language. C, Fortran, and Rust are often used like this because of their efficient computations and compatibility with high-level languages.
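A minimal illustration of the foreign function interface pattern, using Python’s built-in `ctypes` to call a function from the C math library. This is a sketch of the mechanism only, not of NumPy’s or DESeq2’s actual bindings, and it assumes a system where the C math library can be located (typical on Linux and macOS): the compiled C code does the computation, while Python merely provides the interface.

```python
import math
from ctypes import CDLL, c_double
from ctypes.util import find_library

# Locate and load the C math library. The heavy lifting (here, cos)
# runs as compiled C; Python only marshals arguments and results.
libm = CDLL(find_library("m"))

# Declare the C signature: double cos(double)
libm.cos.argtypes = [c_double]
libm.cos.restype = c_double

def c_cos(x: float) -> float:
    """Call the C library's cos through the foreign function interface."""
    return libm.cos(x)

print(c_cos(0.0))        # 1.0
print(c_cos(math.pi))    # -1.0 (up to floating-point rounding)
```

Real scientific libraries follow the same shape at much larger scale: declare the compiled function’s signature, then expose a friendly high-level wrapper around it.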
Bioinformatics is a unique ecosystem. It bridges biology, software engineering, data science, and statistics, borrowing both boons and baggage from each field. In my opinion, the best way that developers can make software for bioinformatics is to make statically linked binary applications by default. There are, of course, reasons not to take this approach. But by defaulting to this position, developers can make the lives of computational biologists around the world using their tools much easier.
This conclusion has several implications for developers:
- Prioritize programming languages that compile to binary executables. Examples include C/C++, Rust, and Julia.
- Non-compiled languages like Python and R should only be used for research purposes or to provide foreign function interface wrappers around compiled code. If possible, they should not be used for development.
- Commercialization of research methods and tools should still make use of compiled executables, but dynamic linking should be considered.
In summary, you can use this workflow for picking a language to use for your problem. This is based on the arguments above and is, in my opinion, a good first set of questions to ask yourself when starting something new.
[^1]: Read: tables.
[^2]: The main exception to this guideline is that libraries are often useful in research contexts. Take, for example, the GenomicRanges library. If you want to do anything in computational biology in R, you’ll likely need this library. Since its explicit purpose is to be a library, it doesn’t need to be a binary. But if you look at its source code, you’ll see much of it is written in C. See my argument in the next section about low-level languages.