Anyone who’s done data science or bioinformatics knows the headaches involved with programming languages, software environments, packages, and dependencies. It’s not easy, so it’s no surprise people struggle with reproducing computational/statistical work done by other groups. Between operating systems, libraries, and package versions, there are a lot of issues that can make it difficult, if not impossible, for others to use your code and reproduce your work.
Here, I’m going to enumerate the ways that I maintain a consistent and transferable software environment, and how it helps me in my research.
Anaconda [1] is a popular data science platform whose main purpose is to reduce overhead so you can focus on your analysis. It does this by letting users:
- Make siloed and customizable computational environments
- Make these environments easily transferable
- Take the headache out of managing dependencies
The main reasons I like using Anaconda are:
- Availability of languages
- Environments work across all OS’s
- Environments can be defined entirely by a single config file
- It manages package dependencies on its own
- Anaconda can be installed on clusters without root access
Languages & Packages
Do you have a favourite programming language?
Do you get into intense debates over Python 2 vs 3?
Well, no worries, because with Anaconda you can have it all... or most of it, anyway.
Whatever your opinion, Anaconda runs on both Python 2 and 3, and you can use either to set up environments that use the other version of the language.
Anaconda also contains packages for installing and using other programming languages, like R, Ruby, and Perl.
You can check which versions are available by running `conda search <package>`.
This is great because people rarely use the same version of a language, and if you’re looking at analysis from a few years ago, it’s easy to set up a new environment, install the version of the language used at the time, and run whatever you need to.
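To make this concrete, here's a sketch of what that might look like on the command line (the environment name and version pins are made up for illustration):

```shell
# List the versions of R available in the configured channels
conda search r-base

# Hypothetical: recreate an older analysis environment with pinned versions
conda create -n legacy-analysis python=2.7 r-base=3.4.1
conda activate legacy-analysis
```

Once activated, anything you run in that shell uses the pinned versions, independent of whatever else is installed on the machine.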
The same is true for all other packages, not just languages. If a paper uses a particular version of PyTorch [2], for example, you can install and use it without affecting the version you rely on in your own projects.
Environments can be exported to a YAML file so that all the dependencies and versions of installed packages are specified exactly.
This makes it incredibly easy to maintain environments across computers: all you need to do is run `conda env export -f env.yaml` on one machine and `conda env create -f env.yaml` on the other.
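For illustration, an exported file looks something like this (the environment name, channels, and pinned versions here are all hypothetical):

```yaml
# Illustrative env.yaml -- every name and version below is made up
name: my-analysis
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7.3
  - numpy=1.16.4
  - r-base=3.6.0
```

`conda env create` reads this file and rebuilds the environment with exactly these versions.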
Including your environment file is also an excellent practice when publishing. Not only do you explicitly list all the packages and versions you’ve used, it also makes it easy for others to pick up where you left off, and not worry about what machine you ran your analyses on.
Drawbacks of Anaconda
Anaconda is great, but it isn’t perfect. By its nature, you can only install packages that are hosted in Anaconda repositories, which means the version of a package you want or need to use might not be available. This is the biggest issue, for me, but it’s not something that I run into often.
NB: There are ways around this, but I’m not going down that rabbit hole here
Anaconda also takes up a lot of space (the installer is ~500 MB), so depending on your storage allotment this may or may not be feasible (e.g. limited shared storage space on a cluster).
Snakemake [3] is a workflow management system for creating “reproducible and scalable” workflows. It runs on Python 3 (sorry, Python 2 users), and is all about simplifying the commands you run in order to generate specific files.
Workflow steps are written as “rules” and are placed in a Snakefile.
The main benefits of Snakemake, to me, are:
- Predefined rules make performing the exact same commands very straightforward
- You can generate a workflow diagram, and the Snakefile explicitly lays out all of your steps
- Every step is specified in a text file, which can be easily shared
- Snakemake works well with clusters
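To give a feel for the format, here is a minimal, hypothetical Snakefile (the file paths, sample names, and scripts are all stand-ins, not a real pipeline):

```
# Hypothetical Snakefile: every path and script name is illustrative.
# The first rule is the default target: asking for it pulls in everything below.
rule all:
    input:
        "results/summary.csv"

# Trim each raw file; {sample} is a wildcard matched against output names.
rule trim:
    input:
        "data/raw/{sample}.fastq"
    output:
        "data/trimmed/{sample}.fastq"
    shell:
        "python scripts/trim_reads.py {input} > {output}"

# Combine the trimmed files into one summary table.
rule summarize:
    input:
        expand("data/trimmed/{sample}.fastq", sample=["A", "B"])
    output:
        "results/summary.csv"
    shell:
        "python scripts/summarize.py {input} > {output}"
```

Running `snakemake` then works out, from the file names alone, which rules need to run and in what order.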
Snakemake is available from Anaconda, so making use of both of these together allows you to completely specify your environment, and know that your workflow has access to all the resources it needs.
My main use case is for preprocessing data. I preprocess raw data in very consistent ways for different datasets, so having these rules simplifies my personal workflow before I start doing analyses.
It also helps if you’re performing the same analyses on different data, but I haven’t used it as much in this context.
You can specify as many rules as you want in your Snakefile.
If you want to define your entire pipeline, from downloading raw data to figure generation, go right ahead.
It can get a bit complicated and messy, though, so you have to find the right balance.
My main gripe with Snakemake is that because steps are based on output file names, consistent file naming when you have to do similar-but-not-quite-identical processing can get a little hairy. If you have some if-then case depending on the underlying data, but want to keep names consistent, it’s tedious. You can define explicit rules or put constraints on certain rules to avoid this, but like I said above, it’s about finding the right balance of simplicity and usability.
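As one sketch of the constraint approach (the rule and pattern here are hypothetical), Snakemake's `wildcard_constraints` directive lets you restrict which file names a rule can match:

```
# Sketch: only let this rule fire for samples whose names start with "mouse_",
# so a differently-processed dataset with similar names goes to another rule.
rule process_mouse:
    input:
        "data/raw/{sample}.fastq"
    output:
        "data/processed/{sample}.csv"
    wildcard_constraints:
        sample="mouse_.+"
    shell:
        "python scripts/process.py {input} > {output}"
```

The regular expression keeps the output naming scheme uniform while still routing files to the right rule.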
I consider code work a lot like HR documentation: if it’s not written down and documented, it didn’t happen. Readme files are particularly useful for documenting what you’ve done and why.
Python, R, or Bash scripts serve their purpose, but anything you don’t run from a script (simple command-line statements, for example) should be documented somewhere, in my opinion. A readme is also a good place to give some context for what you’re doing: a sentence or two can really help future you come back to your work and understand what you were trying to accomplish.
Version control for data science is a must, in my opinion. I’ve given a couple presentations on its use and necessity, if you’re interested [4]. I version all of the code I execute and small output files, where appropriate.
It doesn’t really matter whether you use GitHub, Bitbucket, or some other hosting service, but using one nicely ties all the previous topics together.
Readme files are automatically rendered when viewing the repo online, which makes them much easier to read. Your exported Anaconda environment YAML and your Snakefile can be versioned alongside the rest of your project, and you end up with a history of all the work you’ve done in its various forms.
These are all excellent things that allow you to share your work with collaborators, and the broader scientific community.
All of the above make it easier for you and others (as well as future you) to look at what you’ve done, understand it more easily, and reproduce it if necessary. This is by no means the only way to achieve a consistent and reproducible computational environment, but it has a lot of pros, in my opinion.
I’ll naturally toy with my setup to see how to improve it, but I’ve spent enough time experimenting that I feel relatively confident in this setup.
Hopefully this will be a useful resource for others, and I’m always happy to hear alternative suggestions.