Pragmatic guidelines for bioinformatics software tools

Published: May 27, 2019   |   Updated: June 15, 2019

Designing robust, understandable, and efficient software tools for bioinformatics has its challenges. There are numerous excellent articles on how to design and manage good software, so I won’t repeat those points here (see Taschuk and Wilson, 2017 [1] and Lee, 2018 [2], for example). To further this dialogue, I would like to raise a couple of points that I rarely see mentioned in these types of discussions. These guidelines are specifically about the internal workings of your tool and the parts of it that are visible to the end user, not about how to package, distribute, or maintain it.

Before we begin, I’d like to explicitly state a premise of this article: bioinformatics happens in the command line.

Lots of exploratory work and finishing touches happen in RStudio or Python’s interactive mode. But computational analyses and (pre)processing are almost exclusively done from the command line. Running scripts and tools, passing input files, and aggregating steps as part of a pipeline all happen from the command line, not through interactive sessions. Because of this, unless one is developing a library for a given language, software should be built with this mentality in mind: this is how users are going to interact with your tools. This premise underlies each of the following guidelines, so it’s necessary to make it explicit up front.

Install your tool as a command line utility

Tools designed for a specific purpose will be primarily used from the command line or as a function in a language, like Python or R. If you’re developing a library for a specific language that relies on language-specific data structures, then you can likely ignore this guideline.

But if you’re developing a tool for a purpose that isn’t dependent on a specific language, like converting between formats or performing calculations from BED files, it doesn’t matter whether the end user uses the language you designed the tool in or not. What matters is that the tool is easy for the end user to use, and it’s always easy for an end user to run a single command from the command line.

An easy stance to take, from the developer’s viewpoint, is to tell end users to run something like python <script> <params> from the command line. This is a single command, as recommended above, and it’s a common occurrence for bioinformatics tools. But this approach suffers from the fact that you have to reference the file where the code lives instead of just running the tool. It puts the emphasis on the code itself rather than on performing the step that the code is programmed to do. It also makes integration with pipelines or cluster computing difficult, since the script file always needs to be present. Creating a command line utility for your tool shifts the focus back to running the tool and allows the user to access it from anywhere (as long as its location is in your PATH).

Good example: Bowtie2 [3]. It is designed to align reads to a reference genome, a purpose that transcends any specific language. Thus, it makes sense to create a command line tool.

Install your tool as a language-specific library, too, if possible

The emphasis of your tool should be on its results and on making it easy to run. If you’re developing a library for a specific language, this means things like making your inputs and outputs compatible with other functions and data structures in the language and providing clear documentation. If you’re developing a command line utility, as suggested above, it’s not necessary to do these things for the language that you code in.

However, doing so is recommended when possible. Creating both a command line utility and a library in the language you develop in gives end users flexibility in how they interact with your software. This allows for easy experimentation and tinkering, as is often necessary in science.

Python packages, for example, can accomplish this quite nicely using entry points [4]. Entry points are functions that are callable from the command line and serve as an entry (hence the name) into the processing that your software tool will perform. Entry points are specified with a single line in the setup.py file of your package, which makes them extremely easy to create.
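
As an illustration, a minimal setup.py might declare an entry point like this (the package name mytool and the module path mytool.cli are hypothetical):

```python
# setup.py -- minimal sketch; "mytool" and "mytool.cli:main" are
# hypothetical names used for illustration.
from setuptools import setup, find_packages

setup(
    name="mytool",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # typing "mytool" on the command line calls mytool.cli:main()
            "mytool = mytool.cli:main",
        ],
    },
)
```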

Good example: MACS [5]. It’s a Python package whose entry point is the macs2 command that you run from the command line, and it can also be imported as a library using import MACS2.

The command line entry point maps to the argument parsing function, which then redirects to the appropriate computational functions as needed. This means that all the processing is done using the functions written in Python, but the entry point serves as an easy wrapper for end users to deal with.
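
A minimal sketch of what such an entry function might look like, with hypothetical names throughout (mytool.core.align stands in for the real processing code):

```python
# mytool/cli.py -- hypothetical entry point module for illustration.
import argparse

from mytool.core import align  # hypothetical library function


def main():
    parser = argparse.ArgumentParser(prog="mytool", description="Align reads.")
    parser.add_argument("input", help="input FASTQ file")
    parser.add_argument("--output", default="out.bam", help="output BAM file")
    args = parser.parse_args()
    # All real work happens in ordinary, importable Python functions;
    # the entry point is just a thin command line wrapper around them.
    align(args.input, args.output)
```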

Expose minimally important steps

Scientists and engineers tinker with systems to see where improvements can be made and where problems can arise. Allowing end users to test different components of your tool is a necessity as this is how science moves forward. This means that all-in-one tools that operate as a black box, while simple to run, are difficult for end users to trust and integrate as a part of their work unless they use every single process in your tool as you’ve created it.

Most functions you write will not be important for end users to see, but there are some major steps that will be important for end users to work with. Think of the major processes that you would display in a workflow describing your program. Each of these major steps is necessary to accomplish your tool’s overall goal, and it’s often useful for end users to run these steps individually.

Consider exposing these internals via functions (for a library) or subcommands (for a command line utility), as sketched below. This also extends the previous two suggestions in giving your end users flexibility.
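
With Python’s argparse, for instance, subcommands can be wired up with subparsers. A minimal sketch, where the step names and functions are hypothetical:

```python
# Hypothetical sketch: each pipeline step is exposed as its own subcommand,
# and a "run" subcommand chains them for users who want the simple path.
import argparse


def truncate(fastq):
    print(f"truncating {fastq}")  # placeholder for the real step


def run_all(fastq):
    truncate(fastq)
    # ...the remaining pipeline steps would follow here


def main():
    parser = argparse.ArgumentParser(prog="mytool")
    sub = parser.add_subparsers(dest="command", required=True)

    p_trunc = sub.add_parser("truncate", help="run the truncation step alone")
    p_trunc.add_argument("fastq")
    p_trunc.set_defaults(func=lambda args: truncate(args.fastq))

    p_run = sub.add_parser("run", help="run the full pipeline")
    p_run.add_argument("fastq")
    p_run.set_defaults(func=lambda args: run_all(args.fastq))

    args = parser.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()
```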

Good example: HiCUP [6]. The hicup command can run every single step in its pipeline (truncation, alignment, read pairing, filtering, and deduplication), but each component can also be run separately as its own tool. This gives end users choice in how they want to analyze their data, while still offering them a simplified version without needing to know about all the cogs in the machine.

Avoid configuration files for specifying parameters

Many bioinformatics tools require parameters to run. To handle all these parameters, many developers opt for using configuration files. While useful, they have some serious drawbacks for both developers and end users.

The most important drawback for developers is that config files require lots of text parsing. All the minutiae associated with parsing this file and storing each option properly in an appropriate data structure is very easy to get wrong and takes a lot of time to get right.

For the end user, it’s difficult to learn the precise format that the developer expects the configuration file to be in. Parameter names are often spelled incorrectly, leading to lots of trial and error for just running the tool in the first place. It also means that if the user wants to slightly change the run command, they have to edit the configuration file or create a modified duplicate of it. This doesn’t mesh well with pipelines or steps that need to be run on a bunch of samples.

These challenges can be mitigated for both parties by adhering to YAML or JSON formats, but configuration files should largely be avoided in favour of command line argument parsers. Argument parsers are extremely effective at handling command line arguments, offer lots of flexibility, and are very widely used. Good examples include Python’s built-in argparse module, the click package for Python, and the optparse package for R.
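
As a sketch of the alternative, here is what a small set of parameters might look like with click; the option names and defaults are hypothetical, and every flag has a sensible default, so there is no configuration file to write, parse, or maintain:

```python
# Hypothetical sketch using click: parameters live on the command line,
# with defaults, so no configuration file needs to be written or parsed.
import click


@click.command()
@click.argument("bed_file", type=click.Path(exists=True))
@click.option("--genome", default="hg38", help="reference genome build")
@click.option("--min-score", default=0.0, type=float, help="score cutoff")
def main(bed_file, genome, min_score):
    """Filter BED intervals by score."""
    click.echo(f"Filtering {bed_file} against {genome} at cutoff {min_score}")


if __name__ == "__main__":
    main()
```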

Good bioinformatics tools balance the tradeoff between having enough parameters to remain flexible for different applications, and not having too many to overwhelm new users.

Good example: HiCUP [6]. Each step in this Hi-C preprocessing tool can make use of either command line arguments or a configuration file, and there are a small number of parameters for each step.

Show measures of progress for steps that take more than a few seconds

If you run a command and the terminal is unresponsive, it’s difficult to know what’s going on. Are you loading a large data file and it’s taking a lot of time? Are you stuck in a loop and you don’t know when you’ll get out? Is the step waiting on user input to proceed? Or is everything running just fine?

This isn’t a concern if the step finishes in a few seconds, but it becomes more and more of a concern the longer a step runs. Since bioinformatics often deals with files that are GBs in size, these long waiting times are common. Take advantage of STDERR and STDOUT to provide the end user with information about how the step is performing, to let them know that everything’s OK, to track the step’s progress, or to inform them of what error has occurred.
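
A minimal sketch of this idea: progress messages go to STDERR so that actual results on STDOUT remain pipeable.

```python
# Sketch: report progress on STDERR; STDOUT stays clean for piped results.
import sys


def process_records(records):
    total = len(records)
    for i, record in enumerate(records, 1):
        # ...the actual per-record work would happen here...
        if i % 1000 == 0 or i == total:
            print(f"processed {i}/{total} records", file=sys.stderr)


process_records(list(range(10_000)))
```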

Good example: Sambamba [7]. All of its subcommands have a -p flag that allows you to monitor the command’s progress, if you wish to.

Include options for verbosity

While showing measures of progress can be informative, overwhelming end users with a deluge of information makes it difficult to make sense of anything. A happy medium can be struck by making use of verbosity parameters in functions or as command line arguments. Using a discrete scale from 0 (completely quiet) to some N (extremely verbose) allows end users to select how much information they want to see, while not ignoring the previous tip.
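
One common way to implement this (a sketch, not any particular tool’s code) is a counted -v flag mapped onto Python’s logging levels:

```python
# Sketch: -v enables INFO messages, -vv enables DEBUG; default is quiet.
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", action="count", default=0,
                    help="increase verbosity (-v for INFO, -vv for DEBUG)")
args = parser.parse_args()

level = {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
logging.basicConfig(level=level)

logging.info("shown with -v or higher")
logging.debug("shown only with -vv")
```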

This is also extremely useful for you, the developer, during the course of debugging or adding improvements.

Good example: conda env export [8]. Using the -v, -vv, and -vvv flags, end users can specify the level of granularity they want to see in output logs.

Inputs and outputs should be stored in field-standardized formats

Bioinformatics is a series of operations on input files that are outputs from other steps [9]. Automating the process of shuttling data along from one step to the next is always important, and one of the easiest ways to do this is to ensure file formats adhere to well-established standards. This means representing data in tables with easily understood headers, BED files, MatrixMarket files, BAM files, and more. Try to avoid using ad hoc formats, since they’re difficult for others to understand and limit your tool’s interoperability with other tools end users may want to try (this isn’t to say that new formats aren’t sometimes necessary) [10].
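
As a trivial sketch (with made-up intervals), writing results as plain, tab-delimited BED keeps them consumable by virtually any downstream tool:

```python
# Sketch: emit a standard BED file (chrom, start, end, name) rather than
# an ad hoc format; the intervals below are invented for illustration.
intervals = [("chr1", 1000, 5000, "peak1"), ("chr2", 2000, 2500, "peak2")]

with open("peaks.bed", "w") as bed:
    for chrom, start, end, name in intervals:
        bed.write(f"{chrom}\t{start}\t{end}\t{name}\n")
```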

Good example: HiCUP [6]. All of its tools produce FASTQ, BAM, or BED files. These are all well-established and universally used file formats. Using these as inputs and outputs ensures maximum interoperability with other tools.

Bad example: Juicer [11]. For specifying restriction digest sites, its online documentation [12] specifies that genomic positions should be listed as space-delimited start positions, with the first number being the chromosome. This is a bad format for a variety of reasons, the first being that there is already a universally accepted file format for specifying genomic positions: the BED file. Secondly, this restriction site format has lines of variable length, making it difficult to query and parse in most programming languages. Thirdly, the size of the restriction site depends on the restriction enzyme, information that is kept separately from the restriction site file. You can’t look at that file and know where in the genome a restriction site will cut; you can only recover explicit locations by combining the file with external information and performing additional calculations. All in all, this makes it difficult to use effectively for anything outside of the Juicer set of tools.

Output files should have static and predictable file names

The previous section’s point about pipelines being a series of operations remains important here, too. Tools become cumbersome and difficult to automate when the end user has to manually specify which files to use for which step. Tools that provide options to specify a file’s output name (e.g. --output), output directory (e.g. --outdir or --output_dir), or output prefix (e.g. --prefix) mean that end users know precisely what output files will be produced and where. If your tool produces multiple output files instead of one, be sure to explicitly list all the suffixes or output file names that will be produced in the tool’s documentation.
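
A minimal sketch of this idea, with hypothetical suffixes:

```python
# Sketch: derive every output name from a user-supplied prefix so that
# pipelines can predict them; the suffixes here are hypothetical.
def output_paths(prefix):
    """Return the fixed, documented set of outputs for one run."""
    return {
        "bam": f"{prefix}.bam",
        "summary": f"{prefix}.summary.txt",
    }


paths = output_paths("sampleA")  # sampleA.bam, sampleA.summary.txt
```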

Avoid forcing the user to adhere to a directory structure that you yourself have used. For example, don’t expect that a user will put all of their input files in the Input/ directory and that all output files will be produced in Output/. Often, end users have input files scattered across a variety of directories, and copying or linking to them becomes tedious and disruptive to the way they’ve structured the data for their projects. Implementing this flexibility in your code does take some effort, but this relatively small endeavour drastically lightens the load for the end user.

Good example: HiCUP [6]. In its online documentation [13], it explicitly states the output file names for each step given the input.

Bad example: HiCUP [6] (sadly, it’s not perfect). Summary reports for each step have a timestamp in their file names, as well as an alphanumeric ID that isn’t related to the input file name. While the important BAM files can be easily piped to other processes, summary reports can’t be manipulated automatically without the end user crafting some complicated regular expressions.

Parallelization is your friend, but it’s difficult

Given that genome sequencing files are often on the order of GBs, it can take a long time to process any individual file. Parallelization of your steps can reduce the time to completion by a full order of magnitude, vastly improving the end user experience.

However, many developers avoid parallelization because it’s difficult to implement and get right. This isn’t a comment on the level of software development in the bioinformatics community. It’s simply a fact that even experienced software developers find it difficult to properly implement threading, handle asynchronous calculations, and allocate resources effectively [14].

Parallelization shouldn’t affect the underlying mathematical properties of your code; it is almost always an engineering feature rather than a scientific one. So while useful, implementing parallelization should be left to later development stages, and typically only when necessary. It’s also often a good idea to consult an experienced software developer when designing how this code will behave.
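
That said, coarse file-level parallelism is often within reach. A sketch using Python’s standard library, where process_file stands in for the real per-file work and the pool size would come from a -t/--threads option:

```python
# Sketch: process independent files in parallel with a worker pool sized
# by the user's thread count; process_file is a hypothetical step.
from concurrent.futures import ProcessPoolExecutor


def process_file(path):
    # ...the actual per-file work would happen here...
    return path


def run(paths, threads):
    with ProcessPoolExecutor(max_workers=threads) as pool:
        for result in pool.map(process_file, paths):
            print(f"finished {result}")


if __name__ == "__main__":
    run(["a.fastq", "b.fastq", "c.fastq"], threads=4)
```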

Good example: Sambamba [7]. Not only are its native algorithms more efficient than those of Samtools [15], it also adds an extremely simple interface for enabling parallel processing: the -t option. This is an extremely good implementation of parallelization that is very easy for end users to adopt while delivering considerable speed-up.

Make Python 2 packages compatible with Python 3

If you’re developing in Python, you’ll know that there’s a division in the community about using Python 2 and Python 3. While there are legitimate points of contention here, the fact remains that Python 3 was released in 2008, Python 2 reaches its end of life in 2020, and package managers like pip will eventually drop support for Python 2.

Forcing users to stick with Python 2 imposes constraints on the end user, and often requires that they migrate their entire set of tools to Python 2, or go through the process of having a separate environment for your tool. It’s a frustrating experience, to say the least, to manage all of those dependencies for each small step in whatever the end user is trying to accomplish.

Sometimes a crucial package you rely on is only available in Python 2, and it’s out of your control. It’s unfortunate, but it happens. But if you can avoid it, it makes life for the end user much easier.
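
For code that must straddle both versions, __future__ imports let the same source behave like Python 3 even under a Python 2 interpreter; a minimal sketch:

```python
# Sketch: these imports make division and printing behave like Python 3
# under Python 2, easing a gradual migration.
from __future__ import absolute_import, division, print_function

print(7 / 2)  # 3.5 under both interpreters, not floor division
```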

Bad example: MACS [5]. To be fair, this tool was first released in September of 2008, before Python 3 was released. But version 2 of this tool is widely used in ATAC-seq and ChIP-seq analysis, and it has been actively developed since its initial release (the latest release, at the time of writing, was October 17, 2018). It’s long past due for this package to be Python 3-compatible, but it’s understandable that the engineering costs of making this switch can be considerable.

Conclusions

In summary, there are a lot of factors that go into designing scientific software. Developing software that is flexible, efficient, rigorous, and novel is no simple task, even for experienced software developers.

The guidelines suggested above stem from the idea that end users should have the freedom to interact with your software tools as they see fit and shouldn’t be burdened by your design decisions. Some of these guidelines are more easily attainable than others, but all of them can help improve the end user experience of your software.

Repeatedly mentioned tools like Bowtie2 and Sambamba are extremely good examples of many of these guidelines, which explains why they are so widely lauded in the bioinformatics community.

Many of these guidelines extend beyond bioinformatics and into other fields of scientific programming, as well. Hopefully they can help further the discussion on what constitutes good scientific software and can aid developers in their design decisions.

Please comment or reach out to me if you’d like to discuss. Feedback on these ideas is always welcomed.

Update: 2019-06-15

For more extended discussion of command line argument parsing, see the Hoffman Lab’s Application Command Line User Interface Checklist [16]. It contains a thorough checklist of good practices that extend some of the topics I’ve discussed above.

References & Footnotes