Git flow for scientific analysis

Published: May 11, 2018   |   Updated: January 17, 2021

As I’ve mentioned before, I try my best to keep everything in my scientific workflow as clearly documented and versioned as possible [1]. It’s useful for me, and hopefully it will be useful to others if at some point someone has to look at what I’ve done.

But I’d like to share a bit more about how I set up my working projects, and how I’m taking a twist with git and using it directly within my scientific workflow to keep things tidy.

Workspace Structure

A supervisor of mine recommended I read an article about organizing my projects in a clear and understandable structure [2]. I think it presents a useful mindset about keeping data and analysis separate, documenting what you’re doing, and keeping it all under version control.

By combining ideas from that paper with the structure suggested by Cookiecutter Data Science [3], I came up with my own preferred directory structure for new projects.

[Figure: Directory structure]
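
In text form, the structure looks something like this (folder and file names are illustrative, loosely based on the Cookiecutter Data Science layout):

```
project/
├── README.md            # project overview and setup notes
├── environment.yaml     # conda environment for the whole project
├── data/
│   ├── README.md        # where the data comes from and how to fetch it
│   └── Snakefile        # rules to download and preprocess raw data
└── results/
    ├── 2018-05-01_some-analysis/
    │   ├── README.md    # scope and purpose of this result
    │   └── Snakefile    # rules to regenerate everything in this folder
    └── ...
```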

I clearly use a lot of READMEs and Snakefiles.

I try my best to ensure that every command is written down and saved in a file, with the exception of the snakemake invocations themselves. Those depend on your cluster configuration, or on how many cores you want to run jobs on, but all in all I set up my Snakefiles so that you can navigate into a specific results folder, type snakemake, and regenerate all the results exactly as I used them in my analysis.
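
For example, regenerating a given result should be as simple as the following (the folder name is hypothetical, and newer versions of Snakemake will also want a --cores argument):

```bash
# Navigate into a specific results folder and regenerate
# everything its Snakefile describes.
cd results/2018-05-01_some-analysis/
snakemake
```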

Documenting everything in READMEs has a number of benefits, including support for pictures and Markdown formatting, but I won’t go into the details here.

I like this structure a lot, especially because it feeds into the whole point of this post:

By putting your results in different folders, they remain isolated from each other until you have something important to build on. You can keep your work clean and isolated by using git branches, merging a branch back in only when it shows an interesting set of results that you want to work off of.

Git Flow

For those with professional development experience, you’ve probably heard of this before. Git flow is a branching model designed for software development [4].

[Figure: Git flow model]

Software is released with specific versions from the release branch. Major/minor updates are marked as tags on the master branch, but aren’t all necessarily merged into release right away.

Work is primarily done through the develop branch, by creating specific feature branches that stem from the development version of the code.

Here are the lessons I want to take away from this workflow:

  1. Most work doesn’t need to be seen by others, because it isn’t directly relevant to the end user
  2. Development happens through feature/ branches, not on master or release
  3. Multiple individuals can each work on multiple features simultaneously, since they don’t rely on each other
  4. Important features get merged back to master, which tracks all the milestones with tags
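
In plain git commands, a feature’s life cycle under this model looks roughly like the following sketch (branch and version names are made up for illustration):

```bash
# Branch a new feature off the development version of the code
git checkout develop
git checkout -b feature/new-parser

# ...work and commit on the feature branch...

# Merge the finished feature back into develop
git checkout develop
git merge --no-ff feature/new-parser

# Milestones get marked with tags on master
git checkout master
git tag -a v1.2.0 -m "Version 1.2.0"
```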

Git Flow for Science

I’m going to tweak these takeaways to make analogous statements for scientific and computational analyses:

  1. Most of your scientific/analytical work doesn’t need to be seen by others, because it isn’t directly relevant to the end product (your publication) or the end user (other scientists or reviewers)
  2. Research happens in results/ branches, not on master or release
  3. Multiple individuals can each work on multiple analyses simultaneously, since they don’t rely on each other
  4. Important analyses, results, and data get merged back to master (for publication), which tracks all the milestones with tags

Here’s my adaptation of the previous figure, for scientific analyses.

[Figure: Git flow for science model]

As the diagram shows, you make a new results/ branch for some analysis you want to try. All the work and commits associated with obtaining that result belong to that branch. If that result is useful and important, you merge it back into research (third and fourth rows). If not (which will probably be the case more often than not), you leave the branch unmerged. The accumulation of substantial research milestones merits a tag on the master branch, which is what people see by default when they view your repo online or clone it.
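
Here’s the same cycle in git commands for the science version (branch, analysis, and tag names are hypothetical):

```bash
# Branch a new analysis off the current state of research
git checkout research
git checkout -b results/differential-expression

# ...commit the README, Snakefile, and outputs for this result...

# If the result is worth building on, merge it back into research;
# otherwise, just leave the branch unmerged
git checkout research
git merge --no-ff results/differential-expression

# A substantial milestone gets merged to master and tagged
git checkout master
git merge --no-ff research
git tag -a milestone-1 -m "First set of publishable results"
```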

The key point is that all your work is still there and accessible. Even if it isn’t relevant to your end product, or a reviewer doesn’t care about a particular result, you can still keep track of it; it just remains hidden from most viewers.

This does mean that, once you make your repo public, anyone interested enough can take a look at all the work you’ve done for that project (i.e. within that repo).

This can be a positive or a negative. On the positive side, other interested parties who want to see and understand the most technical aspects of your work can do so freely. They can see all the pitfalls, and possibly avoid making some of the same mistakes. They can also see the full scope of your work: all the avenues you explored, all the effort you put in, and how much you had to cut away to achieve what you did.

The negative is that someone else may capitalize on some unfinished work of yours by continuing a branch that you left alone or didn’t pursue. This could be mitigated with some git history rewriting, but that might not be the best route to take.

The reason I like the workspace structure described above, in conjunction with this branching method, is that any results/ branch worthy of inclusion in research can be merged easily and painlessly. Moreover, it allows others to view all of your interesting results in separate folders, each with a clearly defined scope and purpose. If one result relies on earlier work, you’d mention that in the result’s README, and the dependency would be visible in the branching history.

And last but not least, this makes your work more replicable. If you change computers to do some analysis, just git pull and all your changes are synced. If your data files are too large to sync (or you don’t want to pay for Git LFS), the READMEs and Snakefiles should make it straightforward to download those files again and perform any preprocessing as necessary.

Couple all that with conda environments (see environment.yaml in the root folder) and you can easily start working on another machine with minimal headaches.
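
Getting started on a new machine then looks something like this (the repository URL and environment name are placeholders):

```bash
# Grab the project, with all of its documented history
git clone https://github.com/you/project.git
cd project

# Recreate the software environment from the root folder
conda env create -f environment.yaml
conda activate project
```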

Conclusion

I’ve introduced a style of using git for data and scientific analysis that fits with my overall workflow in the hopes of sharing this idea with others and/or getting feedback on it.

If this setup makes it easier for you to keep track of and document your work, it will certainly make it easier for others to view and understand that work, as well as replicate your findings.

Update: 2021-01-17

Despite the theoretical appeal of this method, in practice I found it got in my way, mainly because comparing results meant constantly switching branches, and that was something I needed to do often.

Another reason this method didn’t work was that I couldn’t put all my results under version control. For example, if I produced a file larger than 100 MB, it wasn’t easy to version without adding extra complexity to the project. And when I didn’t version it, the file would still be visible after switching branches. This led to conflicting files and didn’t keep things clean like I wanted.
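
To make the problem concrete, this is the kind of sequence that kept happening (branch and folder names are made up):

```bash
# On one results branch, generate a large output that can't be committed
git checkout results/big-analysis
snakemake                      # writes a > 100 MB file that stays untracked

# Switching branches doesn't remove the untracked file...
git checkout results/other-analysis
ls results/big-analysis/       # ...it's still sitting there
```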

It also meant adding files from other branches, files that weren’t supposed to exist on the current branch, to .gitignore. This was unfortunate, because the entire motivation for using branches this way was to isolate different results so that they wouldn’t leak out and affect each other.

For these reasons, I actually don’t recommend this git-flow-for-science method. It was a neat idea, I think, but one that just didn’t work out in practice.

References & Footnotes