As I’ve mentioned before, I try my best to keep everything in my scientific workflow as clearly documented and versioned as possible[1]. It’s useful for me, and hopefully it will be useful to others if at some point someone has to look at what I’ve done.
But I’d like to share a bit more about how I set up my working projects, and how I’m taking a twist with git and using it directly within my scientific workflow to keep things tidy.
A supervisor of mine recommended I read an article about organizing my projects in a clear and understandable structure[2]. I think it presents a useful mindset about keeping data and analysis separate, documenting what you’re doing, and keeping it all under version control.
By combining ideas from that paper with a structure listed by Cookiecutter Data Science[3], I came up with my own preferred directory structure for new projects.
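As a rough sketch, that layout looks something like this (the folder and file names here are illustrative, not prescriptive):

```
project/
├── README.md             # top-level documentation for the whole project
├── environment.yaml      # conda environment, for reproducing the software stack
├── data/                 # raw or downloaded data, never edited by hand
└── results/
    ├── 2019-05-01_qc/    # one folder per result or analysis
    │   ├── README.md     # what this result is and how to interpret it
    │   └── Snakefile     # regenerates everything in this folder
    └── 2019-06-12_model/
        └── ...
```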
I clearly use a lot of `snakemake`. I try my best so that any and every command is written down and saved in a file, including the `snakemake` execution commands themselves. These will depend on your cluster configuration, or on how many cores you want to run jobs on, but all in all I try to set Snakefiles up so that you can navigate into a specific results folder, run `snakemake`, and all results will be generated, just as I’ve used them in my analysis. Documenting everything in READMEs has a number of benefits (they support pictures and Markdown formatting, for one), but I won’t go into details here.
I like this structure a lot, especially because it feeds into the whole point of this post:
By putting your results in different folders, they remain isolated from each other until you have something important to work from. You can keep your work clean and isolated by using git branches, and merge branches together when they show an interesting set of results that you want to build on.
For those with professional development experience, you’ve probably heard of this before. Git flow is a branching model designed for software development[4]. Software is released with specific versions from the `master` branch. Major and minor updates are marked as tags on the `master` branch, but aren’t all necessarily merged into a release right away. Work is primarily done on the `develop` branch, by creating specific `feature` branches that stem from the development version of the code.
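A minimal, self-contained sketch of that branching model, using a throwaway repository (the feature name and file contents are made up):

```shell
# toy repository to demonstrate git flow
git init -q demo && cd demo
git config user.email you@example.com && git config user.name you
echo 'print("hello")' > tool.py
git add tool.py && git commit -qm "initial release"
git branch -M master
git checkout -qb develop                  # long-lived development branch

# work happens on a feature branch cut from develop
git checkout -qb feature/new-parser       # hypothetical feature
echo '# parser code here' >> tool.py
git commit -qam "add new parser"

# the finished feature merges back into develop
git checkout -q develop
git merge -q feature/new-parser

# a release: develop merges into master, which gets a version tag
git checkout -q master
git merge -q develop
git tag -a v1.1.0 -m "minor release"
```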
Here are the lessons I want to take away from this workflow:
- Most work doesn’t need to be seen by others, because it isn’t directly relevant to the end user
- Development happens on `feature/` branches, not on `master`
- Multiple individuals can each work on multiple features simultaneously, since they don’t rely on each other
- Important features get merged back to `master`, which tracks all the milestones with tags
## Git Flow for Science
I’m going to tweak these takeaways to make analogous statements for scientific and computational analyses:
- Most of your scientific/analytical work doesn’t need to be seen by others, because it isn’t directly relevant to the end product (your publication) or the end user (other scientists or reviewers)
- Research happens on `results/` branches, not on `master`
- Multiple individuals can each work on multiple analyses simultaneously, since they don’t rely on each other
- Important analyses, results, and data get merged back to `master` (for publication), which tracks all the milestones with tags
Here’s my adaptation of the previous figure, for scientific analyses.
As the diagram shows, you make a new `result/` branch for some analysis you want to try.
All the work and commits associated with obtaining that result belong to that branch.
If that result is useful or important, you merge it back into `research` (3rd and 4th rows).
If not (which is probably going to happen more often than not), you leave the branch unmerged.
The accumulation of substantial research milestones merits a tag on the `master` branch, which is viewed by default when someone views your repo online or clones it.
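In plain git commands, one pass through this loop might look like the following, again in a throwaway repository (the branch, tag, and folder names are made up):

```shell
# toy repository mirroring the science flow
git init -q project && cd project
git config user.email you@example.com && git config user.name you
echo "# My project" > README.md
git add README.md && git commit -qm "project skeleton"
git branch -M master
git checkout -qb research                    # long-lived analysis branch

# try one analysis on its own result branch
git checkout -qb result/qc-filtering         # hypothetical analysis
mkdir -p results/qc-filtering
echo "QC filtering notes" > results/qc-filtering/README.md
git add results && git commit -qm "QC filtering result"

# the result turned out to be important: merge it into research
git checkout -q research
git merge -q result/qc-filtering

# enough milestones accumulated: merge research into master and tag it
git checkout -q master
git merge -q research
git tag -a milestone-1 -m "first set of publishable results"
```

A dead-end analysis would simply stay on its `result/` branch, never merged, but still preserved in the repository’s history.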
The key point is that all your work is still there and accessible. Even if it’s not relevant to your end product, or the reviewer doesn’t care about a particular result, you can still keep track of your work, but it might just remain hidden from most viewers.
This does mean that once you make your repo public, anyone interested enough can take a look at all the work you’ve done for that project (i.e. within that repo).
This can be a positive or a negative. For example, other interested parties who want to see and understand the most technical aspects of your work can do so freely. They can see all the pitfalls, and possibly avoid making some of the same mistakes. They can also see the full scope of your work, all the avenues you’ve explored, all the effort you put into it, and how much you had to cut away to achieve what you did.
The negative example is that someone else may capitalize on some unfinished work of yours, by continuing on a branch that you left alone or didn’t consider. This could be mitigated by some git history rewriting, but that might not be the best route to take.
The reason that I like the workspace structure described above, in conjunction with this branching method, is that any `results/` that are worthy enough to be included in `master` can be merged painlessly and easily.
Moreover, it allows others to view all of your interesting results in separate folders with clearly defined scope and purpose.
In the case that one result relies on earlier work, you’d mention it in the `README` for that result, and you’d be able to see it in the branching history.
And last but not least, this clearly allows your code to be more replicable.
If you change computers to do some analysis, just `git pull` and all your changes are synced.
If your data files are too large to sync (or you don’t want to pay for Git LFS), Snakefiles should make it straightforward to download those files again and perform any preprocessing, as necessary.
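As a sketch of what that can look like, here is a minimal Snakefile with one download rule and one preprocessing rule; the URL and the script name are placeholders I’ve made up, not part of any real project:

```
# Snakefile (sketch): regenerate data on a fresh machine
rule all:
    input:
        "data/filtered.csv"

rule download:
    output:
        "data/raw.csv"
    shell:
        # hypothetical data source
        "curl -L -o {output} https://example.org/raw.csv"

rule preprocess:
    input:
        "data/raw.csv"
    output:
        "data/filtered.csv"
    shell:
        # hypothetical preprocessing script
        "python scripts/filter.py {input} {output}"
```

With something like this checked in, large files can stay out of git entirely: a single `snakemake` call rebuilds them from their source.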
Couple all that with conda environments (see `environment.yaml` in the root folder), and you can easily start working on another machine with minimal headaches.
I’ve introduced a style of using git for data and scientific analysis that fits with my overall workflow in the hopes of sharing this idea with others and/or getting feedback on it.
If this setup makes it easier for you to keep track of and document your work, it certainly will make it easier for others to view and understand your work, as well as replicate your findings.
Despite the theoretical appeal of this method, in practice I found it got in my way: comparing results is something I needed to do often, and doing so meant constantly switching branches.
Another reason this method didn’t work was that I couldn’t put all my results under version control. For example, if I produced a file that was larger than 100 MB, it wasn’t easy to version it without adding extra complexity to the project. And when I didn’t version it, the file would still be visible after switching branches. This led to conflicting files and didn’t keep things clean like I wanted.
It also meant listing files in `.gitignore` that came from other branches and weren’t supposed to exist on the current one.
This was unfortunate, because the entire motivation behind using branches this way was to isolate different results within branches so that they wouldn’t leak out and affect each other.
For these reasons I actually don’t recommend this git flow science method. It was a neat idea, I think, but one that just didn’t work out in practice.