Science has always been a data-driven pursuit. The formulation of hypotheses, the execution of experiments to gather data, and the analysis of that data have always been at the heart of scientific ventures. But today, more than ever, scientific data are abundant. From high-throughput measurements of genomes, to behavioural data collected by technology companies such as Facebook and Uber, to particle accelerator experiments, many scientific disciplines are no longer constrained by the availability of data1.
To deal with this over-abundance of data, software is becoming an increasingly important pillar in the foundation of scientific discovery. From hypothesis generation, to data collection, data curation, analysis, and visualization, scientific software is used throughout all avenues of science, increasingly in all fields. Because of this, it is worth evaluating how scientific ideals impose certain constraints on scientific software and understanding the implications this has for the software itself.
Specifically, I want to describe how scientific ideals necessitate that scientific software be open source, which places it squarely in the economic realm of a public good. This public-good property of software within the sciences, coupled with the novelty factor desired for publishing journal articles, creates an economic structure that incentivizes ad hoc and short-term software development at the cost of long-term software stability and best practices.
The implication of this economic structure, if correct, is that unless it is altered (for example, through funding agencies allocating funds for software maintenance, or through judging scientific merit with respect to software development differently), scientific software will become an increasingly unstable pillar in the foundation of scientific research. This is related to ongoing issues within science, such as the replication crisis, and, if left in its current state, will only exacerbate them.
Fallibility as a foundation for trust
There are many important ideas that lie at the heart of all scientific inquiry. An ever-important one is that humans are fallible: do not rely on what they claim; instead, rely on the evidence that is presented. This is why scientists write articles and share their hypotheses and, more importantly, the data that supports or contradicts certain ideas. Descriptions of what people observe are useful, but often unreliable on their own, even if what is being said is true. The best way to alleviate concerns about unreliability is to share your work as openly as possible. If people care about what you are saying and want to know more, they should be able to independently verify what you claim to have observed.
Computer code, coincidentally enough, is perceived similarly by software developers23. There is no error-free software, only sufficiently reliable and maintainable software. Humans are fallible in all regards, and software is written by humans and executed by machines that humans built. But again, much like scientists relying on evidence instead of claims, developers don’t need to rely on what humans say a piece of software does. They can write tests to check that the code does precisely what one would expect it to. If not, they can investigate why the software is not behaving as expected, and fix it.
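As a minimal sketch of this idea (the `normalize` function and its expected behaviour here are hypothetical, not from any particular project), a developer can encode their expectations directly as a test:

```python
def normalize(values):
    """Scale a list of positive numbers so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]

def test_normalize():
    # The test states what the code should do, independent of
    # any claims its author makes about it.
    result = normalize([2, 2, 4])
    assert result == [0.25, 0.25, 0.5]
    assert abs(sum(result) - 1.0) < 1e-12

test_normalize()
```

If the assertion ever fails, the discrepancy between claim and behaviour is right there to investigate — no trust in the author required.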
In both of these domains, individuals can, to whatever extent they want, gather evidence until they are convinced of the ideas or tools being presented. This is true even if one starts from an inherently untrustworthy set of claims. This shift from trusting people to testing ideas is a logical abstraction that builds bridges out of rubble. And this investigative process is what science is all about. We take nothing for granted and try to build a sensible model of the universe around us through interrogation, while giving others the freedom to disregard us if they so choose.
Open source code is a necessity for science
Being able to judge what calculations another scientist has performed is vital for ensuring that their work is legitimate. While calculations may have been long and laborious in the past, they have typically been of the form “difficult to calculate, easy to verify”. But with computers playing an increasing role in data collection and processing, tables of processed data can be gigabytes in size, containing thousands to billions of data points. Given that all code contains bugs of some kind, it becomes increasingly difficult for the original scientist to check that no errors were introduced during calculations that would affect the end result. Moreover, it is increasingly infeasible for other scientists to independently recompute these results from a generic, but not specific, description of what calculations were performed. Small differences in implementations, especially in code that involves random values, such as clustering or simulations, can lead to large differences in outputs that are difficult to diagnose. This effect is further compounded in cases where access to raw data is legitimately withheld from other researchers, such as data containing sensitive or private information about study participants.
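The point about random values can be illustrated with a toy stochastic step (the function below is purely illustrative, not drawn from any real analysis):

```python
import random

def pipeline_result(seed, n=1000):
    """Toy stand-in for a stochastic analysis step, such as a
    simulation or a randomized clustering initialization."""
    rng = random.Random(seed)
    return sum(rng.gauss(0, 1) for _ in range(n)) / n

# The same code is perfectly repeatable for a fixed seed...
assert pipeline_result(42) == pipeline_result(42)

# ...but an otherwise-identical rerun that differs only in its seed
# (or in the order random draws are made) yields a different number,
# which a generic methods description cannot account for.
print(pipeline_result(42), pipeline_result(43))
```

Without the exact code and seed, a reimplementer has no way to tell whether a numeric discrepancy is a genuine error or just a different draw from the same distribution.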
Data transparency alone is thus not sufficient to ensure that the science is being communicated properly. It allows for explicit sharing of results, but not necessarily “replicable” or “reproducible” research by others. If you, as an independent researcher, cannot validate the data in a meaningful way, you cannot rely directly on the evidence presented. This cuts at the heart of the previous section and erodes your ability to rely on evidence, shifting the allocation of trust back in the direction of the original researcher.
Making analytical code and software open source4 can restore some of this balance. Researchers don’t have to rely on the methods as described, but can look at the actual code that was run to generate the results5. This allows others to spot bugs the original researchers might have missed, flag places where model assumptions might be incorrect, or highlight any suspect portion of code that may lead to undesired results.
Were outliers removed from the dataset before analyzing? Was the population standard deviation used in the hypothesis test, or was it the sample standard deviation? Were the simulation results only true for a particular random seed, or do the results change drastically if a different one is used? Were these matrix products calculated correctly, or was there an overflow error due to the size of the matrices? Were missing values handled properly, or did they propagate further through the calculations? Without access to code, answering these questions is near impossible.
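To make one of these questions concrete, here is a small sketch using only Python’s standard library (the data values are made up for illustration): for small samples, the population and sample standard deviations differ enough that using the wrong one can move a borderline test statistic across a significance threshold.

```python
import statistics

# Hypothetical measurements from a small study
data = [4.1, 5.3, 3.8, 6.0, 4.9]

pop_sd = statistics.pstdev(data)  # divides by n
samp_sd = statistics.stdev(data)  # divides by n - 1 (Bessel's correction)

print(f"population SD: {pop_sd:.4f}")
print(f"sample SD:     {samp_sd:.4f}")
# The sample SD is always the larger of the two; with n = 5 the
# difference is on the order of 10%, enough to change a marginal result.
```

A methods section that merely says “the standard deviation was computed” cannot distinguish between these two; the code can.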
There is also a more nefarious effect of not open sourcing scientific code. It offers researchers an avenue to fabricate results that better support their hypotheses while hiding evidence of tampering from others. In the biological sciences, many researchers have gone to great lengths to alter gel images to fabricate results for their work. This has caused many journals and researchers to take part in an arms race to identify these alterations and take action against those who submit articles with tampered images6. Closed source software can provide the same ability to fabricate results, without producing traces of tampering that are easy to follow. It is also much harder to see that anything is wrong in the first place because of the size of the data that is processed in many fields.
Without access to source code, the results are not able to stand on their own merit; they rest on the trust between researchers. This trust, while often socially acceptable, is an unstable foundation upon which to build future research, because all humans are fallible and no code is error-free.
Scientific software is a public good
Here is where we pivot from “what science and scientific software are” to “what effect does this have on the scientific ecosystem?”. Software code, by its very nature, is non-rivalrous. The fact that I have the code does not restrict you from having it. And the requirement for open source scientific software, as detailed above, also makes it non-excludable. I cannot restrict you from accessing my scientific code unless I want to sully the scientific relationship between us. All are free to copy, investigate, modify, and reproduce the code I publish.
These two properties of scientific code, being non-rivalrous and non-excludable, mean that it is a public good, in the economic sense.
Open source software, more generally, is also a public good. As such, funding for open source software has been a problem for as long as open source software has been around. The inability of free markets to adequately price public goods, the free-rider problem, and the “opportune parasite” problem7 continually plague discussions about funding open source projects. These discussions around how to create companies and products that are public goods while creating moral and sustainable funding structures are rapidly circulating in the field of software development8, but I have yet to see these ideas percolate into the sciences9.
One reason why this discussion may work better in the sciences is not that the goals or motivations are any different, but that the funding sources already appreciate the importance of creating public goods. Venture capitalists fund projects that are likely to produce a good return on investment, where the return is money, which is explicitly not a public good. But for scientific funding agencies, the return is knowledge, which is a public good. Scientific funding organizations do not have the expectation that they will be compensated in the future by the fruits of the labour they fund. This is precisely what venture capitalists don’t, or can’t afford to, understand10.
But the discussion around how to prioritize and fund scientific software is increasingly important. Next, I want to sketch out the severe risk the scientific ecosystem runs by avoiding this conversation, namely, how it will slowly erode the foundation upon which scientific discoveries are built.
A marketplace of half-baked ideas
Poor code and software tools are undeniably a result of having inexperienced programmers. Inexperienced programmers are common in the sciences because their interest lies in the scientific field itself rather than in software and good data curation techniques.
Until this point, I have used “software” and “code” almost interchangeably. There is a subtle difference between the two in how researchers use them. Code is the list of things that have been done, while software comprises the tools that researchers install and use to perform common operations. The difference between them poses different threats to the longevity of scientific discoveries.
Ask any computational biologist what the state of bioinformatics software is like. It’s routinely bad for a large number of reasons. Trying out a new tool is often a bad idea because of the format conversion, installation1112, and troubleshooting (FIT) problems, and, in general, the amount of time dedicated to getting a tool to work properly in the first place. And this is all before you know if the tool can even solve the original problem you wanted to solve. The continued poor state of affairs over decades is evidence of a lack of incentives that prioritize decent-quality software that is a) easy to install, b) works correctly for most relevant cases, and c) easy to use (ICU conditions).
This is not to say that scientists are not trying to counter this phenomenon or have had no success. There are clear exceptions to the claims above, some of which I have written about previously. The excellently-managed Bioconductor, bioconda, and conda-forge projects house thousands of high quality bioinformatics software packages, databases, and annotations that are free to use and often work together. The importance of the infrastructure that these projects provide for modern biomedical research cannot be overstated13.
These projects can help tackle the “easy to install” problem with software (although “dependency hell” within many packages is common), but they do not mean that tools are “easy to use” or that they “work correctly for most relevant cases”. These latter qualities are issues of engineering, not science, and quality engineering comes at a price. Developing, acquiring, and keeping software engineering expertise is something that scientific institutions need to fund. Otherwise, graduate students and post-docs will waste their time re-inventing the wheel because the last wheel that somebody used “worked on my computer”.
Evidence of this wasted time may be found in increasingly long PhDs and reliance on “custom scripts” to handle data pre-processing in publications. Worse yet, scientific projects that generate large datasets will become increasingly difficult to validate, not for transparency reasons, but due to the exponential time investment required to get the half-baked software to work in the first place.
While there will always be high quality software examples to point to, practical experience in this field for years has shown me that these examples are few and far between. Much of my time using bioinformatics software is spent in the ICU dealing with software FITs. General open source software at least has the advantage of crowdsourcing software developers from around the world. But scientific software, being sufficiently niche in nature, cannot afford the same “free rider” calculus that non-scientific open source software can.

It is my current opinion that these successful software tools and repositories have succeeded not because of the incentive structures within academia, but in spite of them. The small number of software tools that can claim to meet the ICU conditions are the exceptions that prove the rule. Software will be made, papers will be written, institutions will talk about all the software that they are releasing for the community. But no one will use it because it doesn’t work and no one can be incentivized to fix it. The proliferation of half-baked software, stemming directly from a poor incentive structure, will only exacerbate the problems felt by scientists today.
I mentioned earlier the downsides of not open sourcing scientific code. But poor-quality code has its own dangers. All of the issues above with software also exist with code, but often in the context of exploring published results rather than producing novel ones.
The novelty factor required to publish new papers in well-read journals increases the likelihood that corners will be cut in creating and documenting code for papers. It simply takes too much time to follow best code-development practices, and many simply don’t know what they are. Moreover, documenting one’s code to communicate precisely what is being done at each step is almost never done unless enforced by one’s research group. This combination of poor communication through one’s code, exacerbated by the pressure to produce new research as quickly as possible, makes communicating with other researchers who are studying your work extremely difficult.
Even clearly designed and written code can take months to understand, because one needs to grasp the mental model the code relies on, not just what is done. Getting others’ code to run in order to investigate and verify published data forces researchers to spend less time thinking about hypotheses and experiments and more time troubleshooting the code. It also makes tracking down errors in others’ work much more time-consuming. The work may be correct, but it is difficult to know when no one can examine it properly without spending an entire month trying to figure it out.
I hypothesize that this longer delay between publication and independent replication will lead original authors to take one of the following routes when responding to questions about replication:
- they will be less likely to recall precisely what they did, so they cannot offer advice on how to replicate their findings
- they may have moved on to other industries by the time replication studies occur and not have the time to diagnose problems encountered during replication
- they will succumb to the sunk-cost fallacy and double down on their erroneous work, despite the fact that neither they nor others can show its validity
- they will not care about the replication because sufficient time will have passed, technologies will have changed, and fads in research will have shifted
But in the long run, I hypothesize that the lack of incentives for producing high-quality code will make it less likely that other researchers will care about replication by the time replication studies could be completed. This means that the act of independent verification will be disincentivized because it will just take so much work in the first place and no one will care when it’s done.
I understand that this is a hyperbolic and cynical view of the future of research. But the increasing reliance on code in scientific data analysis, especially in the “big data era”, means that the luxury of not needing to deal with the consequences of half-baked code is disappearing. When scientists are disincentivized from looking at the brittle foundation around them, they shouldn’t be surprised when their field begins to crumble. We are already beginning to see the effects of this crumbling foundation through anti-intellectual activity and the ongoing replication crisis. We do not need to give those who distrust academic research more reason to do so.
Knowledge, what research problems are worth pursuing, and the impact that knowledge has form an economic marketplace of ideas. This marketplace stems from the unique values in academia, which creates different incentive structures for students, researchers, and institutions. This complicated structure makes studying the consequences of public goods in science a problem that is likely to produce novel insights that both economists and scientists may appreciate. And the consequences of discoveries in this area, particularly with regards to grant funding, code stability, and the effects that these externalities pose for the logical foundations of scientific discoveries, may have broad-reaching effects.
As I have detailed above, it is my opinion that not investigating the effects imposed by this incentive structure puts the future of scientific research at risk. The status quo places an increasing load on poor-quality software and code, but does not incentivize the maintenance required to support this weight. This threatens both the validity of future findings and the pace at which new discoveries will be made. Reassessing academic values and the incentives those values create will be necessary to avoid these risks in most, if not all, scientific fields.
References & Footnotes
Generating appropriate data for testing certain hypotheses is always a problem, and often a trickier problem than one initially thinks it is. ↩
Open source is the term used to describe software whose source code is published for all to see. The antonym of this is closed source. ↩
To some extent, this still requires trust of the original researcher, since you cannot be sure that this is the exact code that was run to generate their results. But in theory, if you have the raw data they started with and the code they used to process it, you should be able to verify, independently of the original researcher, that this code produces the expected results. It is still up to you to determine whether the analysis or code is proper. But at least this provides for another mechanism to independently test whether the published results are as claimed. ↩
You can think of this problem as: “why should I invest in your early-stage, risky project, when all a future competitor needs to do, once you’ve shown your work is valuable, is fork your project and throw more resources at it to outcompete you?”. Large software companies have done this many times. Even recently, Microsoft’s WinGet package manager is a not-so-subtle intellectual theft of Keivan Beigi’s AppGet package manager. Beigi invested the time to lay the groundwork for this project, showed the world there was a market for his work and his specific implementation of the idea, and once the demand was great enough to be noticed by Microsoft, they scooped up the idea without paying Beigi in any meaningful way. Why should I invest in you if you are giving people the knife they can use to stab you in the back? I’m unsure if there’s a term that already exists for this behaviour, so I’m coining “opportune parasite”. Feel free to let me know if there’s already a name for this. ↩
The motivation behind Open Source Collective, GitHub Sponsors, and other funding mechanisms to support open source software development are probably good places to start learning about these problems. ↩
To be fair, I’m not an economist. So I may not have been looking hard enough in the right places to find what I want. ↩
I want to be clear that I’m not saying this is necessarily a bad thing. It is a difference in scope and pragmatism, and each of these sets of values has its own space to thrive and impact societies. ↩
“When computational pipelines go ‘clank’” Nature Methods, June 2020 ↩
The value that these repositories provide to research communities can be recognized by continuing financial support. For example, Bioconductor has received funding from the NIH, the Chan Zuckerberg Initiative, and other agencies since 2003. ↩