Please, show your work

Published: May 29, 2020


Biology papers in the fields of genomics and epigenomics have become increasingly mathematical over the last couple of decades, particularly with the introduction of genome sequencing into the repertoire of most research labs. As in many other research fields with large datasets, machine learning has become a particularly popular tool. However, there is controversy around what role machine learning should play in biological research, and how to interpret the results of papers that use it.

Take this article[1] from a few years ago (emphasis mine):

Although these deep-learning networks can be stunningly accurate at making predictions, Finkbeiner says, “it’s still challenging sometimes to figure out what it is the network sees that enables it to make such a good prediction”.

Still, many subdisciplines of biology, including imaging, are reaping the rewards of those predictions. A decade ago, software for automated biological-image analysis focused on measuring single parameters in a set of images. For example, in 2005, Anne Carpenter, a computational biologist at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, released an open-source software package called CellProfiler to help biologists to quantitatively measure individual features: the number of fluorescent cells in a microscopy field, for example, or the length of a zebrafish. But deep learning is allowing her team to go further. “We’ve been shifting towards measuring things that biologists don’t realize they want to measure out of images,” she says.

For all its promise, deep learning poses significant challenges, researchers warn. As with any computational-biology technique, the results that arise from algorithms are only as good as the data that go in. Overfitting a model to its training data is also a concern.

This feedback cycle of machine learning-based research in biology is very different from what biologists were used to even 15 years ago, and it is changing the landscape of how research is done today.

The role of solutions to research problems

In my opinion, one of the major reasons machine learning is currently both revered and viewed skeptically as a toolbox is the kind of insight that can be gained from using it.

Solutions to research problems, broadly, serve two major purposes:

  1. To provide an answer to the initial question
  2. To provide insight, gained by studying the problem, that can be used elsewhere

Take an example from the mathematical literature: the twin prime conjecture. The conjecture is simple to state: there are infinitely many pairs of prime numbers that differ by two. A proof is notoriously elusive, but the conjecture is widely believed to be true. In 2013, in an attempt to prove this conjecture, Yitang Zhang published a proof[2] of a weaker statement, that \(\liminf_{n \rightarrow \infty} (p_{n+1} - p_n) < 7 \cdot 10^7\), where \(p_n\) is the \(n\)-th prime[3]. In other words, infinitely many pairs of consecutive primes differ by less than 70 million. This proof served these two purposes:

  1. While not answering the ultimate question, it provided the proof of a related result that further suggests the twin prime conjecture is true.
  2. The proof in the paper immediately inspired other mathematicians, who were able to lower the upper bound presented by Zhang. Within a year, the finite bound had been reduced from \(7 \cdot 10^7\) to 246. Terence Tao and James Maynard both produced improved bounds using methods that were simpler than Zhang’s but inspired by his work.
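
Writing \(p_n\) for the \(n\)-th prime, the conjecture and the two bounds can be put side by side. This is only a restatement for orientation, not part of either proof. Since every gap between consecutive odd primes is at least 2, having infinitely many twin primes is the same as the limit inferior of the gaps being exactly 2:

\[
\begin{aligned}
\text{Twin prime conjecture:} \quad & \liminf_{n \rightarrow \infty} (p_{n+1} - p_n) = 2 \\
\text{Zhang:} \quad & \liminf_{n \rightarrow \infty} (p_{n+1} - p_n) < 7 \cdot 10^7 \\
\text{Later improvements:} \quad & \liminf_{n \rightarrow \infty} (p_{n+1} - p_n) \leq 246
\end{aligned}
\]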

Insights from machine learning solutions to biological problems

Machine learning solutions to biological problems, like more traditional biology-based solutions, address both of these purposes. However, machine learning doesn’t address them in a way that many biologists are used to.

Machine learning can serve the first purpose really well in certain cases. Take genome segmentation algorithms, like ChromHMM[4] and Segway[5]. These methods take in various epigenomic datasets from the same sample and output labels relating to the likely function of each region of the genome. It is known that these various epigenetic marks, such as histone tail modifications, chromatin accessibility, and gene transcription, are all important markers of how a cell operates and what different parts of the genome do. But the encoded features that these algorithms extract from the input datasets are not easily interpretable. So while these functional annotations are useful for many purposes, it is not necessarily clear how or why certain regions are annotated as they are.
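
To make the "datasets in, labels out" idea concrete, here is a toy sketch in the spirit of hidden Markov model segmentation. It is not the ChromHMM or Segway implementation. The two state names, the three binarized marks, and every probability are made-up assumptions, and real tools learn such parameters from data rather than having them written by hand. Viterbi decoding is used here simply because it is short.

```python
import numpy as np

# Hypothetical toy model: 2 hidden states, 3 binarized marks per genomic bin.
STATES = ["active", "quiescent"]
start = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1],   # rows: previous state
                       [0.1, 0.9]])  # columns: next state
# emission[s, m] = P(mark m is present | state s), marks treated as independent
emission = np.array([[0.8, 0.7, 0.9],
                     [0.1, 0.2, 0.1]])


def log_emission(observations):
    """Log-likelihood of each bin under each state, shape (n_bins, n_states)."""
    obs = np.asarray(observations, dtype=float)
    return obs @ np.log(emission).T + (1 - obs) @ np.log(1 - emission).T


def viterbi(observations):
    """Return the most likely state label for each bin."""
    log_e = log_emission(observations)
    log_t = np.log(transition)
    n_bins, n_states = log_e.shape
    score = np.empty((n_bins, n_states))
    back = np.zeros((n_bins, n_states), dtype=int)
    score[0] = np.log(start) + log_e[0]
    for t in range(1, n_bins):
        candidates = score[t - 1][:, None] + log_t  # indexed [previous, next]
        back[t] = candidates.argmax(axis=0)
        score[t] = candidates.max(axis=0) + log_e[t]
    # Trace back from the best final state
    path = [int(score[-1].argmax())]
    for t in range(n_bins - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [STATES[s] for s in reversed(path)]


# Five bins, each with presence (1) or absence (0) of the three marks
marks = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 0, 0], [0, 1, 0]]
labels = viterbi(marks)
print(labels)
```

Even in this tiny case, the label for a bin depends on its neighbours and on the full emission table, which hints at why the annotations from a model with many states and marks are hard to explain region by region.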

Which leads me to the second purpose, and how it differs from typical biological research. Machine learning often doesn’t give biological insight, because it doesn’t solve problems with biological solutions. Instead, machine learning solutions give insight into developing better algorithms for biological problems.

Take many of the recent papers from Anshul Kundaje’s lab. A number of them use a “one-hot encoding” of DNA sequences as input to various machine learning models[6, 7, 8]. While all of these methods perform reasonably well[9], the most obvious insight you can get from them is “when developing machine learning models for DNA, a one-hot encoding of nucleotides is broadly applicable and likely useful”. This is obviously not a biological insight, but it is useful nonetheless. And this discrepancy between biological insight and computational insight is what I believe is at the heart of the reverence-vs-skepticism divide.
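
For readers who haven’t seen one, a one-hot encoding turns each nucleotide into a vector with a single 1. The sketch below is a minimal illustration, not code from any of the cited papers. The function name, the A/C/G/T column order, and the choice to leave ambiguous bases such as N as all zeros are my own assumptions.

```python
import numpy as np

# Column order is an arbitrary choice for this sketch; any fixed order works.
ALPHABET = "ACGT"


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 4) array with one row per nucleotide."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        column = index.get(base)
        if column is not None:  # unknown bases (e.g. N) stay all-zero
            encoded[position, column] = 1.0
    return encoded


example = one_hot_encode("ACGTN")
print(example)
```

Nothing in this representation is biological knowledge. It is just a convenient numerical format for a model’s input layer.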

A historical example of the same divide

This divide between solutions and insights is reminiscent of the four colour theorem. The problem can be simply stated as “for any map, how many colours do you need to colour the map such that no two neighbouring regions have the same colour”, and the theorem says that four always suffice. The initial proof of this theorem[10, 11] was controversial because it relied on computers. Appel and Haken started with the space of all planar maps and introduced a “discharging” procedure to reduce the number of equivalent maps they needed to consider. They then used computers to check that each example of the reduced maps is four colourable.

While parts of the proof are useful and provide insights into how to work with graphs, arguably the main insight of the solution is that “we can use computers to check large numbers of cases instead of finding ways to further simplify the problem”. Again, this is not a mathematical insight so much as it is a computational insight. Understandably, this was not entirely pleasing for mathematicians, even if the proof itself is logically sound.
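
To give a flavour of what “checking a case by computer” can look like, here is a small backtracking search that tries to colour one map with a given number of colours. It is not the procedure Appel and Haken ran; their computation worked on a large catalogue of configurations with far more specialized machinery. The function name and the toy map are hypothetical.

```python
def colouring(adjacency, n_colours=4):
    """Return a dict region -> colour index, or None if no proper colouring exists."""
    regions = list(adjacency)
    assignment = {}

    def extend(i):
        if i == len(regions):
            return True  # every region has a colour
        region = regions[i]
        # Colours already taken by neighbours that have been coloured so far
        used = {assignment[n] for n in adjacency[region] if n in assignment}
        for colour in range(n_colours):
            if colour not in used:
                assignment[region] = colour
                if extend(i + 1):
                    return True
                del assignment[region]  # undo and try the next colour
        return False

    return dict(assignment) if extend(0) else None


# Hypothetical toy map: A, B, C, D all border each other, and E borders only D
toy_map = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C", "E"],
    "E": ["D"],
}
result = colouring(toy_map)
print(result)
```

The search confirms that this particular map can be coloured, but it explains nothing about why four colours are enough in general, which is exactly the kind of answer-without-insight that made the original proof uncomfortable.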


Machine learning and other computational methods are tools. And like all tools, they have pros and cons in different problem spaces. Computational biologists may be interested in understanding how to build better algorithms, but wet lab biologists usually are not. This divergence in interests is strongly apparent in how machine learning models are applied to biological problems, and it limits the transferability of computational solutions into biology labs and the clinic.

To bridge this gap, I recommend that computational biologists prioritize interpretable machine learning models that can be further dissected to provide biological insight into how and why they work well. This approach is a middle ground that can provide both computational and biological insights into biological problems, which are more likely to be impactful in the long run.

References & Footnotes

  1. Webb, S. Deep learning for biology. Nature (2018). doi: 10.1038/d41586-018-02174-z

  2. Zhang, Y. Bounded gaps between primes. Annals of Mathematics (2014). doi: 10.4007/annals.2014.179.3.7 

  3. Numberphile has a number of videos on the topic.

  4. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods 9, 215–216 (2012). doi: 10.1038/nmeth.1906

  5. Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods 9, 473–476 (2012). doi: 10.1038/nmeth.1937

  6. Greenside, P. et al. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics (2018). doi: 10.1093/bioinformatics/bty575 

  7. Movva, R. et al. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLOS ONE (2019). doi: 10.1371/journal.pone.0218073

  8. Nair, S. et al. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics (2019). doi: 10.1093/bioinformatics/btz352 

  9. More or less. Some more than others. 

  10. Appel, K. & Haken, W. Every planar map is four colorable. Part I: Discharging. Illinois Journal of Mathematics (1977). doi: 10.1215/ijm/1256049011 

  11. Appel, K., Haken, W., & Koch, J. Every planar map is four colorable. Part II: Reducibility. Illinois Journal of Mathematics (1977). doi: 10.1215/ijm/1256049012