Well-defined biology

Published: December 19, 2019   |   Updated: March 04, 2020

I’ve struggled with many things switching from mathematical physics in my undergrad to computational biology in my grad studies. One of these struggles is due to definitions.

If you’ve taken any sort of formal math course, you’re aware that definitions are aplenty, and that you can define whatever you want, however you want. Biologists also seem to take this approach - just look at how many different kinds of RNAs there are. But these definitions have always felt lacklustre to me in comparison to the definitions I typically encounter in math.

So what I want to focus on is what makes a definition good, and what it means for one to be well-defined.

What makes a good definition

Let’s start the way any good math class starts: a few examples to get a sense of what I mean.

Example: Functions

Let \( A, B \) be sets, and \( C = A \cup B \). Define \( f: C \rightarrow \{0, 1\} \) by \[ f(c) = \begin{cases} 0 & \text{if } c \in A, \\ 1 & \text{if } c \in B. \end{cases} \] What happens if \( A \cap B \neq \emptyset \)? Well…we don’t know.

Clearly if \( c \in A \setminus B \) then \( f(c) = 0 \), and if \( c \in B \setminus A \) then \( f(c) = 1 \). We can then say that the function \( f \) is well-defined over the symmetric difference \( A \oplus B \). But if \( A \cap B \neq \emptyset \), then we don’t know what \( f(c) \) should be for \( c \in A \cap B \).

Thus, \( f \) is not well-defined if \( A \cap B \neq \emptyset \).
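Here is a minimal Python sketch of the same idea. The helper names `possible_values` and `is_well_defined` are my own, purely for illustration: the first collects every value the rule assigns to an element, and the second checks that each element of \( A \cup B \) gets exactly one.

```python
def possible_values(c, A, B):
    """Collect every value the rule assigns to c: 0 if c is in A, 1 if c is in B."""
    values = set()
    if c in A:
        values.add(0)
    if c in B:
        values.add(1)
    return values


def is_well_defined(A, B):
    """The rule defines a function on A | B only if each element gets exactly one value."""
    return all(len(possible_values(c, A, B)) == 1 for c in A | B)


A = {1, 2, 3}
B = {3, 4}

print(possible_values(3, A, B))        # the shared element is assigned both 0 and 1
print(is_well_defined(A, B))           # False, because A and B overlap
print(is_well_defined(A - B, B - A))   # True once the overlap is removed
```

The rule only fails on the overlap, which is exactly where the definition needs an extra convention.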

Example: Notation

Let \(a, b, c \in \mathbb{R} \). Instead of writing \(a^b\), we can use \( \wedge \) notation, and write it as \(a \wedge b \). The meaning is the same, we just have a different notation.

So what can we make of the statement \(a \wedge b \wedge c \)?

Is it \( a \wedge (b \wedge c) \) or \( (a \wedge b) \wedge c \)? Since exponentiation isn’t an associative operation, the statement \(a \wedge b \wedge c \) is ambiguous. One would say that this \( \wedge \) notation, without more context or convention, is not well-defined.

Notice that \( a^{b^{c}} \) is unambiguous. It doesn’t suffer the same ambiguity as our \( \wedge \) notation does and is, therefore, well-defined without the use of extra symbols.
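A quick numerical check makes the ambiguity concrete. The values below are arbitrary choices of mine; Python's `**` operator plays the role of \( \wedge \).

```python
a, b, c = 2, 3, 2

left_grouping = (a ** b) ** c    # (2^3)^2 = 8^2 = 64
right_grouping = a ** (b ** c)   # 2^(3^2) = 2^9 = 512

print(left_grouping, right_grouping)

# Python settles the ambiguity by convention: ** groups from right to left,
# so the unparenthesized expression agrees with the right grouping.
print(a ** b ** c == right_grouping)
```

Python's `**` is only unambiguous because the language fixes a grouping convention, which is the "more context or convention" the bare \( \wedge \) notation lacks.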

Example: Limits

Consider the function \(f(x) = \frac{1}{x}\). What happens to the function as \( x \rightarrow 0 \)?

[Figure: graph of \( f(x) = \frac{1}{x} \)]

If you approach 0 from the right with positive values, it’s easy to see that \( \lim_{x \rightarrow 0^+} \frac{1}{x} = +\infty \). Similarly, if you approach 0 from the left with negative values, it’s easy to see that \( \lim_{x \rightarrow 0^-} \frac{1}{x} = -\infty \).

So what is \( \lim_{x \rightarrow 0} \frac{1}{x} \)? It depends on where you start and how you get arbitrarily close to zero. In that case, it makes sense to say that \( \lim_{x \rightarrow 0} \frac{1}{x} \) does not exist, because the limit in this context is not well-defined.
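To see the two one-sided behaviours numerically, here is a small sketch that evaluates \( \frac{1}{x} \) at points shrinking towards zero from each side. The step sizes are an arbitrary choice, and a numerical table like this only illustrates the trend; it doesn't prove anything about the limit.

```python
def f(x):
    return 1.0 / x


# Step sizes shrinking towards zero: 1e-1, 1e-2, ..., 1e-5.
steps = [10.0 ** -k for k in range(1, 6)]

from_right = [f(h) for h in steps]    # approach 0 through positive values
from_left = [f(-h) for h in steps]    # approach 0 through negative values

for h, right, left in zip(steps, from_right, from_left):
    print(f"h = {h:.0e}   f(+h) = {right:.1e}   f(-h) = {left:.1e}")
```

The values from the right grow without bound while the values from the left fall without bound, so no single number can serve as the two-sided limit.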

[Image: The limit does not exist]

Interestingly, this not-well-defined-ness actually spurs its own follow-up definition: continuity. Continuous functions are those where the limit exists at every point in the domain of the function and equals the function’s value there.

Hopefully it’s clear, even through these simple examples, that thinking about when a definition or function or property is well-defined is an important concept in and of itself. Sometimes not being well-defined can lead to insights about what a better definition or function might look like.

Defining “well-defined”

The above examples show a few properties of “good” definitions that I’d like to highlight and name. Good definitions are:

  1. unambiguous with respect to different representations of the same objects (think of the written statement in the notation example)
  2. consistent with definitions from which they extend (think of the function that stems from the definition of set containment)
  3. not dependent on technicalities of the measurement (think of how the limit depended on which side you approached it from[1])

Where biological definitions get messy

From my experience, biological concepts are usually pretty good with Property 2. This makes sense, since in science you build on the work that was done before you. To do that, you need to start with what was given. What I see a lot of, though, is an issue with Property 1 (Property 3 is troublesome as well, but often relates to Property 1, so I’ll focus on Property 1 for now).

Here’s an example: I study epigenomics, a subfield of genomics, concerning gene regulation and non-genetic inheritance. An important definition in this field is that of promoters: a “regulatory element” of the genome (i.e. the nucleotides may not code for proteins themselves, but they can affect how genes are transcribed) that is found upstream of a gene’s transcription start site.

Lots of papers get published about promoters, their effects on certain genes, and how systematic changes to promoters can alter the phenotypes of cells.

Let’s refer to the Glossary in Molecular Biology of the Cell[2].

Promoter: Nucleotide sequence in DNA to which RNA polymerase binds to begin transcription

[Image: Promoter diagram]

As a definition of the concept, this isn’t a bad one. It conveys a specific idea and is unambiguous given the definitions of DNA sequences, RNA polymerase, and the process of transcription. It falls short, however, when we look at its use in practice and ask what exactly “the sequence” that RNA polymerase binds to is.

When used in genomics research papers, you find phrases like (emphasis mine):

Promoters were defined as 1 Kb upstream and 1 Kb downstream of hg19 Refgene gene transcription start sites (TSSs)[3].

Core promoters (n=20,245 genes): For every protein-coding gene, we define as core promoter the interval [-250,+250] bp from any transcription start site (TSS) of a coding transcript of the gene, excluding any overlap with coding regions. TSSs were obtained from Ensembl Genes v75 GRCh37.p13[4].

As a convenient operational definition, we refer to ‘promoter’ in this paper as the genomic region (-700, +300) bp with respect to the transcription start site (TSS)[5].

As you can see, none of these papers use a common working definition of a “promoter”, yet they proceed to make claims about promoters in general. And, as the last quote states, the inconsistent definitions of “promoter” aren’t born out of malice, but out of convenience. To define the precise location of promoter regions across the entire genome would be a monstrous task, and is likely not worth the investment required to tackle the subtlety of each instance of a “promoter”.
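To get a feel for how much these operational definitions disagree, here is a rough Python sketch that builds each window around a single TSS and measures how much of the sequence they share. The TSS coordinate, the strand, and the helper names `promoter_window` and `jaccard` are all made up for illustration. I'm also simplifying: every window is treated as a closed interval, and the "excluding any overlap with coding regions" clause of the core promoter definition is ignored.

```python
from itertools import combinations

# (upstream, downstream) offsets in bp from the TSS, one per quoted definition.
DEFINITIONS = {
    "1 kb either side": (1000, 1000),
    "core [-250, +250]": (250, 250),
    "(-700, +300)": (700, 300),
}


def promoter_window(tss, strand, upstream, downstream):
    """Return the (start, end) coordinates of the window as a closed interval.

    "Upstream" is relative to the direction of transcription,
    so the offsets swap sides on the minus strand.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


def jaccard(w1, w2):
    """Shared bases divided by total covered bases for two closed intervals."""
    overlap = max(0, min(w1[1], w2[1]) - max(w1[0], w2[0]) + 1)
    union = (w1[1] - w1[0] + 1) + (w2[1] - w2[0] + 1) - overlap
    return overlap / union


tss, strand = 100_000, "+"   # hypothetical gene, not a real annotation
windows = {
    name: promoter_window(tss, strand, up, down)
    for name, (up, down) in DEFINITIONS.items()
}

for (name1, w1), (name2, w2) in combinations(windows.items(), 2):
    print(f"{name1} vs {name2}: Jaccard = {jaccard(w1, w2):.2f}")
```

Even for this one idealized gene, no pair of definitions shares much more than half of its bases, and the widest and narrowest windows share about a quarter.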

Some other examples that are easy to point to are enhancers in biology[6] and what the “millennial” generation is in demographic studies[7].

The point is, however, that this inconsistent definition violates Property 1, and thus, isn’t a good definition. The definition may be practical and attempt to convert the conceptual “promoter” into a measurable object, but that doesn’t make it good.

It is unclear how much these differing definitions limit our ability to compare results across studies. While this is not, in my opinion, the main culprit for the ongoing issues around reproducibility in science, I don’t believe loose definitions and ill-defined constructs are beyond reproach, either.

What can be done

For published results, it’s difficult to go back and fix these issues (and it will probably never happen). But for new results, if your findings are sensitive to your precise definition, that suggests they aren’t good results to work with. If you use a window of [-500, +500] bp around a TSS for a promoter and get one result, but get a completely different result when the window is changed to [-1500, +500] bp, I’m inclined to believe it’s not a good result. At the very least, it doesn’t fit with the conceptual definition of a promoter, which is the entire point of making practical definitions in the first place. You probably need some other experiments to justify why your definition of a promoter is a good one to use.
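One cheap way to check for this kind of sensitivity is to rerun the same summary under several window definitions and look at how much it moves. The sketch below is only a toy: the feature offsets are invented numbers, the "analysis" is a plain count, and the helper names `count_in_window` and `window_sensitivity` are hypothetical. A real analysis would substitute its own statistic.

```python
def count_in_window(offsets, upstream, downstream):
    """Count features whose offset from the TSS (bp, negative = upstream)
    falls inside the closed window [-upstream, +downstream]."""
    return sum(-upstream <= offset <= downstream for offset in offsets)


def window_sensitivity(offsets, windows):
    """Return the count under each window and the spread (max - min) across windows."""
    counts = {window: count_in_window(offsets, *window) for window in windows}
    spread = max(counts.values()) - min(counts.values())
    return counts, spread


# Invented offsets of some feature (say, binding sites) relative to one TSS.
offsets = [-1400, -1200, -900, -450, -100, 20, 300, 480, 700]

# The two windows from the text, as (upstream, downstream) in bp.
windows = [(500, 500), (1500, 500)]

counts, spread = window_sensitivity(offsets, windows)
for (upstream, downstream), n in counts.items():
    print(f"[-{upstream}, +{downstream}] bp: {n} features")
print(f"spread across definitions: {spread}")
```

With these made-up numbers the count goes from 5 to 8 just by widening the upstream side, which is the kind of shift that should prompt a closer look before making claims about "promoters" in general.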

Conclusion

There’s an important point I want to make. I do not expect biology to be as rigorously defined and inscrutable as mathematics. I’m not looking for axioms of biology, or anything of that sort. Physics tried to do that, and that’s still unresolved[8].

What I am looking for, though, is a stronger application of all 3 properties, particularly Property 1. I hope that thinking about precisely where these definitions are well-defined can spur some thought into the essence of the objects one is trying to define and whether it’s worth doing.

2020-03-04 update

This entire issue of definitions, especially “biological” versus “operational” definitions of enhancers, was the subject of a recent review article[9]. It is an example of my entire argument here, and it also underlines the difference between what “is” and what “is measured”. There is active research in quantum mechanics about epistemological and ontological physical models, since the “measurement problem” is at the heart of quantum mechanics. I was exposed to these types of models and ideas during my undergraduate research with Dr. Joseph Emerson, and they’ve stuck with me ever since.

References & Footnotes