Is your null hypothesis, or your model, more likely to be rejected?

Published: March 01, 2021   |   Updated: April 07, 2021

What is a p-value? The textbook definition is, understandably, boring and uninsightful.

In statistical testing, a p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.

This definition has made a lot of people very angry and been widely regarded as a bad move1. To see why this causes problems, let’s think about what scientists are trying to do and how p-values fit into it.

An example experiment

In biology, transcription factors are important proteins that help cells transcribe genes. Transcription factors usually bind to certain sequences of DNA, called transcription factor motifs.

If you have a set of genomic locations of interest, and you want to find which transcription factors have binding sites “enriched” in these regions compared to others, it’s not too difficult to find out. Stop by your favourite motif enrichment tool, like Homer or AME, give it the sequences at your list of coordinates, and get served back a list of transcription factors.

Let’s test that one of these tools works by burying a known motif within some random DNA sequences. We’ll use the forkhead motif, TGTTTACTTTG, and generate the random sequences with the following Python code.

from numpy.random import choice, randint
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

# letters in the DNA alphabet
alphabet = ["A", "C", "G", "T"]
# make 1000 random sequences
n_seqs = 1000

seqs = [SeqRecord(
    seq=Seq(
        # random stuff before the motif (0-19 bases)
        "".join(choice(alphabet, randint(0, 20)))
        # the motif
        + "TGTTTACTTTG"
        # random stuff after the motif (20-29 bases)
        + "".join(choice(alphabet, randint(20, 30)))
    ),
    id="Seq_" + str(i),
    description="Seq_" + str(i)
    ) for i in range(n_seqs)
]

# save the random sequences to a file in FASTA format
with open("random.fa", "w") as fo:
    SeqIO.write(seqs, fo, "fasta")
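
Before uploading anything, a quick sanity check (not strictly necessary, but cheap) is to read random.fa back in with Biopython and confirm that every sequence really does contain the planted motif:

from Bio import SeqIO

motif = "TGTTTACTTTG"
# count how many of the generated sequences contain the planted motif
n_with_motif = sum(
    motif in str(record.seq) for record in SeqIO.parse("random.fa", "fasta")
)
print(f"{n_with_motif} / 1000 sequences contain {motif}")  # should be 1000 / 1000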

We can then upload random.fa to AME, which gives us the following results.

rank  motif_ID                consensus     p-value    adj_p-value
1     FOXK1_HUMAN.H11MO.0.A   TGTTTMCHTT    3.22e-825  4.78e-822
2     FOXA1_HUMAN.H11MO.0.A   TGTTTACWYWGB  1.35e-817  1.57e-814
3     FOXA2_HUMAN.H11MO.0.A   TGTTTACWYWGB  6.56e-811  7.50e-808
4     FOXM1_HUMAN.H11MO.0.A   TGTTTRCTYWKB  6.56e-811  9.00e-808
5     FOXJ2_HUMAN.H11MO.0.C   TGTTTRTTTW    1.05e-791  1.22e-788

The good news is that we get the forkhead motif back. The bad news is that AME is way too confident about its findings.

For reference, there are \(52! \approx 8 \times 10^{67}\) ways to shuffle a deck of playing cards and there are ~\(10^{80}\) particles in the universe2. AME calculated a p-value \(\approx 10^{-822}\). What does this value even mean? AME is saying that the probability of seeing this much forkhead enrichment by chance in the sequences I provided is less than the probability of correctly identifying a single particle out of the entire universe. If that doesn’t raise some red flags, I don’t know what will.
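
To put those numbers side by side, here is a quick back-of-the-envelope calculation using nothing but Python’s math module:

import math

# log10(52!) ~ 67.9, i.e. 52! ~ 8 x 10^67 ways to shuffle a deck of cards
log10_deck_shuffles = math.lgamma(53) / math.log(10)
log10_particles = 80     # ~10^80 particles in the universe
log10_ame_pvalue = -822  # AME's adjusted p-value, ~10^-822

print(f"log10(52!)   ~ {log10_deck_shuffles:.1f}")
print(f"particles    ~ 10^{log10_particles}")
print(f"AME p-value  ~ 10^{log10_ame_pvalue}")
# the p-value sits about 742 orders of magnitude below 1 in 10^80
print(f"orders of magnitude below 1/10^80: {abs(log10_ame_pvalue) - log10_particles}")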

Physics, which has a general “five sigma” rule of thumb, doesn’t expect results that stringent3. And physics has better quantitative models of reality than biology does. So why do these p-values show up in biology, and why are they in scientific publications?
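
Concretely, the five-sigma threshold corresponds to a one-sided tail probability of about \(3 \times 10^{-7}\) (see footnote 3); assuming scipy is available, one line reproduces that number:

from scipy.stats import norm

# probability of a standard normal landing more than 5 standard deviations out
print(norm.sf(5))  # ~2.87e-07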

Being so certain in a result is absurd, since there are other factors in your experiment that will likely give you more noise than you anticipate. This p-value is so small that it loses all meaning. P-values like these can only exist in some ideal world that doesn’t reflect reality at all4. It should be clear that producing p-values this small is itself a telltale sign of a bad statistical model. That isn’t necessarily fatal, since a bad model can still be a useful model, but it certainly doesn’t model reality accurately.

Are you really that certain that your experimental design, your experimental conditions, your data collection, and your statistical analysis are perfect and that your null hypothesis should be rejected? Again, I would like to stress that I made this dataset out of 1000 randomly generated sequences. If routine, everyday experimental observations are reported as less likely than your ability to correctly pick out a single particle from the entire universe, then I’m willing to bet that there’s something wrong with your model. Because the universe is really, really big5. And I’m pretty sure you’re not that good at Where’s Waldo.

What do you know, and what do you not know?

The philosophy of science has a long history. Currently, falsifiability is in vogue. A hypothesis should make predictions and those predictions should be testable by empirical evidence. If the evidence goes against the hypothesis, the hypothesis has been falsified in some way. But how does this translate into data and stats?

If we want to figure out how something works, we propose experiments where if the hypothesis is true, something happens, and if the hypothesis is false, nothing happens. This is how we can turn a hypothesis into a falsifiable experiment. But how can we be sure that “something” happens? That’s where statistics comes in.

Falsifiable statistics

Let’s call the scientific model that best describes our system of interest \(M\). If we have some new hypothesis, \(N\), that we want to test, the first step in our process is to make two models of the world.

  1. “Null” model: \(H_0 = M \wedge \neg N\) (everything we know so far and our new hypothesis is false)
  2. “Alternative” model: \(H_a = M \wedge N\) (everything we know so far and our new hypothesis is true)

This makes a falsifiable scenario for our experiment.

Our second step is to convert this experiment into a statistical test. Under the null model, some statistic that we derive from the data in our experiment will have some known distribution. Let’s call this test statistic \(T\) with distribution \(\mathcal{T}\).

Our third step is to calculate our observed value of the test statistic from the data. Let’s call this observed value \(t\).

Our fourth and final step is to calculate how likely it is, assuming the null model, to observe a test statistic at least as extreme as \(t\). That gives us a p-value, which is defined above. Mathematically, that looks like this:

\[p = \mathbb{P}[T \geq t | H_0]\]

In plain English, it looks more like this:

\[p = \mathbb{P}[\text{test statistic is at least as extreme as what we observed} | \text{current scientific model is true and new hypothesis is false}]\]

If the data we see is very unlikely, then we have falsified \(H_0\) and can reject it 6. Mathematically, this means \(p < \epsilon\), where \(\epsilon\) is some really small value we define before the experiment.

To summarize, we test hypotheses with the following steps:

  1. Propose a falsifiable hypothesis
  2. Create a falsifiable experiment to test the hypothesis
  3. Convert the design of the experiment into a statistic
  4. Perform the experiment and calculate the p-value
  5. Reject the negation of your hypothesis if \(p < \epsilon\)
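
To make these steps concrete, here is a minimal sketch of steps 2 through 5 on the motif example. This is a toy calculation, not what AME actually does: it assumes purely random DNA as the null model, uses the number of motif-containing sequences as the test statistic, and gets the tail probability from scipy’s binomial test (scipy >= 1.7).

from scipy.stats import binomtest

# Steps 2-3: under H0 (purely random DNA), a fixed 11-mer starts at any given
# position with probability 4**-11, so a ~45 bp sequence contains it with
# probability roughly (45 - 10) * 4**-11. The test statistic T is the number
# of sequences (out of 1000) containing the motif: T ~ Binomial(1000, p_hit).
p_hit = (45 - 10) * 4 ** -11

# Step 4: the observed value t (every generated sequence contains the motif)
t_observed = 1000

# p = P[T >= t | H0]; with the motif planted in all 1000 sequences this
# underflows to 0.0 -- the same kind of absurdly small number AME reported
result = binomtest(t_observed, n=1000, p=p_hit, alternative="greater")
print(result.pvalue)

# Step 5: compare against an epsilon chosen before the experiment
epsilon = 1e-3
print("reject H0" if result.pvalue < epsilon else "fail to reject H0")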

Issues with the null hypothesis testing framework

Explicitly laying out this framework lets us see where problems come up: namely, in translating each step into the next.

1 -> 2:

  • Falsifiable hypotheses require a certain type of experiment.
  • You can come up with multiple contradictory hypotheses that produce the same observations for a particular experiment, so it’s important that the experiment strongly maps back to testing only \(N\), only the new part of the hypothesis you are trying to test.
  • If you’re not considering data that will directly pit the two hypotheses against each other, then you can find multiple statistical tests that correspond to tests of these different hypotheses. You also may not have the know-how or the technology needed to answer that question at the moment.
  • Even if you are able to come up with an experiment, it may not be falsifiable.

2 -> 3:

  • Statistical tests and scientific hypotheses are not in a 1:1 correspondence.
  • You can come up with multiple statistical tests corresponding to the same scientific hypothesis (see the sketch after this list). These will give different statistical distributions, with different powers, discovery rates, etc.
  • You may not be able to convert the experiment into a simple statistic that obeys a nice analytical distribution. This may make it harder to interpret, model, calculate, etc. If this is the case, don’t be surprised if some scientists use less-accurate-but-more-understandable models.
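
To illustrate the first two bullets above, here is a small sketch (with a made-up 2x2 table, not real data) in which the same scientific question, “is the motif enriched in group A compared to group B?”, is answered by two perfectly reasonable statistical tests that return different p-values:

from scipy.stats import chi2_contingency, fisher_exact

# same question, same data, two different statistical tests
table = [[30, 70],  # group A: sequences with / without the motif
         [15, 85]]  # group B: sequences with / without the motif

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"Fisher's exact test: p = {p_fisher:.4f}")
print(f"Chi-squared test:    p = {p_chi2:.4f}")
# the p-values differ even though the hypothesis and the data are identical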

3 -> 4:

  • You may overlook important factors in the experiment that influence your statistic in key ways. This disconnect between statistical model and reality hopefully can be ignored, but sometimes this confounding rears its ugly head 7.
  • Non-independence of features may play a strong confounding role, leading to suspicious distributions of p-values 8.
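
To see how non-independence can produce a suspicious-looking set of p-values, here is a small simulation (numpy and scipy only, everything made up): each feature shares a single latent factor, every null hypothesis is true, yet the fraction of “significant” results swings wildly from run to run.

import numpy as np
from scipy.stats import ttest_1samp

n_samples, n_features, rho = 20, 2000, 0.8

def frac_significant(rho, seed):
    # simulate n_features features measured on n_samples samples; every feature
    # truly has mean 0, so every null hypothesis is true
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=n_samples)                # latent factor shared by all features
    noise = rng.normal(size=(n_features, n_samples))   # independent noise
    data = rho * latent + np.sqrt(1 - rho ** 2) * noise
    pvals = ttest_1samp(data, popmean=0, axis=1).pvalue
    return np.mean(pvals < 0.05)

for seed in range(5):
    print(
        f"seed {seed}: "
        f"independent features -> {frac_significant(0.0, seed):.3f} significant, "
        f"correlated features -> {frac_significant(rho, seed):.3f} significant"
    )
# with independent features the fraction hovers around the nominal 0.05; with a
# shared latent factor it swings wildly between runs, which is what a
# "suspicious" p-value distribution looks like in a single experiment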

4 -> 5:

  • Extremely small p-values, like the ones mentioned above in motif enrichment, may need extra information to be interpreted correctly.
  • Considering the effect size (e.g. fold change in differential analyses in computational biology) and positive/negative controls within your data may be necessary to ensure what you’ve rejected with your null hypothesis testing framework actually makes sense.
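
As a toy illustration of the bullet above (made-up numbers): with a large enough sample, a t-test will hand you a vanishingly small p-value even when the effect size is biologically meaningless.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# two huge groups whose means differ by a biologically negligible amount
control = rng.normal(loc=100.0, scale=10.0, size=500_000)
treated = rng.normal(loc=100.2, scale=10.0, size=500_000)

result = ttest_ind(treated, control)
fold_change = treated.mean() / control.mean()

print(f"p-value:     {result.pvalue:.2e}")  # tiny, thanks to the enormous sample size
print(f"fold change: {fold_change:.3f}")    # ~1.002, i.e. essentially no effect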

Conclusions

All in all, it’s important to remember that you are human and humans make mistakes. This includes your plans, your models, your experiments, and your interpretation of observations. It all comes out in statistical hypothesis testing, too.

We (scientists and non-scientists) often talk about p-values as if they determine what is true and what is not, but that’s simply not the case. The way we talk about p-values also confuses what they actually are.

A p-value defined by

\[p = \mathbb{P}[\text{test statistic is at least as extreme as what we observed} | \text{current scientific model is true and new hypothesis is false}]\]

is very different from this other value:

\[p' = \mathbb{P}[\text{seeing something as outlandish as your observation}]\]

which is also very different from this other value:

\[p'' = \mathbb{P}[\text{seeing something as outlandish as your observation} | \text{your scientific hypothesis is true}]\]

Clearly these are not the same. Clearly \(p \ne p' \ne p''\).

But it’s easy to make the mistake that they are the same or that these necessarily correspond to some scientific hypothesis. It shouldn’t come as a surprise that translating scientific hypotheses into the “right” statistical test is not straightforward. It’s a bit of an art form. It also shouldn’t come as a surprise that much effort in data analysis goes into finding the “right” model.

What constitutes the “right” model, you may ask? It’s tough to know. But it’s often easier to come up with “wrong” things that should definitely raise some red flags to be wary of, like extremely small p-values.

References & Footnotes

  1. Misusing p-values, a la Douglas Adams 

  2. How many particles in the Universe? - Numberphile 

  3. If you have a normal distribution, the probability that you observe a value more than 5 standard deviations below the mean is \(\approx 3 \cdot 10^{-7}\). Physics experiments that produce a p-value less than this meet the gold standard for hypothesis testing. See this article for more. 

  4. This concern is not new, especially in computational biology. See this article by Wolfgang Huber, advocating for reasonable interpretation of p-values, and how extreme results should be tempered by the experimental and analytical methods used to produce them. 

  5. The Scale of the Universe 

  6. It’s important to note that through this method, we reject \(H_0\), we do not accept \(H_a\). This keeps our entire process in line with the falsifiable framework of modern science. 

  7. The entire study of batch effects and how to correct for them in high-throughput sequencing is a painful lesson that biologists, both wet- and dry-lab, can’t forget. See this review for some examples. Newer methods have been published since then, such as this method for modelling batch effects in single-cell RNA-seq data, since newer technologies continually seem to run into the same problems. 

  8. I often refer to this blog post by David Robinson about how to interpret the results of your experiments by looking at the distribution of p-values, post hoc. This is often a useful step to take before looking at any single result. If you see a uniform, bimodal, or otherwise unexpected distribution of p-values, it should give you pause before you go on to interpret any of the individually significant results.