What is differential analysis, anyway?

Published: October 15, 2020

Genome sequencing, since its inception in the 1970s, has revolutionized many fields in biology. Many people, even outside biology, are aware of this. But how does sequencing work as a measurement device? How does it fit into scientific experiments? And how do we go about analyzing the data it produces? I want to address these questions, with a focus on the last one.

“Differential analysis” is the generic term for identifying differences in sequencing data. It comes in many forms and is used in many applications. But its statistical formulation appears complicated, and it is often difficult for newcomers to biology and bioinformatics to understand.

In this post, I want to break down the motivation behind differential analysis and explain where the complicated statistics comes from. I want to do this to shed some light on why things are as complicated as they are, and not just gloss over the details by saying something like “just use {insert package name}, it takes care of the details”. Details are important, especially in science, and we need to understand what’s happening so that we’re not just walking blindly.

1. What are we trying to do with sequencing data?

Studying a phenomenon, scientifically, involves comparing some system you’ve designed to a system that occurs naturally. An easy example is with the case of a drug treatment. Say that a group of people has a particular disease, and we think a new drug can help them. The question we ask is “do they get better when they take this new drug?” We can measure a bunch of things about the patients before and after treatment and compare them to see if things improved for the patients.

[Figure: Treating patients with a drug to treat their disease]

The same idea is applied with scientific experiments involving sequencing data. We have a couple conditions we want to compare, say a “control” and “treatment” group. We measure a bunch of features related to the DNA or RNA of samples in our groups and compare them. Our goal is to identify which features change according to the group they’re in.

[Figure: Setup for biology experiments using sequencing data]

We are uncertain about which biological features change, and we want to become less uncertain about it. To do this, we’re going to want to use statistics. We need to figure out a way to translate our abstract biological features and measurements into something quantitative that we can work with. This starts with thinking about how we’re going to model the situation in front of us.

2. Making a model for the experiment

In broad strokes, we have some features we want to measure. A feature could be the accessibility of chromatin at a particular position in the genome, the abundance of a particular mRNA transcript, or almost anything else. Let’s call what we measure the concentration of each feature. We don’t need to nail down exactly what we mean by concentration just yet. We want to compare our treatment group(s) to our control group, so we’ll want to have some idea of the concentration of each feature in the control group. The “treatment group” can be defined by a complicated mix of factors, like “gender and height” or “cell type and age”, so we can generalize this idea and call it a biological condition. We want to know what effect this condition has on the concentration of each feature. We can outline this idea with a simple equation:

\[\begin{align*} \{\text{concentration of feature }f\} &= \{\text{concentration of } f \text{ in controls} \} \\ &+ \{ \text{biological condition } c\} \times \{ \text{effect of condition } c \text{ on feature } f \} \end{align*}\]

To use some notation to keep things concise:

\[y_f = \beta_{f,0} + x^T \beta_f\]

where \(y_f\) is the concentration of feature \(f\), \(\beta_{f,0}\) is the concentration of feature \(f\) in the controls, \(x\) is a vector containing information about the biological condition of a sample, and \(\beta_f\) is a vector containing the effects of each condition on the concentration of feature \(f\).

Say that there are \(F\) features and \(C\) conditions in total. This equation holds for each feature. We can combine all those equations together by treating each feature as a row in a vector. Our big combined equation then looks like this:

\[Y = B x\]

We can summarize the variables like so:

| Variable | Value | Units | Description |
| --- | --- | --- | --- |
| \(Y \in \mathbb{R}^F\) | \(\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_F \end{bmatrix}\) | “concentration” units | The concentration of each feature |
| \(x \in \mathbb{R}^{C + 1}\) | \(\begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_C \end{bmatrix}\) | “condition” units | The design matrix containing information about the conditions |
| \(B \in \mathbb{R}^{F\times(C + 1)}\) | \(\begin{bmatrix} \beta_{1, 0} & \beta_{1, 1} & \dots & \beta_{1, C} \\ \beta_{2, 0} & \beta_{2, 1} & \dots & \beta_{2, C} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{F, 0} & \beta_{F, 1} & \dots & \beta_{F, C} \\ \end{bmatrix}\) | “concentration” per unit “condition” | The effect matrix containing information about the effect of each condition on the concentration of each feature |

We implicitly make \(x_0 = 1\) to keep the equation concise and to always have a place for the baseline control concentration \(\beta_{f,0}\).
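
To make the notation concrete, here is a toy numerical sketch in Python (the numbers are invented purely to illustrate the shapes involved):

```python
import numpy as np

# Toy example: F = 2 features, C = 1 condition (0 = control, 1 = treatment).
# All numbers are made up purely to illustrate the shapes involved.

# Effect matrix B: the first column is the control (baseline) concentration of
# each feature, the second column is the effect of the treatment condition.
B = np.array([
    [5.0,  2.0],   # feature 1: baseline 5, goes up by 2 under treatment
    [3.0, -1.5],   # feature 2: baseline 3, goes down by 1.5 under treatment
])

x_control   = np.array([1.0, 0.0])   # x_0 = 1 always; treatment indicator off
x_treatment = np.array([1.0, 1.0])   # x_0 = 1 always; treatment indicator on

print(B @ x_control)    # [5.  3. ]  -> just the baselines
print(B @ x_treatment)  # [7.  1.5]  -> baselines plus the treatment effects
```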

Now that we have an equation to work with, we can try and restate our biological problem in mathematical terms. Our goal, mathematically, is to find all elements of the matrix \(B\) (i.e. \(\beta_{f,c}\) where \(c \ne 0\)) that are not 0. These are the conditions \(c\) that have some measurable, non-zero effect on feature \(f\).

\[\begin{align} \text{Identify which features differ } & \text{as a result of the biological condition} \\ &\Bigg\Updownarrow \\ \text{Find all } \beta_{f, c} &\ne 0 \text{ for } c \ne 0 \end{align}\]

Everything that follows from here is about building a statistical model for \(Y\) to make it reflect the realities of our sequencing data and sources of noise in measurement. We are going to figure out how to fit \(B\) to our data to identify as many non-zero effects as possible while not fooling ourselves and falsely detecting effects that aren’t real and are a result of the noise in our measurements.

Because I’ve described the situation so broadly, there are still lots of things that we haven’t specified yet:

  1. If we could measure anything, what is it that we’d want to measure?
  2. What exactly is concentration? What is the thing that we can measure with our sequencing machines and how does it relate to what we want to measure?
  3. What statistical model are we going to fit to the data? How can we know that the fit is accurate?

Let’s jump into the details to try and answer these questions and build up our statistical model.

3. Connecting measurements with biology

If technology or chemistry were no obstacle and we wanted to measure the state of the DNA or RNA inside a cell, we would probably want our concentration to be something like the number of molecules per nucleus. This would give us a cell-specific measurement of the amount of DNA or RNA inside each nucleus. From here, we could easily talk about the copy number of different segments of DNA, the absolute expression of a given gene, and so on.

But the technology of sequencing machines is limited. We can’t get such fine-grained detail about the molecules inside cells (at least, not yet). So the first thing we’ll need to look at is roughly how sequencing machines work and what the data we get out of them looks like.

3.1 What data does a sequencing machine produce?

Briefly, sequencing machines that use short-read technology, such as those made by Illumina, follow these steps.

  1. Break up the DNA or RNA molecules in your cells into small fragments, typically a few hundred base pairs in size.
  2. Attach these fragments to a substrate called a flow cell, and have them give off light of a certain colour to report which base is at each position in the fragment.
  3. Save these sequences of base calls, called reads, in a FASTQ file.
  4. To find out what position in the genome each read came from, use a sequence alignment algorithm to align these reads to the reference genome.

This gives us a measure of the total number of reads that correspond to a certain feature of the genome¹.
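
To sketch how per-feature counts arise from aligned reads, here is a deliberately naive Python example. Real pipelines use dedicated counting tools (e.g. featureCounts or htseq-count) that handle strandedness, paired ends, and ambiguous reads; the intervals and read positions below are made up.

```python
import numpy as np

# Hypothetical feature intervals on a single chromosome: name -> (start, end).
features = {
    "A": (1_000, 2_500),
    "B": (5_000, 5_800),
    "C": (9_000, 12_000),
}

# Hypothetical aligned read positions (e.g. the 5' end of each read).
read_positions = np.array([1_050, 1_900, 5_100, 9_500, 9_501, 11_000])

# Count reads falling inside each feature interval.
K = {
    name: int(np.sum((read_positions >= start) & (read_positions < end)))
    for name, (start, end) in features.items()
}
print(K)  # {'A': 2, 'B': 1, 'C': 3}
```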

[Figure: A high-level overview of how sequencing data is made and how it is quantified]

There is a lower limit on the number of molecules that can be detected, and most sequencing experiments don’t give measurements per cell, but per collection of cells². Our first insight, then, into connecting our ideal measurement with real sequencing data is that concentration will be in relative units, relative to some unknown lower detection threshold.

Connection 1: Concentration is a relative unit. It is defined relative to some unknown, lower detection threshold of the sequencing machine.

The second insight we can see is that if we follow steps 1-4, above, we will get counts of the number of reads corresponding to each feature directly from the sequencing machine. Features can be whatever we want. They can be particular transcripts, they can be sites of accessible chromatin, they can be splice junctions, they can be anything. As long as we can define and quantify the features appropriately, then we can use them.

Connection 2: The raw data we get from sequencing machines is in the form of counts of each feature. These counts are relative to the lower detection threshold.

You may be tempted at this point to set our concentration vector, \(Y\), equal to the raw counts and be done with it. But nothing says that we have to, and we’re going to show, next, why we wouldn’t want to. Ultimately, we don’t need to make concentration equal to the raw counts, but we do need to connect them somehow. If we define \(K \in \mathbb{N}^F\) to be the vector containing the raw counts of each feature, our next immediate goal is to connect our counts, \(K\), to our currently undefined concentrations, \(Y\).

\[K \rightarrow ? \rightarrow Y\]

3.2 Connecting concentration and counts

3.2.1 Sequencing depth

An important parameter in a sequencing experiment is how many reads you want to measure. For most applications, this number is around 30 M reads per sample, but can go as high as 1 B reads per sample. This number is chosen ahead of sequencing, but it is somewhat stochastic. For example, I may want 60 M reads per sample, but have 59.7 M for one and 61.2 M for another.

How does this factor affect our model? Consider the following scenario, where there are 3 features of interest, A, B, and C.

[Figure: Different sequencing depths for different samples]

Are these two samples actually different? Maybe. Feature A in Sample 1 has three times the number of reads as Feature A in Sample 2. But Sample 1 has three times as many reads as Sample 2, across the board. This isn’t exactly a fair comparison. The lower detection limit per sample is different because the total number of reads in each sample is different. Equalizing the total number of reads, also called the sequencing depth, is important, and gives us another insight.

Connection 3: The sequencing depth of each sample must be considered.

So where in our model, \(Y = Bx\), should this go? The most obvious place would be in our design matrix, \(x\). But look again at the example above. The sequencing depth affects the measurement of every feature roughly equally (i.e. \(\beta_{f, depth} = \beta_{depth}\)). If we include it as a condition in the design matrix, what does that mean for our other conditions?

Consider a simple experiment where we have two groups, control and treatment. If we let \(x_1 =\) sequencing depth and

\[x_2 = \begin{cases} 0 & \text{if control} \\ 1 & \text{if treatment} \\ \end{cases}\]

then our model looks like

\[y_f = \begin{cases} \beta_{f, 0} + \beta_{depth} x_{depth} & \text{control} \\ \beta_{f, 0} + \beta_{depth} x_{depth} + \beta_{f, treatment} & \text{treatment} \\ \end{cases}\]

The value of \(\beta_{f, treatment}\) has to scale up or down to match the scale of \(\beta_{depth} x_{depth}\). Because each feature can have a wildly different concentration, this makes the interpretation of the coefficients much more difficult: their values depend on the sequencing depth, the total number of features you measure, and possibly other conditions if the experiment is more complicated than this.

So instead of treating sequencing depth as a specific condition, we could scale the entire equation. This would give us something like \(K \sim Y = sBx\), for some scale factor \(s\). Because we’re trying to connect \(K\) to \(Y\), let’s simplify this and put the sequencing depth on the left side of the equation.

This can be expressed as

\[\frac{1}{s}K \sim Y = Bx\]

One last question about the sequencing depth: should we scale low depth samples up, or high depth samples down? To answer this, think about the following question. If you have two baseball players, one with a batting average of 0.500 after 10 years of playing, and the other with an average of 0.950 after 2 months of playing, whose batting average do you trust more? Which one is more likely to be accurate?

The batter with more at bats is inherently more trustworthy, since you’ve seen so many examples. The average for the batter with fewer at bats is more likely to be affected by luck in the first few at bats.

Because of this, it makes more sense to scale our high depth samples down. Scaling low depth samples up would pretend we have more precision than we actually measured, whereas scaling high depth samples down keeps the ratios between features relatively intact without amplifying the noise.

Connection 4: High depth samples should be scaled down to the same depth as low depth samples.
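
To illustrate the idea, here is a minimal sketch using a simple library-size scale factor. Tools like DESeq2 and edgeR use more robust estimators (median-of-ratios, TMM), but the principle is the same; the count matrix is invented.

```python
import numpy as np

# Toy count matrix: rows = features, columns = samples.
K = np.array([
    [300, 100],
    [600, 200],
    [ 90,  30],
])

depth = K.sum(axis=0)     # total reads per sample: [990, 330]
s = depth / depth.min()   # scale factors relative to the shallowest sample: [3., 1.]
K_scaled = K / s          # scale the deeper sample DOWN

print(s)         # [3. 1.]
print(K_scaled)  # both columns are now identical: the "difference" was just depth
```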

This is the biggest between-sample factor that we need to contend with, since it affects all features. In addition to between-sample factors, though, there are also between-feature factors that need to be considered. We consider probably the most important one, feature length, next.

3.2.2 Feature length

Consider a situation where we’re looking at the concentration of three features, A, B, and C, like below.

[Figure: Multiple features of different lengths]

It should be pretty clear that features of different lengths will have different expected counts, even if every read were assigned to a position uniformly at random.

Connection 5: The length of each feature needs to be considered.

If we denote the length of each feature as \(l_f\), we can build a diagonal matrix, \(L\), and multiply the counts by its inverse to normalize them based on the length of each feature

\[\frac{1}{s}L^{-1}K \sim Y = Bx\]

where

\[L = \begin{bmatrix} l_1 & 0 & \dots & 0 \\ 0 & l_2 & \ddots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & l_F \\ \end{bmatrix}\]
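
A quick sketch of what this length normalization does, with invented numbers (here the lengths are expressed in kilobases, which only changes the units by a constant):

```python
import numpy as np

counts  = np.array([100., 200., 50.])     # depth-scaled counts for features A, B, C
lengths = np.array([1_000, 4_000, 500])   # feature lengths in bp

per_kb = counts / (lengths / 1_000)       # element-wise L^{-1} K, in counts per kb
print(per_kb)  # [100.  50. 100.] -> per base pair, B is actually the least covered
```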

3.3 Making it easy to model

Until this point, we’ve accounted for the most important and obvious factors related to measuring sequencing data. Now, we want to make this model work with some of the statistical tools we have available to us.

On the left-hand side of the model equation, we have something that derives from raw counts. But on the right-hand side we have something continuous. Counts are discrete, and much more “jumpy” than continuous variables, so it would help to smooth our counts out a little bit.

Connection 6: Counts should be smoothed to better align with the effect coefficients and to allow for general conditions.

We can achieve this easily by applying a logarithm transformation to the LHS.

\[\log_2 \left( \frac{1}{s} L^{-1} K \right) \sim Y = Bx\]

where the logarithm is applied element-wise to the resulting vector. The base of the logarithm doesn’t matter here; you can pick whatever base works for you. But like picking which side of the road to drive on, it only works if everyone agrees on the same convention (genomics has largely settled on base 2, which is why \(\log_2\) fold changes are so common).

But counts can be 0, which doesn’t play nicely with logarithms. To avoid problems, we should define an offset, \(\gamma\), and add it to each element before taking the log³.

\[\log_2 \left( \frac{1}{s} L^{-1} K + \gamma \bar{1} \right) \sim Y = Bx\]

where

\[\bar{1} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^F\]

After all this work, we can finally say enough is enough. We have

  1. made a model that accounts for the control concentration and can model single or multiple effects,
  2. connected the raw data we get to the model,
  3. accounted for important between-sample factors,
  4. accounted for important between-feature factors, and
  5. converted integer-based counts to a continuous value.

We can now reasonably say that our LHS is a good representation of our data that will fit into our model. Let’s make the final connection by removing the \(\sim\) from our equation, which stood in for our not knowing exactly where this road of model development would end. We can set the two sides equal to finish the model.

\[\log_2 \left( \frac{1}{s} L^{-1} K + \gamma \bar{1} \right) = Y = Bx\]
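
Putting the left-hand side together, here is a minimal sketch of the transform, assuming the scale factor \(s\), the feature lengths \(l_f\), and the offset \(\gamma\) are already known (the numbers are invented):

```python
import numpy as np

def transform_counts(K, lengths, s, gamma=0.5):
    """Compute log2((1/s) * L^{-1} * K + gamma), element-wise.

    K       : raw counts per feature (length F)
    lengths : feature lengths l_f (length F); L is diag(lengths)
    s       : sequencing-depth scale factor for this sample
    gamma   : offset that keeps zero counts finite under the log
    """
    K = np.asarray(K, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return np.log2(K / (s * lengths) + gamma)

# Made-up numbers, just to show the shape of the result.
K = np.array([300, 0, 90])
lengths = np.array([1_500, 800, 3_000])
print(transform_counts(K, lengths, s=3.0))  # finite values, even for the zero count
```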

3.4 Accounting for randomness

Despite all this talk about statistics, we haven’t actually talked about randomness in this model yet. We’ve connected the raw data to our model, but we don’t have any indication of where sampling error or measurement noise come in. To do this, we’re going to look at the model itself.

While we haven’t explicitly talked about it, this model we’ve developed fits perfectly under the generalized linear model (GLM) framework. The linear part refers to the design matrix and coefficients on the RHS of the model, and the generalized part refers to the non-trivial relationship between the raw counts, \(K\), and our concentration vector, \(Y\). GLMs require three components:

  1. A linear predictor \(Y = Bx\)
  2. A link function, \(g\), such that \(\mathbb{E}[K \vert x] = \mu = g^{-1}(Y)\)
  3. An exponential family of probability distributions

We started with the linear predictor and built a realistic link function, above. And we can now start to see where the randomness comes in. Given the design matrix for a sample, the expected counts are given by applying the inverse link function to the linear predictor; the randomness lies in how the observed counts scatter around that expectation.

GLMs have been derived to work for exponential family distributions because of their wide applicability. We don’t know ahead of time that forcing ourselves to only consider these types of probability distributions for the counts is a great decision. But statistics is all about building realistic and useful models, not perfect ones⁴. So we can try this out, see if it works, and if it doesn’t, try something else.

What probability distribution should we choose, then? Binomial distributions are often used for modelling count data, so we could start there. Sequencing usually involves millions of reads, though, and binomials get difficult to calculate at that scale. A binomial can be approximated by a Poisson distribution when the number of trials is large and the per-trial probability is small, so that might be better. But a Poisson distribution has the same mean and variance, which might be too restrictive an assumption. A negative binomial distribution has a more flexible relationship between the mean and variance, so this might be a sweet spot between simplicity and flexibility.

These questions were debated at length in scientific publications for a number of years. Here is a review of just some of the dozens of papers and methods related to modelling RNA-seq data from that period. Suffice it to say, many researchers settled on using the negative binomial distribution, parameterized by mean \(\mu_f\) and dispersion \(\alpha_f\). There are subtler differences between these many models, but for simplicity, we can work with this common choice.

\[K \sim \mathcal{NB}(\mu, \alpha)\]

With this, we now have all three components of a GLM.
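
For reference, this mean/dispersion parameterization implies \(\mathrm{Var}[K_f] = \mu_f + \alpha_f \mu_f^2\), and it maps onto scipy’s \((n, p)\) parameterization via \(n = 1/\alpha\) and \(p = n / (n + \mu)\). A quick sanity check with arbitrary numbers:

```python
from scipy import stats

mu, alpha = 100.0, 0.2   # arbitrary mean and dispersion

# Convert (mu, alpha) to scipy's (n, p) parameterization of the negative binomial.
n = 1.0 / alpha
p = n / (n + mu)

dist = stats.nbinom(n, p)
print(dist.mean())           # 100.0
print(dist.var())            # 2100.0
print(mu + alpha * mu**2)    # 2100.0 -- matches mu + alpha * mu^2
```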

4. Summarizing the model

It’s been a bit of a winding road to get here, so let’s summarize where we are. Our model for the differential sequencing data is given by the following equation:

\[\log_2 \left( \frac{1}{s} L^{-1} K + \gamma \bar{1} \right) = Y = Bx\]

with our raw feature counts, \(K\), distributed as a negative binomial \(K \sim \mathcal{NB}(\mu, \alpha)\).

The variables in the equation are as follows:

Knowns:

| Variable | Value | Units | Description |
| --- | --- | --- | --- |
| \(L \in \mathbb{R}^{F \times F}\) | \(\begin{bmatrix} l_1 & 0 & \dots & 0 \\ 0 & l_2 & \ddots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & l_F \\ \end{bmatrix}\) | bp | The genomic length of each feature |
| \(\gamma \in \mathbb{R}\) | 0.5, for example | counts per bp | Logarithm offset |
| \(x \in \mathbb{R}^{C + 1}\) | \(\begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_C \end{bmatrix}\) | “condition” units | The design matrix containing information about the conditions |

Observables:

| Variable | Units | Description |
| --- | --- | --- |
| \(K \in \mathbb{N}^F\) | Counts | Read counts for each feature (one vector for each of the \(n\) samples) |
| \(s \in \mathbb{R}\) | Unitless | Sequencing depth normalization scale factor (derived from \(K\)) |

Unknowns:

| Variable | Units | Description |
| --- | --- | --- |
| \(\mu \in \mathbb{R}^F\) | Counts | Mean count vector for each feature |
| \(\alpha \in \mathbb{R}^F\) | 1/Counts | Dispersion vector for each feature |
| \(B = \begin{bmatrix} \beta_{1, 0} & \beta_{1, 1} & \dots & \beta_{1, C} \\ \beta_{2, 0} & \beta_{2, 1} & \dots & \beta_{2, C} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{F, 0} & \beta_{F, 1} & \dots & \beta_{F, C} \\ \end{bmatrix} \in \mathbb{R}^{F\times(C + 1)}\) | “concentration” per unit “condition” | The effect matrix containing information about the effect of each condition on the concentration of each feature |

This model holds for each of our \(n\) samples, so we have \(n\) versions of the equation above. All in all, this leaves us with a system of equations with \(F \times n\) knowns and \(F \times (C + 3)\) unknowns. To have a hope of solving this system, we need at least \(n \ge C + 3\) samples. Moreover, to separate out each condition in our design matrix, we need a design matrix of full rank, across all samples.

Once we fit our model to our data, we need to assess whether the coefficients in \(B\) are significantly different from 0. This involves performing some hypothesis testing, often a Wald test, for each of our coefficients. To do this well requires accurately estimating the variance of each feature, which is where our discussion of appropriate sample sizes comes in.
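
To make that concrete, here is a bare-bones sketch of fitting a negative binomial GLM to a single feature with statsmodels and reading off the Wald p-value for the treatment coefficient. This is not how DESeq2 or edgeR proceed (they estimate dispersions carefully and shrink estimates across features), the dispersion is assumed known, statsmodels uses a natural-log link rather than the \(\log_2\) above, and every number is invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

# One feature, 6 samples: 3 control, 3 treatment (all numbers invented).
counts = np.array([52, 47, 60, 130, 118, 141])
s      = np.array([1.0, 0.9, 1.1, 1.0, 0.95, 1.05])   # depth scale factors
length = 2_000                                          # feature length in bp

group = np.array([0, 0, 0, 1, 1, 1])   # 0 = control, 1 = treatment
X = sm.add_constant(group)              # columns: intercept, treatment indicator

# Depth and length enter as an offset on the (natural-log) linear predictor,
# so the coefficients describe concentration rather than raw counts.
offset = np.log(s * length)

model = sm.GLM(counts, X,
               family=sm.families.NegativeBinomial(alpha=0.1),  # alpha assumed known
               offset=offset)
fit = model.fit()

print(fit.params)    # [beta_0, beta_treatment] on the natural-log scale
print(fit.pvalues)   # Wald p-values; the second entry tests beta_treatment == 0
```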

Fitting the model is harder than it seems, and is worth its own post. Suffice to say that Monte Carlo and Bayesian shrinkage methods will be very helpful. We can conclude this discussion here, though, now that we’ve derived a statistical model and procedure for answering our biological question.

Conclusions

We’ve taken a purely biological problem (finding biological differences between test conditions), converted it into a mathematical problem (finding non-zero coefficients in an equation), and developed a statistical framework for arriving at answers we can be reasonably confident in (hypothesis testing on the coefficients).

\[\begin{align} \text{What are the biological differences} & \text{ between our conditions?} \\ & \Bigg\Updownarrow \\ \text{Which } \beta_{f,c} &\ne 0 \text{ for } c \ne 0? \\ & \Bigg\Updownarrow \\ \text{Hypothesis test on } \beta_{f,c} & \text{ for } c \ne 0 \text{ after fitting a GLM to the data} \\ \end{align}\]

This covers the motivation and derivation for many commonly-used statistical models for differential analysis in genomics. The ideas presented here come from models like DESeq2, edgeR, limma, and sleuth. There is more to every facet of what I’ve described here, and different research groups have made different design choices. To actually perform differential analysis, we need to fit the model (namely \(B\), \(\mu\), and \(\alpha\)) to the data, and determine which elements of \(B\) are statistically significantly different from 0. But fitting the model is a complicated computational task in and of itself, and is a topic I’ll leave for another time.

The purpose of this post was to make the models that we use for performing differential analysis more approachable. The math we use here is not complicated for no reason; it comes from simple ideas extracted from the problem and represented in a mathematical way. By outlining what we have and what we want, hopefully the framework as a whole is easier to understand.

Footnotes

  1. Like all things in science, each of these steps can be tinkered with. While short-read sequencing is by far the most common in academia, long-read sequencing technologies exist, and alignment of long reads looks very different from alignment of short reads. There are also ways of quantifying reads that don’t explicitly use alignment; see the pseudo-alignment algorithm defined in the Kallisto paper.

  2. There are also ways to tag molecules prior to sequencing that can give you information about which cell they come from. This makes the measurements still relative to some lower detection threshold, but a threshold defined per cell, not per collection of cells. See some single-cell sequencing papers from the mid 2010s to get an idea of how this is done. A few examples are Cusanovich et al., Science, 2015; Tang et al., Nature Methods, 2009; and Kolodziejczyk et al., Molecular Cell, 2015.

  3. We don’t need an explicit value for \(\gamma\) right now, but we should pick one that stabilizes small counts but doesn’t swamp them. Often, \(\gamma = 0.5\), or something close to 0.5, is selected. 

  4. “All models are wrong, but some are useful”, George Box