The cleverness behind Tajima's D statistic

Published: December 04, 2018   |   Read time:

I recently finished reading an important paper in evolutionary theory: Fumio Tajima’s 1989 paper, “Statistical Method for Testing the Neutral Mutation Hypothesis by DNA Polymorphism” 1. I want to highlight how clever the derivation of Tajima’s $$D$$ statistic is, and a great idea he puts forward in his 1989 paper.

Context

The theory of evolution had been in the scientific mindset for 130 years, the knowledge of DNA as the transmitter of heritable information just over 40, and the ability to sequence DNA for just over a decade. The exploration of selective pressures and neutral genetic variation was an active area of research, and Tajima had recently published a groundbreaking paper in 1983, titled “Evolutionary Relationship of DNA Sequences in Finite Populations”2.

This 1989 paper is a follow-up to that 1983 paper, and explores the ways in which we can test the neutral mutation hypothesis in a population of individuals.

A statistical hypothesis test involves the comparison of an observed test statistic to its theoretical distribution under your hypothesis of interest (the null hypothesis). To test whether a population exists in the absence of selective pressure (i.e. whether a population is full of neutral mutations or not), we need to develop a test statistic that fits this measurement, and find its corresponding distribution under the null hypothesis.

I’m only going to focus on one of the main results of this paper because I want to emphasize an extremely clever technique that make this whole hypothesis test possible. Let’s jump right in to see how it’s done.

Deriving the test statistic

The paper starts off by introducing genetic variation in natural populations, but the key motivation for this first section isn’t until the 5th paragraph:

The remarkable and important difference between the number of segregating sites and the average number of nucleotide differences is the effect of selection.

This is the topic that this paper is about, and this is what effect Tajima is ultimately trying to investigate. To do this, he introduces two observables: the number of distinguishing sites between mutants, denoted by $$S$$, and the average number of pairwise nucleotide differences between DNA sequences of different mutants, which I’ll denote by $$\bar{k}$$. Both of these estimators have the property that their expected values are directly proportional to some easily measured value, $$M \equiv 4N\mu$$, where $$N$$ is the effective population size and $$\mu$$ is the (assumed) constant mutation rate per generation per DNA sequence under observation. Namely,

\begin{align} \mathbb{E}\left[ S \right] &= a_1 M \\\\ \mathbb{E}\left[ \bar{k} \right] &= M \end{align}

We now have two estimators for the same quantity (since $$a_1$$ is a constant, $$\frac{S}{a_1}$$ is an estimator for $$M$$). He demonstrates how there exists a bias between these two estimates:

Since the number of [distinguishing] sites ignores the frequency of mutants, this number might be strongly affected by the existence of deleterious mutants. On the other hand, the existence of deleterious mutants with low frequency does not affect the average number of nucleotide differences very much, since in this case the frequency of mutants is considered. In other words, if some of the mutants observed have selective effects, then the estimate of $$M$$ obtained from [Eqn.] (5) by using the number of [distinguishing] sites may not be the same as the average number of nucleotide differences which also is the estimate of $$M$$.

This is where we will take advantage of these two different estimators. They do not share the same bias, and the bias between them is a proxy for the effect of selection on these mutations.

Let us, then, consider a new quantity, $$d \equiv \bar{k} - \frac{S}{a_1}$$. What do we know, and what do we not know, about $$d$$? We know that $$\mathbb{E}[d] = 0$$, by construction. It would be extremely convenient, and make things mathematically simple, if we could work with a test statistic with mean 0 and variance 1, but we don’t yet know the variance of $$d$$. However, if we estimate $$Var[d]$$ (call this estimate $$v$$), then we can normalize $$d$$ and create a new statistic. Let

$$$D \equiv \frac{d}{\sqrt{v}}$$$

Then $$\mathbb{E}[D] = 0$$ and $$Var[D] = 1$$. This is the famous Tajima’s $$D$$ statistic.

He then goes on to show that $$D$$ is bounded and attempts to fit a reasonable theoretical distribution to this statistic via simulations, with the constraint that the mean is 0 and the variance is 1. By doing this, he turned two estimators that weren’t specifically designed for this question into a statistic that produces an hypothesis testing framework for an entirely new question. This is genius, and is something that I love about this paper.

Summary

I want to emphasize how we got here. We started with two estimators that estimated the same quantity (but wasn’t the quantity we actually wanted to study). Tajima then showed that an unshared bias between these estimators is the effect of selection on the population (which is the quantity we wanted to study). By combining these two estimators in such a way that preserves that unshared bias, we now have a test statistic that is a proxy for the effect of selection on the population as a whole.

Let me spell this out more explicitly and more generally: by selecting two estimators for the same value, you can generate a test statistic for an unrelated quantity you’re actually interested in out of ones that you aren’t.

• Take 2 different estimators of the same unknown quantity, that’s not your quantity of interest
• Each estimator may have its own set of biases, but taking a difference of these two will negate all of the shared biases in the resulting statistic
• Unshared biases will remain in this new statistic and will influence the shape of the underlying distribution
• Deriving or fitting an underlying distribution for this new statistic creates an hypothesis testing framework for the quantity of interest

This is the essence that underlies the mathematical and intuitive theory of this paper, and it’s an ingenious approach that plays out very well. It’s extremely clever and something I haven’t seen done before.

This statistic is still used in research today, although this paper has inspired many derivative works, such as Fay and Wu’s $$H$$ statistic3 (this paper nicely describes Tajima’s argument, and does so very briefly).

For the scope of this post, I’ve left out the mathematical derivation of $$Var[d]$$, and how to define a good estimator for it. But because I like mathematical derivations and wanted to provide some commentary on the derivation itself, I’ve added it as an appendix to this post, below.

Mathematical derivation

Assumption 1: Assume a random mating population of $$N$$ diploid individuals without selection and without recombination between DNA sequences.

Assumption 2: The infinite sites assumption (i.e. the number of sites on a DNA sequence is so large that new mutations must take place at a new site, and not overwrite an existing mutation).

Under these assumptions, it has been shown that $$S$$ has the following properties:

\begin{align} \mathbb{E}[S] &= a_1 M \\\\ Var[S] &= a_1 M + a_2 M^2 \\\\ a_1 &= \sum_{i=1}^{n - 1} \frac{1}{i} \\\\ a_2 &= \sum_{i=1}^{n - 1} \frac{1}{i^2} \end{align}

where $$n$$ is the number of DNA sequences under observation, as mentioned above.

Similarly, for $$\bar{k}$$, we have that

\begin{align*} \mathbb{E}[\bar{k}] &= M \\\\ Var[\bar{k}] &= b_1 M + b_2 M^2 \\\\ b_1 &= \frac{n+1}{3(n-1)} \\\\ b_2 &= \frac{2(n^2 + n + 3)}{9n(n - 1)} \end{align*}

For any random variable, $$V$$, $$Var[V] = \mathbb{E}[V^2] - \mathbb{E}[V]^2$$. Thus, we get the following:

$Var[\bar{k}] = \mathbb{E}[\bar{k}^2] - M^2 \\\\ Var\left[\frac{S}{a_1}\right] = \frac{1}{a_1^2}Var[S] = \frac{1}{a_1^2}(\mathbb{E}[S^2] - a_1^2 M^2) = \mathbb{E}\left[\frac{S^2}{a_1^2}\right] - M^2$

Deriving the variance for $$d$$

Since $$\mathbb{E}[d] = 0$$, we have that

\begin{align} Var[d] &= \mathbb{E}[d^2] \\\\ &= \mathbb{E}\left[\left(\bar{k} - \frac{S}{a_1}\right)^2\right] \\ &= \mathbb{E}\left[\bar{k}^2 - 2\bar{k}\frac{S}{a_1} + \frac{S^2}{a_1^2}\right] \\\\ &= \mathbb{E}[\bar{k}^2] - \frac{2}{a_1}\mathbb{E}[\bar{k}S] + \mathbb{E}\left[\frac{S^2}{a_1^2}\right] \\\\ &= (\mathbb{E}[\bar{k}^2] - M^2) - \left(\frac{2}{a_1}\mathbb{E}[\bar{k}S] - 2M^2\right) + \left(\mathbb{E}\left[\frac{S^2}{a_1^2}\right] - M^2\right) \\ &= Var[\bar{k}] - \frac{2}{a_1}(\mathbb{E}[\bar{k}S] - a_1M^2) + \frac{1}{a_1^2}Var[S] \\\\ \end{align}

To solve the middle term, let’s look at the covariance of $$\bar{k}$$ and $$S$$.

\begin{align} Cov[\bar{k}, S] &= \mathbb{E}[(\bar{k} - M)(S - a_1 M)] \\\\ &= \mathbb{E}[\bar{k}S - (a_1 \bar{k} + S)M + a_1 M^2] \\\\ &= \mathbb{E}[\bar{k}S] - (a_1 M + a_1 M)M + a_1 M^2 \\\\ &= \mathbb{E}[\bar{k}S] - a_1 M^2 \end{align}

This, conveniently, is our middle term. Thus, we’re left with

$$$Var[d] = Var[\bar{k}] - \frac{2}{a_1}Cov[\bar{k}, S] + \frac{1}{a_1^2} Var[S]$$$

It can be shown that $$Cov[\bar{k}, S] = M + \left(\frac{1}{2} + \frac{1}{n}\right)M^2$$, which gives us the following

\begin{align} Var[d] &= Var[\bar{k}] - \frac{2}{a_1}Cov[\bar{k}, S] + \frac{1}{a_1^2} Var[S] \\\\ &= (b_1 M + b_2 M^2) - \frac{2}{a_1}\left( M + \left(\frac{1}{2} + \frac{1}{n}\right)M^2 \right) + \left(\frac{1}{a_1} M + \frac{a_2}{a_1^2} M^2 \right) \\\\ &= \left(b_1 - \frac{2}{a_1} + \frac{1}{a_1}\right) M + \left(b_2 + \frac{a_2}{a_1^2} - \frac{1}{a_1} - \frac{2}{n a_1}\right) M^2 \\\\ &= c_1 M + c_2 M^2 \\\\ \end{align}

We now have a value for $$Var[d]$$.

Deriving an estimator for $$D$$

We can calculate $$c_1$$ and $$c_2$$ explicitly, since they are functions of known quantities. Unfortunately, $$M$$ and $$M^2$$ are unknown values that we need to estimate. Conveniently, we have 2 estimators for $$M$$ ($$S$$ and $$\bar{k}$$). We could estimate $$M^2$$ as the square of our estimate for $$M$$, but it may be a biased estimate. To calculate an unbiased estimate for $$M^2$$, we want to rewrite a known equation that includes $$M^2$$ such that the result is the expected value of some combination of other variables (i.e. $$M^2 = \mathbb{E}[f(S, \bar{k})]$$).

For $$n \geq 3$$, we have that $$Var[S] < Var[\bar{k}]$$, so let’s derive a relationship between $$S$$ and $$M$$.

\begin{align} Var[S] &= \mathbb{E}[S^2] - \mathbb{E}[S]^2 \\\\ \mathbb{E}[S^2] &= Var[S] + \mathbb{E}[S]^2 \\\\ &= (a_1 M + a_2 M^2) + (a_1 M)^2 \\\\ &= (a_1 M) + (a_1^2 + a_2) M^2 \\\\ &= \mathbb{E}[S] + (a_1^2 + a_2) M^2 \\\\ M^2 &= \frac{\mathbb{E}[S^2] - \mathbb{E}[S]}{a_1^2 + a_2} \\\\ &= \mathbb{E}\left[\frac{S^2 - S}{a_1^2 + a_2}\right] \\\\ \end{align}

If we let $$m = \frac{S^2 - S}{a_1^2 + a_2}$$, then we know, by construction, that $$m$$ is an unbiased estimator for $$M^2$$ (i.e. $$\mathbb{E}[m] = M^2$$).

Our estimate $$v$$, for $$Var[d]$$ becomes

\begin{align} v &= c_1 M + c_2 M^2 \\\\ &= \frac{c_1}{a_1} S + \frac{c_2}{a_1^2 + a_2} S(S-1) \\\\ &= e_1 S + e_2 S(S-1) \end{align}

and our estimate $$\hat{D}$$, for $$D$$ is then

$$$\hat{D} = \frac{\bar{k} - \frac{S}{a_1}}{\sqrt{e_1 S + e_2 S(S - 1)}}$$$

We can now calculate $$\hat{D}$$, for an observable set of mutations, and know that our estimator is unbiased, with mean 0 and variance 1.