The Central Limit Theorem and estimation of variance

Published: October 08, 2020   |   Updated: November 12, 2020

In my previous post, I discussed mathematical notation and how it can help in understanding different statistical concepts. The end of that post discussed why we use estimators for the population mean and variance, and how we can know those estimators are useful. In this post, I want to apply some of those rules to talk about one of the most central theorems in all of statistics: the Central Limit Theorem.

Setting the scene

Consider our scenario from before. We have \(n\) independent, identically-distributed random variables \(X^{(1)}, ..., X^{(n)} \sim \mathcal{X}\), with mean \(\mu\) and variance \(\sigma^2\), both of which are finite but unknown. We also don’t know what the distribution \(\mathcal{X}\) is. We don’t need to put any limitations on the distribution itself, other than the finite mean and variance.

We can define a random variable, \(M\), as

\[M = \frac{1}{n} \sum_{i = 1}^n X^{(i)}\]

It is straightforward to show that \(\mathbb{E}[M] = \mu\) and \(\mathbb{V}[M] = \frac{\sigma^2}{n}\) for any value of \(n\) (the variance calculation relies on the independence of the \(X^{(i)}\)).

\[\begin{align*} \mathbb{E}[M] &= \mathbb{E} \left[ \frac{1}{n} \sum_{i=1}^n X^{(i)} \right] \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[ X^{(i)} \right] \\ &= \frac{1}{n} \sum_{i=1}^n \mu \\ &= \mu \\ \end{align*}\] \[\begin{align*} \mathbb{V}[M] &= \mathbb{V} \left[ \frac{1}{n} \sum_{i=1}^n X^{(i)} \right] \\ &= \frac{1}{n^2} \sum_{i=1}^n \mathbb{V} \left[ X^{(i)} \right] \\ &= \frac{1}{n^2} \sum_{i=1}^n \sigma^2 \\ &= \frac{\sigma^2}{n} \\ \end{align*}\]
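If you want to convince yourself of these two identities numerically, here’s a minimal simulation sketch (assuming NumPy is available; the Gamma distribution and all parameter values are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 100_000        # sample size and number of simulated samples
mu, sigma2 = 2.0, 4.0        # population mean and variance we want M to track

# A Gamma(k, theta) distribution has mean k*theta and variance k*theta^2,
# so we can pick k and theta to hit the mu and sigma2 above.
k, theta = mu**2 / sigma2, sigma2 / mu
X = rng.gamma(k, theta, size=(reps, n))

M = X.mean(axis=1)           # one realization of M per simulated sample
print(M.mean(), mu)          # both close to 2.0
print(M.var(), sigma2 / n)   # both close to 0.08
```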

But \(M\), being a random variable, has some distribution, \(\mathcal{M}\). Can we say what \(\mathcal{M}\) is? And what relationship does \(\mathcal{M}\) have with \(\mathcal{X}\)? These questions are the inspiration for the Central Limit Theorem.

The Central Limit Theorem

The theorem itself can be stated in a number of ways (check out the Wikipedia page for different versions of the CLT). But with the notation above, we can simply state the CLT as the following [1]:

\[\lim_{n \rightarrow \infty} \mathcal{M} = \mathcal{N} \left( \mu, \frac{\sigma^2}{n} \right)\]

While we know the values of \(\mathbb{E}[M]\) and \(\mathbb{V}[M]\) for any value of \(n\), we have no idea what its distribution, \(\mathcal{M}\), might be. There are uncountably many distributions with these two properties, which doesn’t exactly narrow it down. The first critical implication of the CLT is that \(\mathcal{M}\) converges to a single distribution in the limit, and that distribution is a normal distribution. The second critical implication is that it doesn’t matter what \(\mathcal{X}\) is: as long as \(n\) is sufficiently large, \(\mathcal{M}\) is sufficiently close to a normal distribution.
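Here’s a rough sketch of that second implication (assuming NumPy and SciPy; the exponential distribution is just a convenient skewed choice): standardized sample means from a decidedly non-normal distribution behave like a standard normal once \(n\) is large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma = 1.0, 1.0                 # Exponential(1) has mean 1 and variance 1

for n in (2, 10, 100):
    X = rng.exponential(scale=1.0, size=(200_000, n))
    Z = (X.mean(axis=1) - mu) / (sigma / np.sqrt(n))   # standardized sample mean
    # Compare a tail probability against the standard normal value (~0.0228).
    print(n, (Z > 2).mean(), 1 - stats.norm.cdf(2))
```

The empirical tail probability drifts toward the normal value as \(n\) grows.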

For this idealized case, we can give a simple proof. The proof is on the Wikipedia page, but I’ll walk through it here, since we’re going to reuse its ideas later on.

A proof of the CLT

For this proof, we’ll need two main tools: Taylor series and characteristic functions of random variables. The idea of the proof is that we’re going to calculate the characteristic function of \(M\), take the Taylor approximation of that function about \(\mu\), then show that this approximation converges to the characteristic function of a normal distribution as \(n \rightarrow \infty\) [2].

The characteristic function, \(\phi_M\), of the random variable, \(M\), is defined as

\[\begin{align*} \phi_M(t) &= \mathbb{E}[e^{itM}] \\ &= \mathbb{E} \left[ \exp\left\{\frac{it}{n} \sum_{i = 1}^n X^{(i)} \right\} \right] \\ &= \prod_{i = 1}^n \mathbb{E} \left[ \exp\left\{\frac{it}{n} X^{(i)} \right\} \right] \\ &= \mathbb{E} \left[ \exp\left\{\frac{it}{n} X^{(i)} \right\} \right]^n \\ \end{align*}\]

We can take a Taylor expansion of \(\exp \left\{ \frac{it}{n} X^{(i)} \right\}\) about the point \(X^{(i)} = \mu\).

\[\begin{align*} \exp \left\{ \frac{it}{n} X^{(i)} \right\} &= e^{it\mu / n} \sum_{k = 0}^\infty \frac{1}{k!} \left( \frac{it}{n} \right)^k \left(X^{(i)} - \mu \right)^k \\ &\approx e^{it\mu / n} \left[ 1 + \frac{it}{n} \left(X^{(i)} - \mu \right) - \frac{t^2}{2n^2} \left(X^{(i)} - \mu \right)^2 \right] \\ \end{align*}\]

Plugging that into our formula for \(\phi_M\), we get

\[\begin{align*} \phi_M(t) &\approx e^{it\mu} \mathbb{E} \left[ 1 + \frac{it}{n} \left(X^{(i)} - \mu \right) - \frac{t^2}{2n^2} \left(X^{(i)} - \mu \right)^2 \right]^n \\ &= e^{it\mu} \left( 1 + \frac{it}{n} \mathbb{E} \left[X^{(i)} - \mu \right] - \frac{t^2}{2n^2} \mathbb{E} \left[ \left( X^{(i)} - \mu \right)^2 \right] \right)^n \\ &= e^{it\mu} \left( 1 + 0 - \frac{t^2\sigma^2}{2n^2} \right)^n \\ &= e^{it\mu} \left( 1 + \frac{1}{n} \left[ \frac{-t^2\sigma^2}{2n} \right] \right)^n \\ \end{align*}\]

This is true \(\forall n\). Playing a little bit loosely with limits [3], we can see that \(\lim_{n \rightarrow \infty} \left( 1 + \frac{1}{n} \left[ \frac{-t^2\sigma^2}{2n} \right] \right)^n = \exp \left\{ -\frac{1}{2} t^2 \sigma^2 / n \right\}\). So our final approximation to \(\phi_M\) as \(n \rightarrow \infty\) is

\[\phi_M(t) = e^{it\mu - \frac{1}{2} t^2 \left( \sigma^2 / n \right)}\]

Comparing this characteristic function with the table of example characteristic functions from Wikipedia shows that this is the same as a normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).
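We can also check this numerically. Below is a quick sketch (assuming NumPy; the exponential distribution and the values of \(t\) are arbitrary choices) comparing the empirical characteristic function of \(M\) against \(e^{it\mu - \frac{1}{2} t^2 \sigma^2 / n}\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 200_000
mu, sigma2 = 1.0, 1.0                 # Exponential(1): mean 1, variance 1

M = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

for t in (0.5, 1.0, 2.0):
    phi_hat = np.exp(1j * t * M).mean()                        # empirical E[exp(itM)]
    phi_normal = np.exp(1j * t * mu - 0.5 * t**2 * sigma2 / n) # normal CF with same mean and variance
    print(t, phi_hat, phi_normal)
```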

Extending the CLT to the sample variance

We can alternatively think of \(M\) as an estimator for the population mean, since \(M\) is calculated from a sample of the original distribution, \(\mathcal{X}\). In my last post, we derived a new random variable, \(S\), which can be thought of as an estimator for the population variance. We showed previously that \(\mathbb{E}[S] = \sigma^2\) [4].

As with \(M\), the expected value of \(S\) is fixed for all values of \(n\). But how much does it vary? What does the underlying distribution, \(\mathcal{S}\), look like? Can we derive a CLT-like theorem for the distribution that \(\mathcal{S}\) converges to as \(n \rightarrow \infty\)?

First, let’s attempt to find \(\mathbb{V}[S]\).

\[\begin{align*} \mathbb{V}[S] &= \mathbb{E} \left[ (S - \sigma^2)^2 \right] \\ &= \mathbb{E} \left[ S^2 \right] - \sigma^4 \\ \mathbb{E} \left[ S^2 \right] &= \frac{1}{(n - 1)^2} \mathbb{E} \left[ \left\{ \sum_{i = 1}^n (X^{(i)} - M)^2 \right\} \left\{ \sum_{j = 1}^n (X^{(j)} - M)^2 \right\} \right] \end{align*}\]

You can see that this expectation is going to involve fourth powers of \(X^{(i)}\). If the fourth moment of \(\mathcal{X}\) isn’t defined, then we can’t make any statement about the value, or even the finiteness, of \(\mathbb{V}[S]\). This calculation will also get complicated because of the implicit definition of \(M\) and the covariance between \(X^{(i)}\) and \(M\). While \(X^{(i)} \perp X^{(j)} \; \forall i \ne j\), products of the form \(\left( X^{(i)} \right)^r \left( X^{(j)} \right)^s M^t\) aren’t easily separable.

But this isn’t the end of the world. If we assume that the fourth moment of \(\mathcal{X}\) is finite, then we can avoid this complicated calculation and use the CLT proof from above to arrive at what \(\mathcal{S}\) will converge to as \(n \rightarrow \infty\).

Convergence of the sample variance

Let’s play the characteristic function game again. The characteristic function of \(S\) is

\[\begin{align*} \phi_S(t) &= \mathbb{E} \left[ e^{itS} \right] \\ &= \mathbb{E} \left[ \exp\left\{ \frac{it}{n - 1} \sum_{i = 1}^n \left( X^{(i)} - M \right)^2 \right\} \right] \\ &= \mathbb{E} \left[ \exp\left\{ \frac{it}{n - 1} \left( X^{(i)} - M \right)^2 \right\} \right]^n \\ \end{align*}\]

The expression inside the expectation is a multivariate function (it’s a function of both \(X^{(i)}\) and \(M\)), so we need to take a multivariate Taylor approximation. To second order, the multivariate Taylor approximation is

\[f(x) \approx f(a) + \nabla f(a) \cdot (x - a) + \frac{1}{2} (x - a)^T H(a) (x - a)\]

where \(\nabla\) is the gradient operator, \(H\) is the Hessian matrix, and \(x\) and \(a\) are vectors. Expanding the above around \(X^{(i)} = \mu\) and \(M = \mu\) gives us

\[\begin{align*} \phi_S(t) &= \mathbb{E} \left[ \exp\left\{ \frac{it}{n - 1} \left( X^{(i)} - M \right)^2 \right\} \right]^n \\ &\approx \mathbb{E} \left[ 1 + \frac{it}{n - 1} \left( X^{(i)} - M \right)^2 \right]^n \\ &= \left( 1 + \frac{it}{n - 1} \mathbb{E} \left[ \left( X^{(i)} - M \right)^2 \right] \right)^n \\ &= \left( 1 + \frac{it}{n - 1} \frac{n - 1}{n} \sigma^2 \right)^n \\ &= \left( 1 + \frac{1}{n} (it\sigma^2) \right)^n \\ \end{align*}\] \[\begin{align*} \lim_{n \rightarrow \infty} \phi_S(t) &= \lim_{n \rightarrow \infty} \left( 1 + \frac{1}{n} (it\sigma^2) \right)^n \\ &= e^{it\sigma^2} \end{align*}\]

Again, we can compare our derived function with the example characteristic functions. It’s easy to spot that \(\phi_S\) is the same as the characteristic function for a degenerate distribution. Essentially, we arrive at something that isn’t a random variable at all. The value that this random variable takes on is deterministically \(\sigma^2\) in the limit that \(n \rightarrow \infty\).
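A small simulation makes this collapse visible (again assuming NumPy, with an exponential distribution standing in for \(\mathcal{X}\)): the mean of \(S\) stays at \(\sigma^2\) while its spread shrinks toward zero as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0                             # Exponential(1) has variance 1

for n in (10, 100, 1000, 10_000):
    X = rng.exponential(1.0, size=(50_000, n))
    S = X.var(axis=1, ddof=1)            # unbiased sample variance of each sample
    print(n, S.mean(), S.std())          # mean stays near 1, spread shrinks toward 0
```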

This is a great sign. As long as the fourth moment of \(\mathcal{X}\) is controlled, our estimator for the population variance will converge to the population variance with certainty, regardless of what distribution \(\mathcal{X}\) is. This can be a big assumption, though: a common distribution like the Cauchy distribution doesn’t have a fourth moment (it doesn’t even have a mean). Still, this convergence to the population variance is reassuring for well-behaved distributions, like the normal and binomial distributions.
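To see what goes wrong without that assumption, here’s an illustrative sketch (assuming NumPy) using the Cauchy distribution, whose moments are undefined: the sample variance never settles down, no matter how large \(n\) gets.

```python
import numpy as np

rng = np.random.default_rng(4)

for n in (100, 10_000, 1_000_000):
    X = rng.standard_cauchy(size=n)
    print(n, X.var(ddof=1))              # wildly different values from run to run
```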

Conclusion

The CLT is central to statistics and experimental design across the sciences. But there’s so much more to the theorem than just the result. The idea behind the proof of the CLT can be extended to random variables other than the sample mean. However, the proof is often too complicated for those who just want to apply statistics to their field of interest. I hope that by walking through the proof here, you can get the gist of it, see where else it can be used, and see what implications it has for things like the sample variance.

I also hope that using the rules for notation I laid out earlier has made the separation between random variables and distributions clearer. Conflating the two often makes it difficult to talk about the CLT, since talking about convergence is tough enough without factoring in the subtle differences between random variables and distributions.

Update: 2020-11-12

I used an asymptotic argument here, which is useful. But an analytic solution that holds for any \(n\) is usually more useful, because it gives insight at every \(n\). This post by Randy Lai has a good description of how to think about \(\mathbb{V}[S]\).

Briefly, if the distribution \(\mathcal{X}\) has kurtosis \(\kappa\), then

\[\mathbb{V}[S] = \frac{\sigma^4}{n(n-1)} \left[ (n - 1) \kappa - n + 3 \right]\]

For a normal distribution, \(\kappa = 3\), and we recover the well-known formula for the variance of the sample variance, \(\mathbb{V}[S] = \frac{2\sigma^4}{n - 1}\). Regardless of whether \(\mathcal{X}\) is normal or not, both the sample mean, \(M\), and the sample variance, \(S\), have variances that are \(\mathcal{O}(\frac{1}{n})\). In the asymptotic limit, however, they approach different distributions.
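Here’s a short sketch (assuming NumPy; \(\sigma^2 = 4\) and \(n = 20\) are arbitrary) checking the finite-\(n\) formula against a simulation for a normal distribution, where \(\kappa = 3\) and the formula reduces to \(\frac{2\sigma^4}{n - 1}\):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n, reps = 4.0, 20, 200_000
kappa = 3.0                                     # kurtosis of the normal distribution

X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
S = X.var(axis=1, ddof=1)                       # unbiased sample variance per sample

formula = sigma2**2 / (n * (n - 1)) * ((n - 1) * kappa - n + 3)
print(S.var(), formula, 2 * sigma2**2 / (n - 1))   # all three should agree closely
```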

Footnotes

  1. I’m being a little lax with the limits here. We can’t take a limit with \(n\) that leaves \(n\) in the resulting expression. The more rigorous way to define it is to define a new variable \(Z = \frac{M - \mu}{\sigma / \sqrt{n}}\). Then the CLT can be stated simply as \(\lim_{n \rightarrow \infty} \mathcal{Z} = \mathcal{N}(0, 1)\). 

  2. For simplicity, we’re assuming that \(\mathcal{X}\) is a univariate distribution. The argument above will still work for a multivariate distribution, but we have to replace the variable \(t\) with the vector variable \(t^T\) everywhere. It makes the notation a bit more confusing, but everything works out in the end. 

  3. Again, we can’t take a limit with \(n\) that leaves \(n\) in the resulting expression. To be rigorous, we should again use \(Z = \frac{M - \mu}{\sigma / \sqrt{n}}\), then find the characteristic function of \(Z\), \(\phi_Z\). Using the same method as above, you can show that \(\phi_Z(t) = e^{-\frac{1}{2} t^2}\), which is the characteristic function of \(\mathcal{N}(0, 1)\). By linearity of the normal distribution, we then have that \(\mathcal{M} \approx \mathcal{N}(\mu, \frac{\sigma^2}{n})\) for any finite \(n\). 

  4. These types of estimators, where the expected value of the estimator is the true value you’re attempting to estimate, are called unbiased estimators. We showed in my last post that the estimator \(R = \frac{1}{n} \sum_{i = 1}^n \left( X^{(i)} - M \right)^2\) is biased because \(\mathbb{E}[R] = \frac{n - 1}{n} \sigma^2\). While we often want unbiased estimators, there are some cases where biased ones are more helpful.