Notation and thinking in math

Published: October 04, 2020   |   Updated: April 21, 2021

Picture your favourite movie featuring some “smart” person. Where are they standing or sitting? What’s in the background? Chances are there’s a blackboard or whiteboard filled with equations or diagrams.

[Image: Good Will Hunting scene with Will at the blackboard solving a graph theory problem]

Why is this? Because mathematical notation is a signature of math. It doesn’t matter where you find it, you can always recognize it, even if you don’t know what it means. It’s why movies and media use the “random unrelated equations filling up a blackboard” as a backdrop, even if it’s not central to the story. It’s an immediately recognizable and visual cue to say someone is smart or working on something hard.

But beyond being a media trope, mathematical notation is important, as is any means of communicating abstract ideas [1]. The mathematician Terry Tao gives a great explanation of why different people use different mathematical notation for the same thing, and has something important to say about what makes good mathematical notation (emphasis mine).

Mathematical notation in a given mathematical field \(X\) is basically a correspondence

\[\text{Notation} : \{\text{well-formed expressions}\} \rightarrow \{\text{abstract objects in }X\}\]

between mathematical expressions (or statements) on the written page (or blackboard, electronic document, etc.) and the mathematical objects (or concepts and ideas) in the heads of ourselves, our collaborators, and our audience. A good notation should make this correspondence \(\text{Notation}\) (and its inverse) as close to a (natural) isomorphism as possible.

All of the different notation in the MathOverflow post above didn't just come from nowhere. Certain people made decisions about the different types of notation, and those notations stuck around because other people liked them. But it does mean that anyone learning something new in math for the first time has to learn the notation along with the content. If you have a million ways to say the same thing, it's confusing. Communication becomes unclear behind the cloud of same-same-but-different notation. This, in turn, makes the concepts themselves unclear.

Unclear notation in statistics

A key example where people get confused is statistics. It’s not uncommon to see sentences like this in statistics textbooks:

Let \(X\) be some random variable. What is the probability that \(X \le x\) for some \(x\)? If \(X < x\) is known, what is the expected value \(E_X\left[ X \mid X < x\right]\)?

What’s with all the \(X\)s? Can’t we use something else? This gets worse when we start talking about multivariate statistics. Matrix elements get confused for independent samples, total distributions get confused with marginal distributions, etc.

To try and address this confusion, I now want to lay out some notation for stats and some inspiration for how to think about it.

A notation for stats inspired by concepts

I’m not going to stray far from typical math notation. All of this is writable with \(\LaTeX\). But the central ideas I want the notation to focus on are:

  1. Random variables
  2. Observations of random variables
  3. Distributions of random variables
  4. Operators on random variables

Random variables

As you can see, this notation stems from the centrality of random variables within statistics. Random variables are the things we think about, model, and try to measure. Everything else follows from what we want to know about random variables. Because of their centrality, we will likely be referring to them often, so we want to make a notation for them that is as simple as possible. For this reason, picking a single letter is probably best.

Notation Rule 1: Random variables are represented by single capital letters. Calculated values derived from random variables are also single capital letters.

In many applications, random variables can be scalars or vectors/matrices/tensors of discrete or continuous values. We need to be able to reference elements of the matrices clearly. But we also often run into situations where we have multiple random variables that relate to each other in some way. Often, it’s convenient to index them in a similar way. But we can’t confuse element indexing with multiple variable indexing. Subscripts and superscripts are often used for these scenarios. I will also try to avoid confusing indexing with taking powers of random variables, since those appear often, as well. This inspires the next rule.

Notation Rule 2: Random variable elements are indexed by subscripts. Multiple, related random variables are indexed with superscripts in parentheses.

For example, if we have \(n\) random variables that are all matrix-valued, we can denote the \((i, j)\)-th element of the \(k\)-th random variable as \(X^{(k)}_{ij}\). That covers random variables. The next important consideration is observations.

Observations

Observations are also very common, so we want this notation to be simple. But observations are tightly related to the random variable they come from. So we need a strong link between the observation and the random variable that immediately tells us what we’re observing.

Notation Rule 3: Observations of random variables are denoted by single, lowercase versions of the random variable they correspond to. Calculated values derived from observations are also lowercase letters.

This makes it immediately clear that \(x^{(k)}\) is an observation of the random variable \(X^{(k)}\). The simplicity of the lowercase also makes it easy to extend variable/element indexing to observations.

Distributions

Random variables usually have some specific properties, or we often assume that a certain random variable will look like something we already know. But we don't work with distributions themselves directly. We work with distributions through random variables, and sometimes multiple random variables can come from the same distribution. So there should still be a link between a random variable and its distribution, but something a bit more mysterious and maybe not as strong as the link between random variables and observations.

Notation Rule 4: Distributions are denoted by cursive or \mathcal letters. For convenience, they can be the \mathcal version of the letter representing the random variable.

If we have a random variable \(X\), then we can easily say that its distribution is \(\mathcal{X}\), or \(X \sim \mathcal{X}\). If \(\mathcal{X}\) is some known distribution, like a normal or beta, then we can say \(\mathcal{X} = \mathcal{N}(\mu, \sigma)\). This also makes it clear that the set of random variables \(\{ X^{(1)}, X^{(2)}, ..., X^{(n)} \}\) can all have the same distribution (\(X^{(i)} \sim \mathcal{X} \forall i\)), which is often the model we work with.

Operators

Operators are tricky. Things like expectation or the probability of an event have to work on discrete, continuous, and even discontinuous variables. They shouldn't be tied to a single variable. But they're also more special than ordinary functions. They need to deal with conditioning events, logical operators, and more.

Notation Rule 5: Special operators are denoted with the \mathbb font and square brackets.

If we have a random variable \(X\), then its expected value is given by \(\mathbb{E}[X]\) and its variance is \(\mathbb{V}[X]\). The probability that \(X\) is less than or equal to some value, \(z\), is denoted by \(\mathbb{P}[X \le z]\).
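These operators act on random variables and their distributions rather than on any single observation, but we can get a feel for them numerically. Here's a minimal sketch (assuming Python with numpy, which is purely my choice for illustration, as are the particular distribution and parameters) that approximates each operator by simulating a large number of draws:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative choice: X ~ Normal(mu, sigma); neither numpy nor these values come from the post
    mu, sigma = 2.0, 3.0
    draws = rng.normal(mu, sigma, size=1_000_000)

    # Monte Carlo approximations of the operators applied to X
    print(draws.mean())         # ~ E[X] = mu
    print(draws.var())          # ~ V[X] = sigma^2
    print((draws <= 0).mean())  # ~ P[X <= 0]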

Benefits of using this notation

These are the most important rules, in my opinion. They isolate these important statistical concepts in clear and distinct notations. To demonstrate their utility, I’ll review two common scenarios that many people learning statistics trip up on.

What is the variance of the sample mean

Consider some random variables \(X^{(1)}, ..., X^{(n)} \sim \mathcal{X}\) that are all independent. These can be \(n\) people whose height you're measuring from a population, or \(n\) bacteria cells whose radius you're measuring in a Petri dish. Let's say you make an observation of \(n\) subjects, so you have values for your observations \(x^{(1)}, ..., x^{(n)}\).

What is the mean of your sample? We know that the mean, \(m\), of a set of values is

\[m = \frac{1}{n}\sum_{i=1}^n x^{(i)}\]

Note the choice of the lowercase \(m\), since its value is built out of the observations \(x^{(i)}\) (Rule 3). This means that if we make another set of observations, our value for \(m\) will be different: the value of \(m\) is itself random, dependent on the observations we happen to get. The notation alone is enough to remind us that the value we calculate is random.
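As a quick illustration of that randomness (a hedged sketch in Python with numpy; the distribution and the numbers are invented for the example), two different sets of \(n\) observations from the same distribution give two different values of \(m\):

    import numpy as np

    rng = np.random.default_rng(1)

    # Two independent sets of n observations from the same (illustrative) distribution
    n = 10
    x_first = rng.normal(5.0, 2.0, size=n)
    x_second = rng.normal(5.0, 2.0, size=n)

    # m is built out of the observations, so it changes when the observations change
    m_first = x_first.mean()
    m_second = x_second.mean()
    print(m_first, m_second)  # two different numbers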

If we think about the notation we’re using, what does \(M\) or \(\mathcal{M}\) correspond to? Clearly \(M\) is the random variable of the sample mean, and \(\mathcal{M}\) is the distribution of that random variable. Because the notation is a strong representation of the ideas we’re thinking about, playing around with the notation is equivalent to playing around with the ideas. This notation gives us a language to describe what we’re looking at and thinking about.

[Image: The train of thought brought about by using a clear notation]

This leads us to immediate questions like “what are \(\mathbb{E}[M]\) and \(\mathbb{V}[M]\)?” or “how can I define \(M\) in terms of random variables?” To answer the latter question, the easiest way to formulate \(M\) is to replace all the lowercase letters with uppercase. Our random variable, \(M\), is then defined by the equation

\[M = \frac{1}{n} \sum_{i=1}^n X^{(i)}\]

We can use this definition to answer the former question about mean and variance, using the properties of \(\mathbb{E}\) and \(\mathbb{V}\) we know (in particular, that the variance of a sum of independent random variables is the sum of their variances). If we define \(\mu = \mathbb{E}[X^{(i)}]\) and \(\sigma^2 = \mathbb{V}[X^{(i)}]\), then

\[\begin{align*} \mathbb{E}[M] &= \mathbb{E} \left[ \frac{1}{n} \sum_{i=1}^n X^{(i)} \right] \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[ X^{(i)} \right] \\ &= \frac{1}{n} \sum_{i=1}^n \mu \\ &= \mu \\ \end{align*}\] \[\begin{align*} \mathbb{V}[M] &= \mathbb{V} \left[ \frac{1}{n} \sum_{i=1}^n X^{(i)} \right] \\ &= \frac{1}{n^2} \sum_{i=1}^n \mathbb{V} \left[ X^{(i)} \right] \\ &= \frac{1}{n^2} \sum_{i=1}^n \sigma^2 \\ &= \frac{\sigma^2}{n} \\ \end{align*}\]

We may not know what \(\mathcal{M}\) or \(\mathcal{X}\) are, but we do know that the mean and variance of \(\mathcal{M}\) and \(\mathcal{X}\) are related. Importantly, the variance of \(\mathcal{M}\) is related to the number of samples, \(n\), but its mean is not.

We can easily see these facts through the notation we use. In my teaching experience, separating what \(m\) is from what \(M\) is, and why that distinction matters in statistics, is one of the most important things that people fail to grasp.
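We can also check these relationships numerically. The following is a minimal simulation sketch (assuming Python with numpy; the distribution, \(\mu\), \(\sigma\), \(n\), and the number of repeats are all illustrative choices, not anything from the derivation): repeat the experiment many times, compute \(m\) each time, and look at the mean and variance of those values.

    import numpy as np

    rng = np.random.default_rng(2)

    mu, sigma, n = 5.0, 2.0, 10
    repeats = 100_000

    # Each row is one experiment: n independent draws from the same distribution
    samples = rng.normal(mu, sigma, size=(repeats, n))

    # One value of m per experiment; together they approximate the distribution of M
    m_values = samples.mean(axis=1)

    print(m_values.mean())  # ~ E[M] = mu
    print(m_values.var())   # ~ V[M] = sigma^2 / n
    print(sigma**2 / n)     # theoretical value, for comparison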

Why do we use 1/(n-1) for the sample variance

I’ll finish with one more example that trips up a lot of people when they start learning statistics. When we talk about the variance (or standard deviation), why is there a \(\frac{1}{n - 1}\) instead of a \(\frac{1}{n}\) for the “sample” compared to the “population”? I propose that we can clear up the confusion that leads to this question by using clear notation. Clear notation emphasizes what we’re trying to calculate and which objects we’re trying to relate to each other, better than terms like “population” or “sample” do.

Consider the situation from the last example. We have \(n\) independent random variables \(X^{(1)}, ..., X^{(n)} \sim \mathcal{X}\). Let’s define \(\mu = \mathbb{E} \left[ X^{(i)} \right]\) and \(\sigma^2 = \mathbb{V} \left[ X^{(i)} \right]\) as some finite but unknown values.

How can we estimate what \(\sigma^2\) is from our random variables? And how can we ensure our estimates are accurate?

Let’s pretend we’re working with some other finite random variable, \(Z\), that takes one of \(n\) equally likely values. Let’s denote these \(n\) possible values for \(Z\) as \(z^{(1)}, ..., z^{(n)}\). Then we know that the mean and variance of the random variable \(Z\) would be

\[\begin{align*} \mathbb{E} \left[Z \right] &= \frac{1}{n}\sum_{i = 1}^n z^{(i)} \\ \mathbb{V} \left[Z \right] &= \mathbb{E} \left[ (Z - \mathbb{E}[Z])^2 \right] \\ &= \frac{1}{n}\sum_{i = 1}^n (z^{(i)} - \mathbb{E}[Z])^2 \\ &= \frac{1}{n}\sum_{i = 1}^n \left( z^{(i)} - \frac{1}{n}\sum_{j = 1}^n z^{(j)} \right)^2 \\ \end{align*}\]
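As a small side note (an illustrative check assuming Python with numpy, which the post doesn't otherwise use), this \(\frac{1}{n}\) formula is exactly what numpy's var function computes by default (ddof=0):

    import numpy as np

    # n equally likely values that Z can take (invented for the example)
    z = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

    # E[Z] and V[Z] for the finite, equally likely case
    mean_z = z.mean()
    var_z = ((z - mean_z) ** 2).mean()  # the 1/n formula above

    print(var_z)              # 4.0
    print(np.var(z, ddof=0))  # numpy's default matches the 1/n formula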

We still don’t know how to estimate \(\sigma^2\), but we can start with the above to hazard a guess. Let’s construct a random variable out of the random variables we’re given, \(\{ X^{(i)} \}\). Because it’s made of random variables, this new value is itself a random variable (Rule 1), so we’ll give it a name with a capital letter, say \(R\). We can’t use \(\sigma^2\) or \(\mu\) in defining \(R\) since we don’t actually know what they are.

\[\begin{align*} R &= \frac{1}{n}\sum_{i = 1}^n \left( X^{(i)} - \frac{1}{n}\sum_{j = 1}^n X^{(j)} \right)^2 \\ &= \frac{1}{n}\sum_{i = 1}^n \left( X^{(i)} - M \right)^2 \\ \end{align*}\]

where \(M\) is defined as in the previous section.

Is this a good way to estimate \(\sigma^2\)? We’re not sure right now, but because \(R\) is a random variable, \(R\) has some distribution \(\mathcal{R}\), which should have some mean and variance related to the mean and variance of \(\mathcal{X}\) (namely \(\mu\) and \(\sigma^2\)). Let’s try to calculate \(\mathbb{E}[R]\).

\[\begin{align*} \mathbb{E}[R] &= \mathbb{E} \left[ \frac{1}{n}\sum_{i = 1}^n \left( X^{(i)} - M \right)^2 \right] \\ &= \frac{1}{n}\sum_{i = 1}^n \mathbb{E} \left[ \left( X^{(i)} - M \right)^2 \right] \\ &= \frac{1}{n}\sum_{i = 1}^n \mathbb{E} \left[ \left( \left\{ X^{(i)} - \mu \right\} - \left\{ M - \mu \right\} \right)^2 \right] \\ &= \frac{1}{n}\sum_{i = 1}^n \mathbb{E} \left[ \left( X^{(i)} - \mu \right)^2 - 2(X^{(i)} - \mu)(M - \mu) + \left( M - \mu \right)^2 \right] \\ &= \frac{1}{n}\sum_{i = 1}^n \left[ \mathbb{E} \left[ \left( X^{(i)} - \mu \right)^2 \right] - 2\mathbb{E} \left[ (X^{(i)} - \mu)(M - \mu) \right] + \mathbb{E} \left[ \left( M - \mu \right)^2 \right] \right] \\ &= \frac{1}{n}\sum_{i = 1}^n \left[ \mathbb{V} \left[ X^{(i)} \right] - 2 \mathbb{Cov}[X^{(i)}, M] + \mathbb{V} \left[ M \right] \right] \\ &= \frac{1}{n}\sum_{i = 1}^n \left[ \sigma^2 - 2 \frac{\sigma^2}{n} + \frac{\sigma^2}{n} \right] \\ &= \left( 1 - \frac{1}{n} \right) \sigma^2 \\ \end{align*}\]
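The covariance step is the one most often glossed over, so it's worth writing out in the same notation. Because the \(X^{(i)}\) are independent, \(\mathbb{Cov}[X^{(i)}, X^{(j)}] = 0\) whenever \(j \ne i\), and only the \(j = i\) term survives:

\[\begin{align*} \mathbb{Cov}[X^{(i)}, M] &= \mathbb{Cov}\left[ X^{(i)}, \frac{1}{n}\sum_{j = 1}^n X^{(j)} \right] \\ &= \frac{1}{n}\sum_{j = 1}^n \mathbb{Cov}\left[ X^{(i)}, X^{(j)} \right] \\ &= \frac{1}{n} \mathbb{V}\left[ X^{(i)} \right] \\ &= \frac{\sigma^2}{n} \\ \end{align*}\]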

As you can see, the expected value of our guess \(R\) isn’t exactly \(\sigma^2\), but it’s pretty close [2]. The mean of \(\mathcal{R}\) is a little bit smaller than the variance of \(\mathcal{X}\) (which is what we really want to know). If we fiddle with \(R\) by multiplying by \(\frac{n}{n-1}\), we’ll get an expected value of exactly \(\sigma^2\). So let’s do that.

Let \(S\) be a random variable defined by

\[\begin{align*} S &= \frac{n}{n - 1} R \\ &= \frac{1}{n - 1}\sum_{i = 1}^n \left( X^{(i)} - \frac{1}{n}\sum_{j = 1}^n X^{(j)} \right)^2 \\ &= \frac{1}{n - 1}\sum_{i = 1}^n \left( X^{(i)} - M \right)^2 \\ \end{align*}\]

Then \(\mathbb{E}[S] = \sigma^2 = \mathbb{V}[X^{(i)}]\). This is what everyone comes to see as the “sample variance”, which is different from the “population variance”. But at no point did I need to introduce “sample” or “population” as special terms. All I needed was the notation of random variables and distributions. Because of the rules around picking notation, it was immediately clear that constructing something out of random variables was itself a random variable, and that it will have certain properties.

The underlying distributions \(\mathcal{X}\) and \(\mathcal{S}\) are related to each other, and since we want to know information about \(\mathcal{X}\) but can’t observe it directly, we can construct some new random variable and still learn information about \(\mathcal{X}\). It’s also clear where this \(n - 1\) in the denominator comes from, and why it’s not the \(n\) we used when working directly with a finite set of equally likely values like \(Z\).
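As with the sample mean, a quick simulation makes the bias visible (a hedged sketch assuming Python with numpy; ddof is numpy's name for the denominator adjustment, and the distribution and sizes are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(3)

    mu, sigma, n = 5.0, 2.0, 10
    repeats = 100_000

    # Each row is one experiment: n independent draws from the same distribution
    samples = rng.normal(mu, sigma, size=(repeats, n))

    # One value of r (the 1/n guess) and one value of s (the 1/(n-1) version) per experiment
    r_values = samples.var(axis=1, ddof=0)
    s_values = samples.var(axis=1, ddof=1)

    print(r_values.mean())         # ~ E[R] = (1 - 1/n) * sigma^2
    print((1 - 1 / n) * sigma**2)  # 3.6, the theoretical value for E[R]
    print(s_values.mean())         # ~ E[S] = sigma^2 = 4.0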

Conclusions

Restating one problem that you don’t know how to solve as another problem that you do know how to solve is the most important way of solving problems. When you use a good notation, it lets you play around with the notation directly in a way that you do understand. Playing with the notation then corresponds directly to playing with the abstract mathematical ideas you’re working with. That’s why good notation is so important.

With good notation, and clear rules for it, I’ve tried to show that certain concepts in stats that are tricky for many people are easier than they’re often made out to be. This doesn’t make stats simple, but it does make it easier.

I use the rules listed above when solving stats problems, as I find they significantly increase my comfort level and my ability to see when I’ve made a mistake. Hopefully this notation will be useful for others trying to learn some stats.

Footnotes

  1. “Notation as a Tool of Thought” is a long and interesting read about precisely this topic. It covers the important characteristics of a good notation and goes into examples in algebra, graph theory, and its application in computer programming. It’s worth a read, if you’re interested in this topic. 

  2. We used \(\mu\) above to help simplify the equation, but we don’t actually make use of its explicit value anywhere. This is great, because we still don’t know what \(\mu\) is, even though we know it exists.