Demystifying Chao1 with Good-Turing
To estimate the “biodiversity” in a particular area – or, in other words, the number of unique species living in a given environment – ecologists usually have no other option than to rely on incomplete samples. For all sorts of practical reasons, it is virtually impossible to spot all species that are actually present in an area, and hence certain species will be missing from the counts. Consequently, an important research question in ecology is how to reliably estimate the resulting “bias” between the number of species that was observed and the true number of unique species in an area.
Of course, scientists are not only confronted with the possibility of ‘unseen entities’ when counting animal species. In astronomy, too, for example, researchers are interested in counting the number of unseen stars. In linguistics, researchers have also attempted to estimate the total size of a language’s vocabulary using only samples of that language. And in computer science, researchers have tried to estimate the number of undetected “bugs” in software code. In all these applications, the same problem arises: there are more unique stars, words, or bugs than we observe in our samples, and therefore we cannot base our judgement about the diversity of the phenomenon we are investigating directly on our observed counts. That would give a biased – or worse, an impoverished – picture of the true state of affairs. Hence, the statistical methods that have been developed in ecology to combat such biases are very useful in a wide variety of domains.
In the cultural domain, too, there have been attempts to correct biases in our observations with statistical methods. A good example is our study of the loss of European medieval literature (Kestemont et al. 2022), which was recently published in the journal Science. In that study, we applied a specific “Unseen Species” model to medieval stories. Medieval stories (such as the legends about “King Arthur” or the stories surrounding the emperor Charlemagne) were often recorded several times in manuscripts. But not all of these copies or manuscripts have been preserved – and, therefore, it is possible that the stories they contained have not been preserved either. Our aim, then, was to use an “Unseen Species” model to estimate how many “documents” (the manuscripts) and how many “works” (the stories) had been lost. The model we used is called the “Chao1 estimator”, named after its discoverer, Anne Chao, a co-author of our study.
We are often asked how Chao1 can make these estimates of unseen things. That question is not so easy to answer, because the exact motivation and derivation of Chao1 are not particularly intuitive (neither for ecologists nor for cultural scholars). Fortunately, as Chao et al. (2017) have shown, Chao1 can also be derived in a slightly different and much more intuitive way. This derivation is based on the groundbreaking work that the famous computer scientist Alan Turing did at Bletchley Park during World War II. Below, guided by Chao et al. (2017), I will try to demystify Chao1 in terms of Turing’s work. I hope that this will help people unfamiliar with the method to understand exactly how it works and, perhaps more importantly, how (not) to interpret the calculated values.
Imagine that we discovered a snippet from an unknown book. The snippet has a total of 18 words on it, consisting of 6 unique words with the following frequencies:
| word  | frequency |
|-------|-----------|
| the   | 10        |
| for   | 3         |
| woman | 2         |
| park  | 1         |
| snake | 1         |
| book  | 1         |
Both the total number of words and the number of unique words in this scenario are of importance. Based on this snippet, where the total number of words \(n\) is 18, we would like to estimate how many unique words there were in the original, intact book. The snippet thus provides us with a “sample” or a subset of the vocabulary in the original work. The complete vocabulary can be indicated by the symbol \(V\), and the observed vocabulary in the snippet is then called \(V_{\textrm{obs}}\). The aim of unseen species models such as Chao1 is to estimate the true number of unique words \(V\) on the basis of the observed unique vocabulary items (\(V_{\textrm{obs}}\)) and their frequencies. To this end, we need to know how many words have a frequency of 0 in our snippet – or, in other words, how many unique words are missing from our sample. (Note that we assume that the snippet was generated under a sampling process with replacement. This is an assumption of Chao1, which does not always hold for cultural data, and is completely unrealistic when working with snippets of text. See Estimating Richness under Sampling without Replacement for a more elaborate discussion of sampling strategies in relation to Chao1.) If we refer to that number as \(f_0\), we can calculate the total number of unique words \(V\) by adding \(f_0\) to the number of unique observed words \(V_{\textrm{obs}}\):
\begin{equation} V = f_0 + V_{\textrm{obs}} = f_0 + 6 \end{equation}
The key question is, of course, how to calculate \(f_0\). If we assume that our snippet represents the complete work, it is easy to calculate the probability of any observed word. Let’s illustrate that with the word book: if the snippet represents the whole work, we can simply divide the frequency of book by the total number of words in the snippet: \(\frac{1}{18} \approx 0.056\). Similarly, we can calculate the probability that a word has a frequency \(r\), for example \(r = 1\) (words that occur only once, i.e. singletons). There are 3 words in total with a frequency of \(r = 1\), which we write down as \(f_1 = 3\). The probability that a word has a frequency of 1 is then equal to \(f_1\) divided by the total number of words: \(\frac{f_1}{n} = \frac{3}{18} \approx 0.167\). And the average relative frequency of words with \(r = 1\) is then equal to \(\alpha_1 = \frac{1}{18} \cdot \frac{f_1}{f_1} = \frac{1}{18}\). But of course the 18-word snippet is just a snippet, not a whole work. Because the snippet is incomplete, there is a considerable chance that we have not yet seen certain words from the original, complete book. Therefore, in our probability calculations, we must take into account the possibility that we will come across a range of unseen words. But how?
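To make this bookkeeping concrete, here is a minimal Python sketch (the variable names, such as `counts` and `f`, are purely illustrative and not part of the original study) that tabulates the snippet and reproduces the naive probabilities above:

```python
from collections import Counter

# Observed word frequencies in the hypothetical 18-word snippet
counts = {"the": 10, "for": 3, "woman": 2, "park": 1, "snake": 1, "book": 1}

n = sum(counts.values())               # total number of words (tokens): 18
V_obs = len(counts)                    # observed vocabulary size: 6

# Frequency-of-frequencies: f[r] = number of words occurring exactly r times
f = Counter(counts.values())           # f[1] = 3, f[2] = 1, f[3] = 1, f[10] = 1

# Naive estimates, assuming the snippet *is* the complete work
p_book = counts["book"] / n            # 1/18 ≈ 0.056
p_singleton = f[1] / n                 # 3/18 ≈ 0.167: probability of a singleton
alpha_1_naive = (1 / n) * f[1] / f[1]  # 1/18: average relative frequency of singletons

print(n, V_obs, round(p_book, 3), round(p_singleton, 3))
```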
Well, let’s suppose that we find another snippet of the same manuscript containing one additional word. What, then, is the probability that this additional word is another instance of the already observed book? Note that if this new word is indeed the word book, then book no longer occurs once but twice. In other words, book then no longer has the frequency \(r = 1\) but \(r = 1 + 1\). To account for such situations, Turing and his student Good proposed not to estimate the probability of a word occurring \(r\) times with \(\frac{r}{n}\), but with an adjusted count \(\frac{r^*}{n}\) (note the asterisk). The idea behind this adjusted count is that the probability of seeing an observed word one more time (i.e. \(r + 1\) times) can be estimated from the words that have already been seen \(r + 1\) times. This can be formulated as: \(r^* = (r + 1) \frac{f_{r + 1}}{f_r}\). The average relative frequency of words that occur \(r\) times can then be computed with:
\begin{equation} \alpha_r = \frac{r^*}{n} = \frac{(r + 1)}{n} \frac{f_{r + 1}}{f_r}. \end{equation}
To illustrate: the word book occurs once in the snippet, and there are two other words that occur once. This, as stated before, gives \(f_1 = 3\). There is one word with a frequency of 2, so \(f_2 = 1\). If we plug these numbers into the formula above, we can calculate the probability that the new word is the word book: \(\frac{(1 + 1)}{18} \cdot \frac{1}{3} \approx 0.037\). It is important to note that this new probability is smaller than the relative frequency of book in the snippet that we calculated earlier (\(1 / 18 \approx 0.056\)). This is because the Good-Turing counts act as a kind of frequency correction, ensuring that we leave some probability space for any unseen words – and hence the corrected probability is smaller.
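The same correction takes only a few lines of Python. The helper below is a sketch of the classical Good-Turing estimate from the equation above (the function name `alpha_gt` is mine, purely illustrative):

```python
def alpha_gt(r, f, n):
    """Classical Good-Turing estimate of the average relative frequency
    of words occurring r times: (r + 1) / n * f_{r+1} / f_r."""
    return (r + 1) / n * f.get(r + 1, 0) / f[r]

# Frequency-of-frequencies for the snippet (f_1 = 3, f_2 = 1, ...)
f = {1: 3, 2: 1, 3: 1, 10: 1}
n = 18

print(alpha_gt(1, f, n))   # ≈ 0.037, smaller than the naive 1/18 ≈ 0.056
```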
Before we continue and show how to use all of this to compute \(f_0\), it is important to make a small modification to the original Good-Turing formula. In a follow-up study, Chiu et al. (2014) show that the Good-Turing estimate can be made even more precise with:
\begin{equation} \alpha_r = \frac{(r + 1) f_{r + 1}}{(n - r) f_r}. \end{equation}
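In code, the only change is the denominator: \(n - r\) instead of \(n\) (again a sketch, with the illustrative name `alpha_chiu`):

```python
def alpha_chiu(r, f, n):
    """Modified Good-Turing estimate of Chiu et al. (2014):
    (r + 1) * f_{r+1} / ((n - r) * f_r)."""
    return (r + 1) * f.get(r + 1, 0) / ((n - r) * f[r])

f = {1: 3, 2: 1, 3: 1, 10: 1}
n = 18

print(alpha_chiu(1, f, n))   # 2 * 1 / (17 * 3) ≈ 0.039
```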
For unseen words, we cannot calculate the average relative frequency \(\alpha_0\), because the formula would require the unknown \(f_0\) in its denominator: \(\alpha_0 = \frac{f_1}{n f_0}\). What we are able to calculate, however, is the estimated proportion of the total number of words that are unseen. That proportion is equal to the product of the average relative frequency of unseen words \(\alpha_0\) and the number of unseen words \(f_0\): \(\alpha_0 f_0 = \frac{f_1}{n f_0} f_0 = \frac{f_1}{n}\). In other words, the unknown \(f_0\) cancels out, and the proportion of unseen words can be approximated by simply dividing the number of words that occur once by the total number of words: \(\frac{f_1}{n}\). Essentially, we strip \(f_0\) from the formula.
We are now almost in a position to grasp how to estimate \(f_0\). As I mentioned, we can estimate the product \(\alpha_0 f_0\) with \(\frac{f_1}{n}\). A simple way to compute \(f_0\) would be to divide this product by \(\alpha_0\). Unfortunately, we do not know the average relative frequency of unseen words. We therefore have to work with what we do know, namely that the average relative frequency of unseen words is probably lower than that of words occurring once. That is to say: \(\alpha_0 \leq \alpha_1\). Since dividing by a larger (or equal) quantity yields a smaller (or equal) result, it follows that \(\frac{\alpha_0 f_0}{\alpha_0}\) must be at least equal to or greater than \(\frac{\alpha_0 f_0}{\alpha_1}\):
\begin{equation} f_0 = \frac{\alpha_0 f_0}{\alpha_0} \geq \frac{\alpha_0 f_0}{\alpha_1}. \end{equation}
We can rewrite that last expression (\(\frac{\alpha_0 f_0}{\alpha_1}\)) as the calculable \(\frac{f_1}{n} / \frac{2 f_2}{(n - 1) f_1}\). If we apply this formula to our snippet, the estimate is that we have not yet seen at least \(\frac{3}{18} / \frac{2 \cdot 1}{(18 - 1) \cdot 3} = 4.25 \approx 4\) words. It is important to emphasize again that \(\frac{\alpha_0 f_0}{\alpha_0}\) is at least equal to or greater than \(\frac{\alpha_0 f_0}{\alpha_1}\), and that our estimate of \(f_0\) is thus a lower bound on the true number of unseen unique words. This estimate, based on Good-Turing counts, gives the same lower bound for \(f_0\) that Anne Chao has proven rigorously (Chao 1984), and that can be calculated with her well-known unseen species model, Chao1, which can be formulated as: \(\frac{(n - 1)}{n} \frac{f_1^2}{2 f_2}\).
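Putting the pieces together, both routes can be checked numerically. The brief sketch below (variable names are again illustrative) confirms that the Good-Turing derivation and the Chao1 formula yield the same lower bound for our snippet:

```python
f1, f2, n, V_obs = 3, 1, 18, 6

# Good-Turing route: (f_1 / n) divided by alpha_1 = 2 f_2 / ((n - 1) f_1)
alpha_1 = 2 * f2 / ((n - 1) * f1)
f0_gt = (f1 / n) / alpha_1                   # ≈ 4.25 unseen words

# Chao1 formula: (n - 1) / n * f_1^2 / (2 f_2)
f0_chao1 = (n - 1) / n * f1**2 / (2 * f2)    # ≈ 4.25, identical

# Lower bound on the true vocabulary size: V = V_obs + f_0
V_hat = V_obs + f0_chao1                     # ≈ 10.25
print(f0_gt, f0_chao1, V_hat)
```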
The Chao1 model thus provides a lower bound on the ecological, or cultural, richness in a given area. Using the Good-Turing perspective, we can now also clarify when this lower bound is in fact an accurate estimate of the true richness. The reasoning is as follows: if the average relative frequency of unseen words (\(\alpha_0\)) is approximately equal to that of words that occur once (\(\alpha_1\)), then there is no longer any difference between \(\frac{\alpha_0 f_0}{\alpha_0}\) and \(\frac{\alpha_0 f_0}{\alpha_1}\). In that case, the method can provide an unbiased estimate of the number of unique species – or, say, cultural artefacts, such as medieval stories – that exist in a given area, including the ones that have not been seen.