[Cryptography] paragraph with expected frequencies

Thu Dec 21 09:07:41 EST 2017

On 12/20/2017 02:02 AM, Robin Wood wrote:
> I'm working on a bit of crypto with my young daughter and we are about to
> look at frequency analysis. Are there any short UK English paragraphs where
> the frequency of letters is about what you would expect based on frequency
> charts? i.e. E then T, A and O.
> 
> Bonus if the digraphs are also roughly in order.
> 
> I want to count the letters by hand so don't want anything too long and it
> has to be PG content.

The question is both trivial to answer, and impossible.

It is trivial for linguistic and cryptological reasons:
Almost any reasonably large sample of English will
display characteristic English letter-frequencies.

This is not mathematically guaranteed;  it is just a
known property of natural language.

It is an important property.  Frequency analysis is
not a known-text or chosen-text attack, where you
know a_priori that the text has the exact "expected
frequencies".  It works for any halfway-reasonable
text.  This is the fatal weakness of any monoalphabetic
substitution cipher.

==========

In contrast, there are good mathematical reasons why
no finite sample will display the "expected frequencies"
exactly.

Frequency is a type of probability.  There are lots of
probabilities in this world, and lots of frequencies.
In this case we are particularly interested in the
/population/ i.e. all possible texts, which is an
effectively infinite set, and various finite /samples/
that might be drawn from the population.  Statisticians
give these terms technical meanings which unfortunately
diverge from the meanings in any other context, but
let's stick with the statistical definitions here.

The frequencies observed on any sample will converge
to the frequencies on the population in the limit
of large sample-sizes ... but we are talking about
convergence in the limit, not equality for any finite
sample.

For any finite sample, /statistical fluctuations/
guarantee that the sample frequencies are expected
to differ from the population frequencies.  You can
use properties of the population to predict the
distribution of fluctuations (as a function of
sample size) if you want.

The larger the number of observables (e.g. the 26
different letter frequencies) the smaller your
chance of seeing the "expected frequencies" exactly.

On the other hand, the point of the exercise is
statistical /inference/.  Frequency analysis allows
you to infer that the text is English, as opposed
to gibberish.  With a reasonable-sized sample, you
can infer this with high confidence _despite_ the
fluctuations.  The confidence will never be exactly
100%, because the tail of the English distribution
will overlap the tail of the gibberish distribution
"somewhat", but this is not a problem in practice.

Even if you could hunt up a sample that did have
the exact "expected frequencies", it would be very
unwise to use it as the basis of a lesson, because
it would teach a wrong lesson about statistical
fluctuations and statistical inference.

==> A much better lesson would be to repeat the
experiment with a few different sample-sizes from
the same source, to demonstrate the mathematical
point about fluctuations and convergence ... and
then compare a few disparate sources (e.g. Dickens
versus Rowling), to demonstrate the linguistic
point about near-invariance of the frequencies.
Thirdly, histogram a random process (diceware)
as a control.

Counting using tally-marks (a) is easier and (b)
constructs a histogram on the fly.  Plot a large
sample with N subsamples, using N colors of ink,
all on the same cumulative histogram, so you can
see the fluctuations and the convergence at a glance.

Digraphs converge 26 times more slowly, for obvious
reasons, and so require much larger samples.  This
should come several turns later on the pedagogical
spiral.