[Cryptography] paragraph with expected frequencies

Robin Wood robin at digi.ninja
Thu Dec 21 18:05:25 EST 2017


On Thu, 21 Dec 2017, 19:31 John Denker via cryptography, <
cryptography at metzdowd.com> wrote:

> On 12/20/2017 02:02 AM, Robin Wood wrote:
> > I'm working on a bit of crypto with my young daughter and we are about to
> > look at frequency analysis. Are there any short UK English paragraphs
> where
> > the frequency of letters is about what you would expect based on
> frequency
> > charts? i.e. E then T, A and O.
> >
> > Bonus if the digraphs are also roughly in order.
> >
> > I want to count the letters by hand so don't want anything too long and
> it
> > has to be PG content.
>
>
> The question is both trivial to answer, and impossible.
>
> It is trivial for linguistic and cryptological reasons:
> Almost any reasonably large sample of English will
> display characteristic English letter-frequencies.
>
> This is not mathematically guaranteed;  it is just a
> known property of natural language.
>
> It is an important property.  Frequency analysis is
> not a known-text or chosen-text attack, where you
> know a_priori that the text has the exact "expected
> frequencies".  It works for any halfway-reasonable
> text.  This is the fatal weakness of any monoalphabetic
> substitution cipher.
>
> ==========
>
> In contrast, there are good mathematical reasons why
> no finite sample will display the "expected frequencies"
> exactly.
>
> Frequency is a type of probability.  There are lots of
> probabilities in this world, and lots of frequencies.
> In this case we are particularly interested in the
> /population/ i.e. all possible texts, which is an
> effectively infinite set, and various finite /samples/
> that might be drawn from the population.  Statisticians
> give these terms technical meanings which unfortunately
> diverge from the meanings in any other context, but
> let's stick with the statistical definitions here.
>
> The frequencies observed on any sample will converge
> to the frequencies on the population in the limit
> of large sample-sizes ... but we are talking about
> convergence in the limit, not equality for any finite
> sample.
>
> For any finite sample, /statistical fluctuations/
> guarantee that the sample frequencies are expected
> to differ from the population frequencies.  You can
> use properties of the population to predict the
> distribution of fluctuations (as a function of
> sample size) if you want.
>
> The larger the number of observables (e.g. the 26
> different letter frequencies) the smaller your
> chance of seeing the "expected frequencies" exactly.
>
> On the other hand, the point of the exercise is
> statistical /inference/.  Frequency analysis allows
> you to infer that the text is English, as opposed
> to gibberish.  With a reasonable-sized sample, you
> can infer this with high confidence _despite_ the
> fluctuations.  The confidence will never be exactly
> 100%, because the tail of the English distribution
> will overlap the tail of the gibberish distribution
> "somewhat", but this is not a problem in practice.
>
> Even if you could hunt up a sample that did have
> the exact "expected frequencies", it would be very
> unwise to use it as the basis of a lesson, because
> it would teach a wrong lesson about statistical
> fluctuations and statistical inference.
>
> ==> A much better lesson would be to repeat the
> experiment with a few different sample-sizes from
> the same source, to demonstrate the mathematical
> point about fluctuations and convergence ... and
> then compare a few disparate sources (e.g. Dickens
> versus Rowling), to demonstrate the linguistic
> point about near-invariance of the frequencies.
> Thirdly, histogram a random process (diceware)
> as a control.
>
> Counting using tally-marks (a) is easier and (b)
> constructs a histogram on the fly.  Plot a large
> sample with N subsamples, using N colors of ink,
> all on the same cumulative histogram, so you can
> see the fluctuations and the convergence at a glance.
>
> Digraphs converge 26 times more slowly, for obvious
> reasons, and so require much larger samples.  This
> should come several turns later on the pedagogical
> spiral.
>


Oh well, was worth a try. I'll grab some cuttings from different newspapers
and start with those and see how we go.

Start with some counting by hand then write some code to do bigger texts
and create graphs.

Might be interesting to try to find texts that do fit the expected
frequencies, maybe a discussion of the Queen jumping in a Zumba class :)

Robin

Robin



> _______________________________________________
> The cryptography mailing list
> cryptography at metzdowd.com
> http://www.metzdowd.com/mailman/listinfo/cryptography
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.metzdowd.com/pipermail/cryptography/attachments/20171221/1e64feb9/attachment.html>


More information about the cryptography mailing list