<br><br><div class="gmail_quote"><div dir="ltr">On Thu, 21 Dec 2017, 19:31 John Denker via cryptography, &lt;<a href="mailto:cryptography@metzdowd.com">cryptography@metzdowd.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 12/20/2017 02:02 AM, Robin Wood wrote:<br>

> I'm working on a bit of crypto with my young daughter and we are about to<br>

> look at frequency analysis. Are there any short UK English paragraphs where<br>

> the frequency of letters is about what you would expect based on frequency<br>

> charts? i.e. E then T, A and O.<br>

><br>

> Bonus if the digraphs are also roughly in order.<br>

><br>

> I want to count the letters by hand so don't want anything too long and it<br>

> has to be PG content.<br>

<br>

<br>

The question is both trivial to answer, and impossible.<br>

<br>

It is trivial for linguistic and cryptological reasons:<br>

Almost any reasonably large sample of English will<br>

display characteristic English letter-frequencies.<br>

<br>

This is not mathematically guaranteed;  it is just a<br>

known property of natural language.<br>

<br>

It is an important property.  Frequency analysis is<br>

not a known-text or chosen-text attack, where you<br>

know a priori that the text has the exact "expected<br>

frequencies".  It works for any halfway-reasonable<br>

text.  This is the fatal weakness of any monoalphabetic<br>

substitution cipher.<br>

<br>

==========<br>

<br>

In contrast, there are good mathematical reasons why<br>

no finite sample will display the "expected frequencies"<br>

exactly.<br>

<br>

Frequency is a type of probability.  There are lots of<br>

probabilities in this world, and lots of frequencies.<br>

In this case we are particularly interested in the<br>

/population/ i.e. all possible texts, which is an<br>

effectively infinite set, and various finite /samples/<br>

that might be drawn from the population.  Statisticians<br>

give these terms technical meanings which unfortunately<br>

diverge from the meanings in any other context, but<br>

let's stick with the statistical definitions here.<br>

<br>

The frequencies observed on any sample will converge<br>

to the frequencies on the population in the limit<br>

of large sample-sizes ... but we are talking about<br>

convergence in the limit, not equality for any finite<br>

sample.<br>

<br>

For any finite sample, /statistical fluctuations/<br>

mean that the sample frequencies will almost certainly<br>

differ from the population frequencies.  You can<br>

use properties of the population to predict the<br>

distribution of fluctuations (as a function of<br>

sample size) if you want.<br>
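<br>
A rough sketch of such a prediction, assuming (as a simplification that real text does not satisfy exactly) that each letter is an independent draw with probability p.<br>
Under that assumption the standard deviation of the observed frequency in a sample of n letters is sqrt(p(1-p)/n).<br>
The function name and the value 0.127 for E are illustrative assumptions, not taken from any particular chart:<br>

```python
import math


def frequency_std_error(p, n):
    """Standard deviation of the observed frequency of a letter whose
    population probability is p, in a sample of n letters.

    Assumes independent draws (a binomial model), which is only an
    approximation for natural-language text.
    """
    return math.sqrt(p * (1.0 - p) / n)


# Assumed, approximate population frequency of E in English text.
P_E = 0.127

for n in (100, 1000, 10000):
    se = frequency_std_error(P_E, n)
    print(f"n={n:6d}  E frequency about {P_E:.3f} +/- {se:.4f}")
```

Under this model a hundredfold increase in sample size shrinks the typical fluctuation only tenfold.<br>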

<br>

The larger the number of observables (e.g. the 26<br>

different letter frequencies) the smaller your<br>

chance of seeing the "expected frequencies" exactly.<br>

<br>

On the other hand, the point of the exercise is<br>

statistical /inference/.  Frequency analysis allows<br>

you to infer that the text is English, as opposed<br>

to gibberish.  With a reasonable-sized sample, you<br>

can infer this with high confidence _despite_ the<br>

fluctuations.  The confidence will never be exactly<br>

100%, because the tail of the English distribution<br>

will overlap the tail of the gibberish distribution<br>

"somewhat", but this is not a problem in practice.<br>
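<br>
One way to make this inference concrete is a chi-squared score against a reference table; what follows is a minimal sketch, not a definitive test.<br>
The percentages are approximate, commonly quoted values (published charts differ slightly), and the function name and the two sample strings are assumptions chosen for illustration:<br>

```python
import string
from collections import Counter

# Approximate English letter frequencies in percent (assumed values;
# different charts give slightly different numbers).
ENGLISH_PERCENT = {
    "A": 8.2, "B": 1.5, "C": 2.8, "D": 4.3, "E": 12.7, "F": 2.2,
    "G": 2.0, "H": 6.1, "I": 7.0, "J": 0.15, "K": 0.77, "L": 4.0,
    "M": 2.4, "N": 6.7, "O": 7.5, "P": 1.9, "Q": 0.095, "R": 6.0,
    "S": 6.3, "T": 9.1, "U": 2.8, "V": 0.98, "W": 2.4, "X": 0.15,
    "Y": 2.0, "Z": 0.074,
}


def chi_squared_vs_english(text):
    """Chi-squared statistic of the letter counts in text against the
    assumed English table. A smaller score means a closer fit."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    n = len(letters)
    if n == 0:
        raise ValueError("text contains no letters")
    counts = Counter(letters)
    total_percent = sum(ENGLISH_PERCENT.values())
    score = 0.0
    for letter, percent in ENGLISH_PERCENT.items():
        expected = n * percent / total_percent  # expected count in n letters
        score += (counts[letter] - expected) ** 2 / expected
    return score


english = "It was the best of times, it was the worst of times"
gibberish = "QZXJ" * 10

print("English-like:", round(chi_squared_vs_english(english), 1))
print("Gibberish:   ", round(chi_squared_vs_english(gibberish), 1))
```

Even at roughly forty letters the two scores differ by orders of magnitude, which is the sense in which the overlap of the tails does not matter in practice.<br>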

<br>

Even if you could hunt up a sample that did have<br>

the exact "expected frequencies", it would be very<br>

unwise to use it as the basis of a lesson, because<br>

it would teach a wrong lesson about statistical<br>

fluctuations and statistical inference.<br>

<br>

==> A much better lesson would be to repeat the<br>

experiment with a few different sample-sizes from<br>

the same source, to demonstrate the mathematical<br>

point about fluctuations and convergence ... and<br>

then compare a few disparate sources (e.g. Dickens<br>

versus Rowling), to demonstrate the linguistic<br>

point about near-invariance of the frequencies.<br>

Thirdly, histogram a random process (diceware)<br>

as a control.<br>
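<br>
Once the hand counts are done, the same experiment could be sketched in code along these lines.<br>
The helper names are hypothetical, the Dickens line is only a placeholder for a longer passage, and the uniform random letters stand in for the diceware control:<br>

```python
import random
import string
from collections import Counter


def only_letters(text):
    """Upper-case the text and keep only the letters A-Z."""
    return [c for c in text.upper() if c in string.ascii_uppercase]


def letter_frequencies(text):
    """Return {letter: relative frequency} over the A-Z letters in text."""
    letters = only_letters(text)
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {c: counts[c] / total for c in string.ascii_uppercase}


def frequencies_by_sample_size(text, sizes):
    """Frequencies over the first n letters of text, for each n in sizes."""
    letters = only_letters(text)
    return {n: letter_frequencies("".join(letters[:n])) for n in sizes}


def random_control(n, seed=0):
    """n uniformly random letters, as a stand-in for a dice-based control."""
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(n))


# Placeholder sample; a real run would use a much longer passage.
sample = "It was the best of times, it was the worst of times"

for n, freqs in frequencies_by_sample_size(sample, [10, 20, 39]).items():
    top_four = sorted(freqs, key=freqs.get, reverse=True)[:4]
    print(n, "letters, most common:", top_four)

control = letter_frequencies(random_control(1000))
print("control, highest frequency:", round(max(control.values()), 3))
```

Running the same functions over passages from two different authors gives the comparison of disparate sources.<br>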

<br>

Counting using tally-marks (a) is easier and (b)<br>

constructs a histogram on the fly.  Plot a large<br>

sample with N subsamples, using N colors of ink,<br>

all on the same cumulative histogram, so you can<br>

see the fluctuations and the convergence at a glance.<br>

<br>

Digraphs converge roughly 26 times more slowly, because<br>

there are 26 x 26 = 676 possible pairs to count rather than<br>

26 letters, and so require much larger samples.  This<br>

should come several turns later on the pedagogical<br>

spiral.<br></blockquote></div><div><br></div><div><br></div><div>Oh well, it was worth a try. I'll grab some cuttings from different newspapers, start with those and see how we go.</div><div><br></div><div>We'll start with some counting by hand, then write some code to handle bigger texts and create graphs.</div><div><br></div><div>Might be interesting to try to find texts that do fit the expected frequencies, maybe a discussion of the Queen jumping in a Zumba class :)</div><div><br></div><div>Robin</div><div><br></div><div><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

_______________________________________________<br>

The cryptography mailing list<br>

<a href="mailto:cryptography@metzdowd.com" target="_blank">cryptography@metzdowd.com</a><br>

<a href="http://www.metzdowd.com/mailman/listinfo/cryptography" rel="noreferrer" target="_blank">http://www.metzdowd.com/mailman/listinfo/cryptography</a></blockquote></div>