[Cryptography] paragraph with expected frequencies

Tom Mitchell mitch at niftyegg.com
Fri Dec 22 20:26:28 EST 2017


On Wed, Dec 20, 2017 at 1:02 AM, Robin Wood <robin at digi.ninja> wrote:

> Hi
> Something a little less technical than a normal question...
>
> I'm working on a bit of crypto with my young daughter and we are about to
> look at frequency analysis. Are there any short UK English paragraphs where
> the frequency of letters is about what you would expect based on frequency
> charts? i.e. E then T, A and O.
>
> Bonus if the digraphs are also roughly in order.
>
> I want to count the letters by hand so don't want anything too long and it
> has to be PG content.
>

If you believe Wikipedia (WP), this is harder to do than it sounds.

I would go to Project Gutenberg and grab a pile of age-appropriate books,
poems and stories. Pull them into a page sampler with an automated letter
counter and test each sample against the frequency chart.

This has promise: https://www.gutenberg.org/ebooks/20532
as does: https://www.gutenberg.org/files/40063/40063-h/40063-h.htm
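
A rough sketch of the kind of automated counter I have in mind, in Python.
The names here (EXPECTED_ORDER, letter_counts, observed_order, top_matches)
are my own, and the reference ordering is only one commonly quoted ranking;
charts differ a little past the first few letters, so treat it as an
assumption and swap in whichever chart you are teaching from.

```python
from collections import Counter

# Assumed reference ranking, most to least frequent. Published charts
# agree on the first few letters but differ further down the list.
EXPECTED_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def letter_counts(text):
    """Count the letters a-z, ignoring case, digits, spaces and punctuation."""
    return Counter(ch for ch in text.lower() if "a" <= ch <= "z")


def observed_order(text):
    """Letters present in text, most frequent first; ties broken alphabetically."""
    counts = letter_counts(text)
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


def top_matches(text, n=4):
    """True if the n most frequent letters equal the first n expected letters, in order."""
    return observed_order(text)[:n] == EXPECTED_ORDER[:n]
```

Paste a candidate paragraph into top_matches() and it reports whether the
top four come out as E, T, A, O. Short paragraphs will often fail on ties
alone, which is part of why this is harder than it sounds.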

The assertion that Morse code was organized to shorten transmissions
is worthy of a test.
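
One way to run that test, as a sketch only: compare the plain average
number of dots and dashes per letter with the average weighted by how often
each letter actually turns up in a sample text. The table below is the
International Morse alphabet as I supply it here (the original landline
code differed), and the function names are my own. It counts symbols, not
transmission time, so a dash is treated the same as a dot.

```python
from collections import Counter

# International Morse code for the letters a-z (supplied for this sketch).
MORSE = {
    "a": ".-",   "b": "-...", "c": "-.-.", "d": "-..",  "e": ".",
    "f": "..-.", "g": "--.",  "h": "....", "i": "..",   "j": ".---",
    "k": "-.-",  "l": ".-..", "m": "--",   "n": "-.",   "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.",  "s": "...",  "t": "-",
    "u": "..-",  "v": "...-", "w": ".--",  "x": "-..-", "y": "-.--",
    "z": "--..",
}


def unweighted_mean():
    """Average symbols per letter if all 26 letters were equally likely."""
    return sum(len(code) for code in MORSE.values()) / len(MORSE)


def weighted_mean(text):
    """Average symbols per letter, weighted by letter counts in text."""
    counts = Counter(ch for ch in text.lower() if ch in MORSE)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters a-z")
    return sum(len(MORSE[ch]) * n for ch, n in counts.items()) / total
```

If weighted_mean() on a decent chunk of English comes out well below
unweighted_mean() (about 3.15 symbols), that is at least consistent with
the claim that the short codes went to the common letters.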

https://en.wikipedia.org/wiki/Letter_frequency
"Letter frequencies, like word frequencies
<https://en.wikipedia.org/wiki/Word_frequencies>, tend to vary, both by
writer and by subject. One cannot write an essay about x-rays without using
frequent Xs, and the essay will have an idiosyncratic letter frequency if
the essay is about the frequent use of x-rays to treat zebras in Qatar.
Different authors have habits which can be reflected in their use of
letters. Hemingway <https://en.wikipedia.org/wiki/Ernest_Hemingway>'s
writing style, for example, is visibly different from Faulkner
<https://en.wikipedia.org/wiki/William_Faulkner>'s. Letter, bigram
<https://en.wikipedia.org/wiki/Bigram>, trigram
<https://en.wikipedia.org/wiki/Trigram>, word frequencies, word length, and
sentence length can be calculated for specific authors, and used to prove
or disprove authorship of texts, even for authors whose styles are not so
divergent.

Accurate average letter frequencies can only be gleaned by analyzing a
large amount of representative text. With the availability of modern
computing and collections of large text corpora
<https://en.wikipedia.org/wiki/Corpus_linguistics>, such calculations are
easily made. Examples can be drawn from a variety of sources (press
reporting, religious texts, scientific texts and general fiction) and there
are differences especially for general fiction with the position of 'h' and
'i', with H becoming more common."

Long poems like Longfellow's Evangeline might be sampled to see if a five-
or ten-line sample from ten places in the poem matched.
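
A sketch of that sampling in Python, assuming the poem is already loaded as
a list of lines. The function names and the even spacing of the ten places
are my own choices for illustration; random starting points would do just
as well.

```python
from collections import Counter


def sample_windows(lines, window=5, places=10):
    """Return up to `places` runs of `window` consecutive non-blank lines,
    spaced evenly from the start of the text to the end."""
    lines = [ln for ln in lines if ln.strip()]
    if len(lines) < window or places < 1:
        return []
    last_start = len(lines) - window
    if places == 1:
        starts = [0]
    else:
        starts = [round(i * last_start / (places - 1)) for i in range(places)]
    # Short texts can produce repeated start positions; keep each one once.
    starts = sorted(set(starts))
    return [lines[s:s + window] for s in starts]


def top_letters(lines, n=4):
    """The n most frequent letters a-z in the given lines, most frequent first."""
    counts = Counter(ch for ln in lines for ch in ln.lower() if "a" <= ch <= "z")
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch))[:n])
```

Running top_letters() over each window from sample_windows() shows how
many of the ten samples put E, T, A and O on top, and which passages are
worth counting by hand.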






-- 
  T o m    M i t c h e l l