In all the talk of super computers there is not...

Leichter, Jerry leichter_jerrold at emc.com
Thu Sep 6 14:24:11 EDT 2007

```| > | Hi Martin,
| > |
| > | I did forget to say that it would be salted so that throws it off by
| > | 2^12
| > |
| > | A couple of questions. How did you come up with the ~2.5 bits per
| > | word? Would a longer word have more bits?
| > He misapplied an incorrect estimate!  :-) The usual estimate - going
| > back to Shannon's original papers on information theory, actually - is
| > that natural English text has about 2.5 (I think it's usually given as
| > 2.4) bits of entropy per *character*.  There are several problems
| > here:
|
| It's less than that.  See, for example, the bottom of the first page of
| http://www.cs.brown.edu/courses/cs195-5/extras/shannon-1951.pdf :
Interesting paper - I hadn't seen that one, only the earlier one that
got an estimate - cited in this one - for 2.3 (not 2.4) bits per
character for samples of length 8 (*very* roughly).

| 	From this analysis it appears that, in ordinary literary
| 	English, the long range statistical effects (up to 100 letters)
| 	reduce the entropy to something of the order of one bit per
| 	letter, with a corresponding redundancy of roughly 75%. The
| 	redundancy may be still higher when structure extending over
| 	paragraphs, chapters, etc. is included.
|
| >
| > 	- The major one is that the estimate should be for
| > *characters*, not *words*.  So the number of bits of entropy in
| > 		a 55-character phrase is about 137 (132, if you use
| > 		2.4 bits/character), not 30.
| >
| > 	- The minor one is that the English entropy estimate looks
| > just at letters and spaces, not punctuation and capitalization.
| > 		So it's probably low anyway.  However, this is a much
| > 		smaller effect.
|
| The interesting question is whether or not one can effectively
| enumerate candidate phrases for a guessing program.  For that problem,
| punctuation and capitalization are important.
Well, for *general purpose* algorithms, you can get a rough idea by
looking at how well the best compressors do.  zip deflate on a random
selection of English text I used managed to reduce the text to about 31%
of its original size.  You can't easily compare this to Shannon's 25%
estimate because zup had an easy job:  The input was 7-bit ASCII, the
top bit of every byte was always 0; and of the remaining 128 possible
bytes, at least 30 (probably more) never occur.  If we assume the
input text had only 70 possible characters in it, then there are
"really" only a little more than 6 bits of true entropy per byte
of input.  This brings the effective compression from the "smart"
parts of the algorithm down to about from 69% to 60%.

zip deflate isn't the state of the art in compression algorithms, but
nothing does all *that* much better.

I suspect the best first-try algorithm for generating attacks would be
an analogue of using a dictionary to guess passwords:  Extract phrases
of the appropriate length from the huge volume of data that is now
readily available on line.  This is likely to catch many pass phrases.

The example in the original message shows how to avoid such an attack:
Don't use "Mary had a little lamb, it's fleece was white as snow";
use a semantic equivalant "Mary had one tiny lamb, with fleece that
were white as snow".  One can probably generate many such variants
algorithmically with little trouble, though.  (What's hard is
eliminating the ones no human would likely use for deep semantic
reasons, but for an attack like this, generating extra ones only
cost you time.)

Probably out of reach today for reasonably long phrases, but I
wouldn't give it very much time.

(It would be interesting to do a detailed analysis for the often-
recommended approach of picking a phrase and using the first letters
of successive words.  Just the distribution of first letters of
words is probably biased, and what the correlation of successive
first letters looks like is anyone's guess - though given the
ready availability of data, it's trivially easy to compute.)

-- Jerry

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com

```