In all the talk of super computers there is not...

mtd mtd at centrum.cz
Fri Sep 7 08:41:44 EDT 2007


Leichter, Jerry wrote:
> > | A couple of questions. How did you come up with the ~2.5 bits per
> > | word? Would a longer word have more bits?
> > He misapplied an incorrect estimate!   :-)  The usual estimate - going
> > back to Shannon's original papers on information theory, actually - is
> > that natural English text has about 2.5 (I think it's usually given as
> > 2.4) bits of entropy per *character*.  There are several problems here:
> >
> > 	- The major one is that the estimate should be for *characters*,
> > 		not *words*.  So the number of bits of entropy in
> > 		a 55-character phrase is about 137 (132, if you use
> > 		2.4 bits/character), not 30.


I think in weird ways.  :-)  The rationale behind it follows:

I assume that the passphrase is syntactically correct English, which
already reduces the number of possible combinations by a great amount.
I also want to narrow the search further, so I focus on the most
probable sentences.

It seems ideal to describe this problem with a stochastic grammar. Such
a grammar can be used to produce:

1) probabilities for the sentences, so that:
	a) the total count (state space) can be reduced by a threshold
	b) the sentences can be sorted by probability
2) an estimate of the Shannon entropy (which lets me estimate bits per
sentence or per word, and possibly craft a more effective algorithm to
walk through the space); a toy sketch of both uses follows below
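
To make (1) and (2) concrete, here is a toy sketch in Python (my own
illustration, not part of any real tool; a tiny bigram model with
invented probabilities stands in for a proper stochastic grammar):

import math

# Toy bigram "grammar": P(next word | current word).  All probabilities
# here are invented for illustration only.
start = {"mary": 0.6, "john": 0.4}
bigram = {
    "mary":   {"had": 1.0},
    "john":   {"had": 1.0},
    "had":    {"a": 1.0},
    "a":      {"little": 0.7, "big": 0.3},
    "little": {"lamb": 0.8, "dog": 0.2},
    "big":    {"lamb": 0.1, "dog": 0.9},
    "lamb":   {},
    "dog":    {},
}

def all_sentences():
    """Enumerate every sentence the toy grammar generates, with its probability."""
    out = []
    def walk(prefix, p):
        nexts = bigram[prefix[-1]]
        if not nexts:                      # terminal word: sentence is complete
            out.append((prefix, p))
            return
        for w, pw in nexts.items():
            walk(prefix + [w], p * pw)
    for w, pw in start.items():
        walk([w], pw)
    return out

sentences = all_sentences()

# (1a) reduce the state space by a threshold, (1b) sort by probability
likely = sorted((s for s in sentences if s[1] >= 0.05), key=lambda s: -s[1])
for words, p in likely:
    print("%.3f  %s" % (p, " ".join(words)))

# (2) Shannon entropy of the sentence distribution, per sentence and per word
H = -sum(p * math.log2(p) for _, p in sentences if p > 0)
avg_words = sum(len(w) * p for w, p in sentences)
print("entropy: %.2f bits/sentence, ~%.2f bits/word" % (H, H / avg_words))

A real attack would of course need a much larger grammar, but the two
outputs are exactly what I want: a ranked, thresholded candidate list
and a bits-per-word figure.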

At this point I did a little test for one phrase and played with it a
bit. I wanted to know how likely it is that someone could infer that
passphrase using a stochastic grammar description. I asked Google for
approximate counts of phrases (results sorted by count):

"had a look" 2100000
"had a car"   591000
"had a little lamb" 590000
"had a drink" 562000
"mary had a little lamb" 522000
"had a fight" 466000
mary had a little fleece white snow 322000 //not a phrase
"had a president" 80200
"had a snow"   42400
"had a lamb"   27300

and also:

"I have been there"  947000
"to rescue"         2190000

"had" 1.2E9
"is"  3.68E9
"the" 5E9
"a"  7.2E9

From this I assumed that Google indexes about 5*10^9 pages. It can be
seen clearly that "had a little lamb" is a common phrase (relatively,
among similar phrases). It can also be seen that the whole rhyme has
about half the count of the phrase "had a little lamb".
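
To turn those rough counts into bits, here is a back-of-the-envelope
sketch (my own; it treats the approximate hit counts above as relative
frequencies, and note the whole-rhyme number comes from a word search,
not a quoted phrase):

import math

# Approximate Google hit counts quoted above (2007 figures).
had_a_little_lamb      = 590000   # "had a little lamb"
mary_had_a_little_lamb = 522000   # "mary had a little lamb"
whole_rhyme            = 322000   # mary had a little fleece white snow (word search)

# Surprisal of prepending "mary", i.e. -log2 P("mary" | "had a little lamb").
bits_mary = -math.log2(mary_had_a_little_lamb / had_a_little_lamb)
print("adding 'mary' costs about %.2f bits" % bits_mary)            # ~0.18

# Very roughly, the rest of the rhyme beyond "had a little lamb":
bits_rest = -math.log2(whole_rhyme / had_a_little_lamb)
print("the remaining rhyme words add about %.2f bits" % bits_rest)  # ~0.87

Crude as it is, this suggests each extra word of a famous rhyme costs
well under one bit.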

At this point I decided not to continue further and assumed that this
passphrase has very low information content, so I used a value of about
2.5 bits/word (which doesn't seem unreasonable when looking at the
numbers above). I didn't calculate the actual value; it can be higher
or lower. If the passphrase were "The car looked at me with a
telescope.", I would estimate it higher (an unusual combination of words).
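
For comparison with the quoted discussion, the character-level and
word-level estimates work out roughly like this (the 55 characters is
from the quoted mail; the ~12-word count is my guess):

chars, words = 55, 12
per_char = 2.5             # character-level estimate (bits/char), as in the quote
per_word = 2.5             # my rough word-level estimate (bits/word)

print("character level: ~%d bits" % (chars * per_char))   # ~137 bits
print("word level:      ~%d bits" % (words * per_word))   # ~30 bits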

Thinking about that original passphrase at the character level, the
Shannon way, is incorrect: it overestimates the information in that
sentence. The word level is better, but still not good enough. The
information content is overestimated by a lot, especially for political
speeches.  ;-)

-- Martin Tomasek

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com


