In all the talk of super computers there is not...

Dan Walker danl4 at cs.byu.edu
Fri Sep 7 14:15:29 EDT 2007


This is an interesting analysis, and the right way to proceed, I think,
when dealing with passphrases that contain sequences of natural language
words. I think the 2.5 bits per word estimate, however, is a massive
leap.

The problem is that, when it comes to word sequences, the perplexity of
each successive word is much higher than that of each successive letter
in the same text.  That is, there are many more possible symbols that
can follow the current one.  You identified several alternatives that
have the "had a" prefix, and they are fairly likely.  Given the "had a"
bigram, it might be the case that the conditional entropy of the
following token is fairly small, compared to the general entropy of
English unigrams.  If "had a little" isn't right, though, the number of
possible words that might come next is massive.
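
To make that concrete, here is a rough back-of-the-envelope sketch in
Python (purely illustrative, not part of any real analysis).  It plugs
the approximate Google hit counts quoted later in this message into the
usual entropy formula for the continuations of "had a".  Since it only
covers a handful of known continuations, it badly understates the true
conditional entropy; it's just meant to show the arithmetic.

import math

# Approximate Google hit counts for continuations of "had a", taken
# from the queries quoted below.  Illustrative only; a real estimate
# would need counts for *all* continuations.
counts = {
    "look": 2_100_000,
    "car": 591_000,
    "little": 590_000,   # as in "had a little lamb"
    "drink": 562_000,
    "fight": 466_000,
    "president": 80_200,
    "snow": 42_400,
    "lamb": 27_300,
}

total = sum(counts.values())
# Entropy of the next word, restricted to these eight continuations.
h = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(f"H(next | 'had a') over {len(counts)} continuations: {h:.2f} bits")

Even this artificially closed-off distribution comes out around 2.2
bits; opening it up to the full vocabulary only pushes the number up.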

I think the proper way to continue with this analysis would be to look
into the research that has been done on n-gram language models.  I think
you'll find that even the best models will never get the conditional
entropy of an arbitrary word down to 2.5 bits.  That would mean the
model basically had the next word narrowed down to fewer than 8 choices!
This may occur in some very common combinations, but in general the
conditional entropy will be much higher.  In addition, the phrase-initial
word will always have a fairly high perplexity, because the only thing
you can condition the distribution over possible words on, in that case,
is the fact that it is phrase-initial.
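
As a quick sanity check on what a given per-word entropy means, recall
that perplexity is just 2 raised to the entropy, i.e. the number of
equally likely choices it corresponds to.  A few illustrative lines of
Python:

# perplexity = 2**entropy: the "effective number of equally likely
# next words" that an entropy figure corresponds to.
for bits in (2.5, 5.0, 7.5, 10.0):
    print(f"{bits:4.1f} bits/word -> ~{2**bits:6.1f} equally likely next words")

Reported per-word perplexities for n-gram models on running English
text are typically well above the 2.5-bit (roughly 6-way) line.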

That being said, the idea that the passphrases are much less secure
than traditional character-level analysis would suggest seems spot on.
Google recently published DVDs with their counts of the frequencies of
all n-grams up to 5-grams in their web corpus
(http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html).
Armed with that data and enough resources, one could build a language
model that would make passphrase guessing much more principled and could
reduce the conditional entropy of a passphrase significantly.  In fact,
for passphrases up to 5 words in length, the entire phrase is probably
already in the Google data; it's just a matter of having the resources
to get through them all.
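
For what it's worth, the guessing strategy could look something like the
sketch below.  Everything in it is illustrative: the counts are made-up
stand-ins for the real n-gram data, and a serious attempt would use the
full 5-gram counts with proper smoothing rather than this crude bigram
back-off.

import math

# Stand-in counts; a real attacker would load these from the Google
# n-gram release (or a comparable corpus).  Made up for illustration.
unigram_counts = {"mary": 3_000_000, "had": 1_200_000_000,
                  "a": 7_200_000_000, "little": 150_000_000,
                  "lamb": 5_000_000}
bigram_counts = {("mary", "had"): 600_000, ("had", "a"): 200_000_000,
                 ("a", "little"): 90_000_000, ("little", "lamb"): 700_000}
total_unigrams = sum(unigram_counts.values())

def log2_score(phrase, alpha=0.4):
    """Crude bigram score with a back-off to unigram frequency.
    Higher (less negative) means 'guess this candidate earlier'."""
    words = phrase.lower().split()
    score = math.log2(unigram_counts.get(words[0], 1) / total_unigrams)
    for prev, cur in zip(words, words[1:]):
        if (prev, cur) in bigram_counts and prev in unigram_counts:
            p = bigram_counts[(prev, cur)] / unigram_counts[prev]
        else:
            p = alpha * unigram_counts.get(cur, 1) / total_unigrams
        score += math.log2(p)
    return score

# Candidate phrases would be enumerated best-first by this score.
print(log2_score("Mary had a little lamb"))

Even this toy model ranks "Mary had a little lamb" far ahead of a random
five-word string, which is exactly the problem for passphrase security.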

--dan

> Leichter, Jerry wrote:
>> > | A couple of questions. How did you come up with the ~2.5 bits per
>> > | word? Would a longer word have more bits?
>> > He misapplied an incorrect estimate!   :-)  The usual estimate - going
>> > back to Shannon's original papers on information theory, actually - is
>> > that natural English text has about 2.5 (I think it's usually given as
>> > 2.4) bits of entropy per *character*.  There are several problems
>> > here:
>> >
>> > 	- The major one is that the estimate should be for *characters*,
>> > 		not *words*.  So the number of bits of entropy in
>> > 		a 55-character phrase is about 137 (132, if you use
>> > 		2.4 bits/character), not 30.
>
>
> I think in weird ways.  :-)  The rationale behind it follows:
>
> I assume that the passphrase is in syntactically correct English. So,
> the number of possible combinations is reduced by a great amount. Also,
> I want to reduce the number of combinations, so I focus on the most
> probable sentences.
>
> It seems ideal to use some stochastic grammar to describe this problem.
> This type of grammar can be used to produce:
>
> 1) probabilities for the sentences, so that:
> 	a) the total count (state space) can be reduced by a threshold
> 	b) sentences can be sorted by probability
> 2) an estimate of Shannon entropy (which allows me to estimate bits per
> sentence or per word, and possibly to craft a more effective algorithm
> to walk through the space)
>
> At this point I did a little test for one phrase and played with it a
> little. I wanted to know how likely it is that, using a stochastic
> grammar description, someone could infer that passphrase. I asked
> Google for approximate counts on phrases (results sorted by count):
>
> "had a look" 2100000
> "had a car"   591000
> "had a little lamb" 590000
> "had a drink" 562000
> "mary had a little lamb" 522000
> "had a fight" 466000
> mary had a little fleece white snow 322000 //not a phrase
> "had a president" 80200
> "had a snow"   42400
> "had a lamb"   27300
>
> and also:
>
> "I have been there"  947000
> "to rescue"         2190000
>
> "had" 1.2E9
> "is"  3.68E9
> "the" 5E9
> "a"  7.2E9
>
> From this I assumed that Google indexes about 5*10^9 pages. It can be
> seen clearly that "had a little lamb" is a common phrase (relatively,
> among similar phrases). It can also be seen that the whole rhyme has
> about half the count of the phrase "had a little lamb".
>
> At this point I decided not to continue further and assumed that this
> passphrase has very low information content, so I used a value of about
> 2.5 bits/word (which doesn't seem unreasonable when looking at the
> numbers above). I didn't calculate the actual value; it could be higher
> or lower. If the passphrase were "The car looked at me with a
> telescope.", I would estimate it higher (unusual combination of words).
>
> Thinking about that original passphrase at the character level, the
> Shannon way, is incorrect. It overestimates the information in that
> sentence. The word level is better, but still not good enough.
> Information content is overestimated by many, especially for political
> speeches.  ;-)
>
> -- Martin Tomasek
>

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com


