Entropy Definition (was Re: passphrases with more than 160 bits of entropy)
John Denker
jsd at av8n.com
Sat Mar 25 19:26:51 EST 2006
In the context of
>> 0 occurs with probability 1/2
>> each other number from 1 to 2^{160}+1 happens with
>> probability 2^{-161}.
I wrote:
> This ... serves to illustrate, in an exaggerated way, the necessity
> of not assuming that the raw data words are IID (independent and identically
> distributed).
Correction: IID isn't the issue here. The raw data words described
above are IID. That's not the problem. The problem is that the
distribution is highly skewed.
Executive summary: Small samples do not always exhibit "average" behavior.
Let me explain. There is a very simple way to look at this.
Consider the expression for the entropy of the source,
S := Sum_i P_i log(1/P_i) [1]
where i runs over all symbols in the alphabet; logs are taken base 2, so S
is measured in bits.
One may, with a little care, attribute log(1/P_i) bits of unpredictability
to the (i)th symbol in the alphabet. Then S can be interpreted as the
appropriately weighted _average_ unpredictability per symbol.
In the example quoted above, the minimum log(1/P_i) is vastly smaller than
the average: the symbol 0 contributes only 1 bit of unpredictability, while
every other symbol contributes 161 bits, and the weighted average S works
out to 81 bits per sample. So focusing on the average is unlikely to solve
all the world's problems.
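For concreteness, a minimal Python sketch that evaluates equation [1] for
the distribution quoted above, alongside the minimum per-symbol
unpredictability:

    from math import log2

    # The distribution quoted above: symbol 0 has probability 1/2; each of
    # the 2**160 remaining symbols has probability 2**-161.
    p0 = 0.5
    p_other = 2.0 ** -161
    n_other = 2 ** 160

    # Equation [1], summed in closed form rather than over 2**160 + 1 terms.
    S = p0 * log2(1 / p0) + n_other * p_other * log2(1 / p_other)
    print(S)               # 81.0 bits: the weighted average per sample

    # Minimum per-symbol unpredictability, set by the most probable symbol.
    print(log2(1 / p0))    # 1.0 bit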
In crypto applications (including RNG construction), a crude yet provably
correct approach is to rely on the minimum (not the average) per-symbol
unpredictability. Using this approach, it would require 80 samples of the
given distribution to produce an output with 80 bits of entropy.
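As a minimal sketch of that accounting, crediting every sample with only
-log2 of the most probable symbol's probability, no matter which symbol
actually arrives:

    from math import ceil, log2

    # Conservative accounting: each sample is credited with only the
    # minimum per-symbol unpredictability, here -log2(1/2) = 1 bit.
    p_max = 0.5
    min_bits_per_sample = log2(1 / p_max)      # 1.0
    print(ceil(80 / min_bits_per_sample))      # 80 samples for an 80-bit output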
Things can only get better from here:
-- With full knowledge of the source statistics, one can distinguish the large
log(1/P_i) words from the small log(1/P_i) words. This would allow the number
of required samples to be closer to the typical value (2) than to the worst-case
value (80). An example of this is the Huffman coding I discussed in my previous
note. (The sketch after this list puts rough numbers on this point and on the
buffering point below.)
-- With even modest knowledge of the given source, one can (by histogramming
if nothing else) estimate the probabilities well enough to yield worthwhile
improvements in efficiency.
-- If you want to produce a long sequence of output words, as you often do, a
reasonable-sized buffer is all you really need to come fairly close to optimal
efficiency, namely only about 1 sample of the distribution, on average, per
80-bit output word. The chance of getting fooled can be made verrry small,
at a modest cost in terms of buffer-size and extra samples of the distribution.
That is, the law of large numbers comes to your rescue sooner or later, usually
rather soon.
-- It may be possible to engineer a different source with a larger minimum
log(1/P_i).
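Putting rough numbers on the first and third points above, here is a
Monte-Carlo sketch, assuming a consumer that credits 1 bit for the symbol 0
and 161 bits for any other symbol (i.e. it can tell them apart, per the
first point):

    import random

    def credited_bits():
        # One draw from the source: 0 with probability 1/2 (credit 1 bit),
        # otherwise one of the 2**160 heavy symbols (credit 161 bits).
        # Only the credit matters here, so the symbol itself is not stored.
        return 1 if random.random() < 0.5 else 161

    def samples_for_one_word(target=80):
        # Samples drawn until at least `target` bits are credited,
        # starting from an empty pool.
        total, n = 0, 0
        while total < target:
            total += credited_bits()
            n += 1
        return n

    trials = 100_000
    print(sum(samples_for_one_word() for _ in range(trials)) / trials)  # ~2.0

    # With a buffer, leftover credit carries over to the next word, so the
    # long-run cost approaches 80/81, i.e. just under 1 sample per word.
    credit, samples, words = 0, 0, 0
    for _ in range(1_000_000):
        credit += credited_bits()
        samples += 1
        while credit >= 80:
            credit -= 80
            words += 1
    print(samples / words)                                              # ~0.99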
Bottom line: Setting up a highly-skewed distribution and then drawing from
it only a single sample is not guaranteed to produce "average" behavior. Duh,
no surprise there!
The source entropy S in equation [1] is a single number that characterizes the
"average" behavior. For a small sample from a highly-skewed distribution, S
doesn't tell you everything you need to know. This has no bearing on the
definition of entropy; entropy is defined by equation [1]. It just means that
equation [1] by itself doesn't solve all the world's problems.
---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com