# Entropy Definition (was Re: passphrases with more than 160 bits of entropy)

John Denker jsd at av8n.com
Sat Mar 25 19:26:51 EST 2006

```In the context of

>> 0 occurs with probability 1/2
>> each other number from 1 to 2^{160} happens with
>> probability 2^{-161}.

I wrote:

> This ... serves to illustrate, in an exaggerated way, the necessity
> of not assuming that the raw data words are IID (independent and identically
> distributed).

Correction:  IID isn't the issue here.  The raw data words described
above are IID.  That's not the problem.  The problem is that the
distribution is highly skewed.

Executive summary:  Small samples do not always exhibit "average" behavior.

Let me explain.  There is a very simple way to look at this.

Consider the expression for the entropy of the source,
S := Sum_i P_i log2(1/P_i)

where i runs over all symbols in the alphabet; the logarithm is base 2,
so S is measured in bits.

One may, with a little care, attribute log(1/P_i) bits of unpredictability
to the (i)th symbol in the alphabet.  Then S can be interpreted as the
appropriately weighted _average_ unpredictability per symbol.
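To make this concrete, here is a small numerical check (a Python sketch;
the distribution is the one quoted at the top of this note):

# Entropy of the skewed source: P(0) = 1/2, and each of the
# 2^160 other symbols has probability 2^-161.
from math import log2

p0 = 0.5
p_other = 2.0 ** -161
n_other = 2 ** 160

# S = Sum_i P_i log2(1/P_i), split over the two kinds of symbol
S = p0 * log2(1 / p0) + n_other * p_other * log2(1 / p_other)
print(S)   # 81.0 bits per symbol, on average

The average surprisal is 81 bits, even though a single draw yields the
symbol 0 (worth only 1 bit) half the time.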

In the example quoted above, the minimum log(1/P_i) is vastly smaller than
the average:  the symbol 0 contributes only 1 bit, every other symbol
contributes 161 bits, and the weighted average S works out to 81 bits.  So
focusing on the average is unlikely to solve all the world's problems.

In crypto applications (including RNG construction), a crude yet provably
correct approach is to rely on the minimum (not the average) per-symbol
unpredictability; this minimum is commonly called the min-entropy.  Using
this approach, it would require 80 samples of the given distribution to
produce an output with 80 bits of entropy.
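In code, the conservative accounting looks like this (a sketch; only the
most probable symbol matters):

# Credit each sample only with the minimum per-symbol surprisal.
from math import log2, ceil

h_min = log2(1 / 0.5)             # most probable symbol has P = 1/2
samples_needed = ceil(80 / h_min)
print(h_min, samples_needed)      # 1.0 bit per sample, 80 samples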

Things can only get better from here:
-- With full knowledge of the source statistics, one can distinguish the large
log(1/P_i) words from the small log(1/P_i) words.  This would allow the number
of required samples to be closer to the typical value (2) than to the worst-case
value (80).  An example of this is the Huffman coding I discussed in my previous
note.
-- With even modest knowledge of the given source, one can (by histogramming
if nothing else) estimate the probabilities well enough to yield worthwhile
improvements in efficiency.
-- If you want to produce a long sequence of output words, as you often do, a
reasonable-sized buffer is all you really need to come fairly close to optimal
efficiency, namely only about 1 sample of the distribution, on average, per
80-bit output word.  The chance of getting fooled can be made verrry small,
at a modest cost in terms of buffer-size and extra samples of the distribution.
That is, the law of large numbers comes to your rescue sooner or later, usually
rather soon.
-- It may be possible to engineer a different source with larger minimum
log(1/P_i).
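The first two points can be illustrated with a quick simulation (a
hypothetical sketch, not a real RNG construction):  given full knowledge
of the source, credit each sample with its actual surprisal and count how
many samples it takes to accumulate 80 bits.

import random

random.seed(1)  # fixed seed so the sketch is reproducible

def samples_for_80_bits():
    """Draw until the accumulated log2(1/P_i) reaches 80 bits."""
    bits = 0.0
    n = 0
    while bits < 80:
        n += 1
        # symbol 0 (P = 1/2) is worth 1 bit; any other symbol is worth 161
        bits += 1 if random.random() < 0.5 else 161
    return n

trials = [samples_for_80_bits() for _ in range(10_000)]
print(sum(trials) / len(trials))   # close to the typical value of 2

Half the time the very first sample is a 161-bit symbol; the worst case of
80 consecutive zeros has probability 2^{-80}, i.e. it is astronomically rare.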

Bottom line:  Setting up a highly-skewed distribution and then drawing from
it only a single sample is not guaranteed to produce "average" behavior.  Duh,
no surprise there!

The source entropy S, as defined by the equation above, is a single number
that characterizes the "average" behavior.  For a small sample from a
highly-skewed distribution, S doesn't tell you everything you need to know.
This has no bearing on the definition of entropy;  entropy is still defined
by that equation.  It just means that the equation by itself doesn't solve
all the world's problems.

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com

```