Entropy Definition (was Re: passphrases with more than 160 bits of entropy)

Sat Mar 25 19:26:51 EST 2006

In the context of

 >> 0 occurs with probability 1/2
 >> each other number from 1 to 2^{160}+1 happens with
 >> probability 2^{-161}.

I wrote:

 > This ... serves to illustrate, in an exaggerated way, the necessity
 > of not assuming that the raw data words are IID (independent and identically
 > distributed).

Correction:  IID isn't the issue here.  The raw data words described
above are IID.  That's not the problem.  The problem is that the
distribution is highly skewed.

Executive summary:  Small samples do not always exhibit "average" behavior.

Let me explain.  There is a very simple way to look at this.

Consider the expression for the entropy of the source,
        S := Sum_i P_i log(1/P_i)                      [1]

where i runs over all symbols in the alphabet.

One may, with a little care, attribute log(1/P_i) bits of unpredictability
to the (i)th symbol in the alphabet.  Then S can be interpreted as the
appropriately weighted _average_ unpredictability per symbol.

In the example quoted above, the minimum log(1/P_i) is vastly smaller than
the average log(1/P_i) -- namely 1 versus 161.  So focussing on the average
is unlikely to solve all the world's problems.

In crypto applications (including RNG construction), a crude yet provably
correct approach is to rely on the minimum (not the average) per-symbol
unpredictability.  Using this approach, it would require 80 samples of the
given distribution to produce an output with 80 bits of entropy.

Things can only get better from here:
  -- With full knowledge of the source statistics, one can distinguish the large
   log(1/Pi) words from the small log(1/Pi) words.  This would allow the number
   of required samples to be closer to the typical value (2) than to the worst-case
   value (80).  An example of this is the Huffman coding I discussed in my previous
   note.
  -- With even modest knowledge of the given source, one can (by histogramming
   if nothing else) estimate the probabilities well enough to yield worthwhile
   improvements in efficiency.
  -- If you want to produce a long sequence of output words, as you often do, a
   reasonable-sized buffer is all you really need to come fairly close to optimal
   efficiency, namely only about 1 sample of the distribution, on average, per
   80-bit output word.  The chance of getting fooled can be made verrry small,
   at a modest cost in terms of buffer-size and extra samples of the distribution.
   That is, the law of large numbers comes to your rescue sooner or later, usually
   rather soon.
  -- It may be possible to engineer a different source with larger minimum
   log(1/Pi).

Bottom line:  Setting up a highly-skewed distribution and then drawing from
it only a single sample is not guaranteed to produce "average" behavior.  Duh,
no surprise there!

The source entropy S in equation [1] is a single number that characterizes the
"average" behavior.  For a small sample from a highly-skewed distribution, S
doesn't tell you everything you need to know.  This has no bearing on the
definition of entropy;  entropy is defined by equation [1].  It just means that
equation [1] by itself doesn't solve all the world's problems.

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com