[Cryptography] A question on Shannon's entropy

John Denker jsd at av8n.com
Wed Mar 2 19:26:27 EST 2016


On 03/02/2016 04:31 PM, mok-kong shen wrote:
> 
> Given two alphabetical sequences A and B of the same length: if A
> has higher entropy than B, then (if I am not mistaken) this is
> commonly interpreted in crypto to imply that A is more random than B.

Well, "cryptography" is a huge field, and "random" is not
strictly defined.  (This stands in contrast to Shannon entropy,
aka plain old entropy, which is very well defined.)

In cryptography, sometimes entropy is exactly what you want, and
sometimes not.  In particular, entropy measures a particular
average property of the distribution.  There are *lots* of other
properties that the distribution might have.  Depending on your
threat model, you might be much more interested in minimax 
properties as opposed to average properties.
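For concreteness, here is a rough numerical sketch of that
distinction (illustrative only; the function names are arbitrary).
It compares Shannon entropy, an average-case measure, against
min-entropy, one common worst-case measure:

  import math

  def shannon_entropy(dist):
      # Average-case: H = -sum(p * log2(p)), in bits.
      return -sum(p * math.log2(p) for p in dist if p > 0)

  def min_entropy(dist):
      # Worst-case: H_min = -log2(max(p)); determined entirely
      # by the attacker's single best guess.
      return -math.log2(max(dist))

  dist = [0.7, 0.1, 0.1, 0.1]       # a skewed distribution
  print(shannon_entropy(dist))      # ~1.357 bits
  print(min_entropy(dist))          # ~0.515 bits

A scheme judged by the average (Shannon) figure can look much
stronger than it actually is against an adversary who always guesses
the most likely value first, which is what the min-entropy figure
captures.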

Also it is crucial to note that strings do not have entropy,
in the same way that numbers do not have entropy.  Entropy is
a property of the /distribution/.  As some guy named John von 
Neumann famously said, there is no such thing as a random number.
 -- If it's random, it's not a number.
 -- If it's a number, it's not random.

You can have a random /distribution/ over numbers (or strings), but
then the randomness is in the distribution, not in any string
that may have been drawn from some such distribution.
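As a toy illustration of that point, consider two processes that can
each emit the string "0000":

  import secrets

  # Process A: four fair coin flips; 4 bits of entropy per output.
  def process_a():
      return ''.join(secrets.choice('01') for _ in range(4))

  # Process B: a constant; 0 bits of entropy.
  def process_b():
      return '0000'

Either process can hand you "0000".  Looking at the string alone
tells you nothing about which process produced it, which is exactly
why the entropy lives in the process, not in the string.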

> Now suppose A has entropy u. Sort A to produce C, such that all a's
> come before all b's, etc. Since the frequencies of the characters
> are not affected by the sorting, C also has entropy u according to
> Shannon's formula. However, C is evidently less random than A. How
> could one resolve this problem? One could perhaps employ pairs
> of characters as the unit and compute a similar measure with the
> formula. But evidently the problem would at best be weakened, not
> fundamentally resolved, that way.

The question is based on multiple misconceptions.  Entropy is
a property of the /process/ that generates the strings.  The
machine that generates sorted strings is different from the
machine that generates unsorted strings.  You can calculate the
entropy from the single-character probabilities (or, loosely 
speaking, the frequencies) /provided/ the characters are IID 
(independent and identically distributed).  The characters in 
the sorted strings are nowhere near IID, so the assertion that
the entropy is unchanged is nowhere near correct.
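You can see this numerically with plug-in estimates (a rough sketch;
the estimator names are arbitrary, and plug-in estimates are only
approximations to the true quantities):

  import math, random
  from collections import Counter

  def h_singles(s):
      # Entropy estimate from single-character frequencies.
      # Valid as an entropy *rate* only if characters are IID.
      n = len(s)
      return -sum((c/n) * math.log2(c/n) for c in Counter(s).values())

  def h_conditional(s):
      # Estimate of H(next char | current char) from adjacent pairs:
      # H(X1,X2) - H(X1).  Sensitive to first-order dependence.
      m = len(s) - 1
      pairs = Counter(zip(s, s[1:]))
      h_pairs = -sum((c/m) * math.log2(c/m) for c in pairs.values())
      return h_pairs - h_singles(s)

  a = ''.join(random.choice('ab') for _ in range(100000))
  c = ''.join(sorted(a))
  print(h_singles(a), h_singles(c))          # both ~1.0 bit/char
  print(h_conditional(a), h_conditional(c))  # ~1.0 vs ~0.0

The single-character formula cannot tell a from its sorted copy c,
but the conditional estimate collapses for c: once you condition on
the previous character, the sorted string is almost perfectly
predictable.  That is the sense in which the sorted-string machine
has far lower entropy, even though the letter frequencies match.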

Here's a paper that touches on some of these issues and can
serve as a good starting point:
  C.E. Shannon
  "Prediction and Entropy of Printed English"
  http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf
