[Cryptography] A question on Shannon's entropy

David Johnston dj at deadhat.com
Thu Mar 3 10:56:04 EST 2016



On 3/3/16 12:02 AM, mok-kong shen wrote:
> Am 03.03.2016 um 04:37 schrieb Thierry Moreau:
>> On 02/03/16 11:31 PM, mok-kong shen wrote:
>>>
>>> Given 2 alphabetical sequences A and B of the same length. If A
>>> has higher entropy than B, then (if I don't err) this is commonly
>>> interpreted in crypto to imply that A is more random than B. Now
>>> suppose A has entropy u. Sort A to result in C such that all a's
>>> are in front of all b's etc. Since the frequencies of the characters
>>> are not affected by the sorting, C has also entropy u according to
>>> Shannon's formula. However C is evidently less random than A. How
>>> could one resolve this problem?
>>
>> Wrong application of Shannon lessons.
>>
>> An entropy measure is tied to a data source that can emit codes without
>> a size limit for the output.
>>
>> In your question A and B are not data sources. If they were to be
>> equated to a data source, the respective codes would have to be defined,
>> e.g. as "alphabetic sequences of length-of-A" for A, and then sorting a
>> sample from A creates a very different data source.
>>
>> Said otherwise, the Shannon entropy of a sequence makes no sense. You
>> must make assumptions about the data source behind the sequence and analyze
>> the data source itself with the Shannon lessons.
>
> But in order to measure the randomness in a source, one has to
> take samples. These samples are necessarily strings of finite
> sizes in practice. Now suppose samples from one source look quite
> random and samples from the other source happen to be (approximately)
> the sorted sequences of the first. In that case I surmise that there
> is yet some problem.
You can't measure randomness by looking at the data from a source. You 
can infer bounds on the randomness (using loose terms here) by 
analyzing the source's generation process, but all you can get from 
looking at the data are statistics. Entropy estimation algorithms are 
just that: estimates.
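
To make that concrete, here is a minimal sketch (Python, function name 
mine) of the frequency-based "plug-in" estimator that such tools 
typically compute. It is purely a statistic of the sample, and it is 
unchanged by any reordering of the symbols:

    from collections import Counter
    from math import log2

    def plugin_entropy(sample):
        """Order-0 'plug-in' Shannon entropy estimate, in bits per symbol.

        This is a statistic of the observed sample only; it says nothing
        about the process that produced it, and it is invariant under any
        permutation (e.g. sorting) of the symbols.
        """
        n = len(sample)
        return -sum((c / n) * log2(c / n) for c in Counter(sample).values())

    # Same symbol frequencies, same estimate -- the ordering is invisible:
    # plugin_entropy("bacab") == plugin_entropy("aabbc")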

If you have a theoretical model of your source giving an entropy 
prediction, a simulation of your circuit giving an entropy prediction, 
and a lower-bound entropy estimate from data taken from the physical 
source, then you want all three to agree, so you can be confident that 
the model is right.

Sorting is a deterministic process applied to the data, and a bad one in 
the sense that it shrinks the space of possible values and reduces the 
entropy rate at the same time. This points to the problem with 
estimating entropy from symbol frequencies: such an estimate is not 
sensitive to correlation, non-stationarity, or algorithmic associations 
between symbols. Sorting introduces both an algorithmic invariant 
(X[i] <= X[i+1] for an ascending sort) and serial correlation (the next 
value is likely to be close to the last). Never take the entropy 
estimate reported by 'ent' at face value; it would not detect the 
difference.
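
As a rough illustration of that blind spot (a sketch only, not what 
'ent' literally computes): compare an order-0 frequency estimate with a 
plug-in estimate of H(X[i+1] | X[i]) on a random sequence and on its 
sorted copy. The per-symbol frequencies, and hence the order-0 number, 
are identical; conditioning on the previous symbol exposes the sort 
immediately:

    import random
    from collections import Counter
    from math import log2

    def h0(seq):
        """Order-0 plug-in entropy (bits/symbol) from symbol frequencies --
        the quantity an 'ent'-style frequency count reports."""
        n = len(seq)
        return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

    def h_cond(seq):
        """Plug-in estimate of H(X[i+1] | X[i]) = H(pairs) - H(X[i]),
        sensitive to the serial correlation that sorting introduces."""
        pairs = list(zip(seq, seq[1:]))
        return h0(pairs) - h0([p[0] for p in pairs])

    random.seed(1)
    a = [random.choice("abcdefgh") for _ in range(100_000)]  # i.i.d. uniform source
    c = sorted(a)                                            # same frequencies, order destroyed

    print(h0(a), h0(c))          # both ~3.0 bits/symbol: frequencies can't tell them apart
    print(h_cond(a), h_cond(c))  # ~3.0 vs ~0.0: the sorted copy is almost perfectly predictable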
