[Cryptography] A question on Shannon's entropy
David Johnston
dj at deadhat.com
Thu Mar 3 10:56:04 EST 2016
On 3/3/16 12:02 AM, mok-kong shen wrote:
> Am 03.03.2016 um 04:37 schrieb Thierry Moreau:
>> On 02/03/16 11:31 PM, mok-kong shen wrote:
>>>
>>> Given 2 alphabetical sequences A and B of the same length. If A
>>> has higher entropy than B, then (if I don't err) this is commonly
>>> interpreted in crypto to imply that A is more random than B. Now
>>> suppose A has entropy u. Sort A to result in C such that all a's
>>> are in front of all b's etc. Since the frequencies of the characters
>>> are not affected by the sorting, C has also entropy u according to
>>> Shannon's formula. However C is evidently less random than A. How
>>> could one resolve this problem?
>>
>> Wrong application of Shannon lessons.
>>
>> An entropy measure is tied to a data source that can emit codes without
>> a size limit for the output.
>>
>> In your question A and B are not data sources. If they were to be
>> equated to a data source, the respective codes would have to be defined,
>> e.g. as "alphabetic sequences of length-of-A" for A, and then sorting a
>> sample from A creates a very different data source.
>>
>> Said otherwise, the Shannon entropy of a sequence makes no sense. You
>> must make assumptions about the data source behind the sequence and
>> analyze the data source itself with the Shannon lessons.
>
> But in order to measure the randomness in a source, one has to
> take samples. These samples are necessarily strings of finite
> sizes in practice. Now suppose samples from one source looks quite
> random and samples from the other source happen to be (approximately)
> the sorted sequences of the first. In that case I surmise that there
> is yet some problem.
You can't measure randomness by looking at the data from a source. You
can infer bounds on the randomness (using loose terms here) by
analyzing the source's generation process, but all you can get from
looking at the data are statistics. Entropy estimation algorithms are
just that - estimations.
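A minimal sketch (mine, not from the thread; the helper name freq_entropy is made up) of why a frequency-based estimate is only a statistic: it assigns the exact same entropy value to a random sequence and to its sorted copy, since sorting permutes symbols without changing the frequency histogram.

```python
import math
import random
from collections import Counter

def freq_entropy(seq):
    """Shannon entropy estimate (bits/symbol) from symbol frequencies alone."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
a = [random.choice("abcdefgh") for _ in range(10000)]  # the "random" sequence A
c = sorted(a)                                          # the sorted sequence C

# Both come out near log2(8) = 3 bits/symbol, and exactly equal to
# each other - the estimate cannot see the ordering at all.
print(freq_entropy(a))
print(freq_entropy(c))
```

This is the resolution of the original question in miniature: the frequency statistic is identical for A and C, so it was never a measure of "randomness of this particular string" in the first place.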
If you have a theoretical model of your source giving an entropy
prediction, a simulation of your circuit giving an entropy prediction,
and a lower-bound entropy estimate from data taken off the physical
source, then you want all three to agree, so you can be confident that
the model is right.
Sorting is a deterministic process applied to the data - a bad one, in
the sense that it shrinks the space of possible values and reduces the
entropy rate at the same time. This points to the problem with
estimating entropy from symbol frequencies: such an estimate is not
sensitive to correlation, non-stationarity or algorithmic associations
between symbols. Sorting introduces both an algorithmic invariant
(X[i] <= X[i+1]) and serial correlation (the next value is likely to
be close to the last). Never use the entropy estimate number from
'ent'. It would not detect the difference.
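One statistic that does see the serial correlation sorting introduces is a first-order conditional entropy estimate, H(next symbol | current symbol). This is my own illustrative sketch (the function name conditional_entropy is invented, and a single-symbol context is of course only the crudest possible correlation test, still blind to longer-range structure):

```python
import math
import random
from collections import Counter

def conditional_entropy(seq):
    """Estimate H(X[i+1] | X[i]) in bits/symbol from adjacent pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))  # (current, next) pair frequencies
    ctx_counts = Counter(seq[:-1])            # frequencies of the "current" context
    n = len(seq) - 1
    h = 0.0
    for (x, y), count in pair_counts.items():
        p_xy = count / n                      # joint probability of the pair
        p_y_given_x = count / ctx_counts[x]   # conditional probability of next
        h -= p_xy * math.log2(p_y_given_x)
    return h

random.seed(1)
a = [random.choice("abcdefgh") for _ in range(10000)]
c = sorted(a)

# For the i.i.d. data the estimate stays near 3 bits/symbol; for the
# sorted copy it collapses toward 0, because the next symbol is almost
# completely determined by the current one.
print(conditional_entropy(a))
print(conditional_entropy(c))
```

The point is not that this particular statistic is the right entropy estimator - it is still just a statistic on the data - but that a frequency-only number like ent's is the weakest member of the family.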
More information about the cryptography mailing list