[Cryptography] quantum computers & crypto

Sun Nov 7 07:18:38 EST 2021

>> The general guidance that I read is "don't do that, because compressing
>> different clear texts results in different compressed lengths, and the
>> adversary can use the length to guess the message." But then, by definition,
>> compression reduces the entropy of the compressed plain text, which makes the
>> heuristics that you describe here harder. So, what gives?
> 
> "Don't compress before encryption" applies to specific very carefully-chosen
> examples for conference papers (CRIME, BREACH, etc).  "Compress before
> encryption" applies to real life.
There are real-world counterexamples.  A paper quite some time ago showed that encrypted, compressed speech could be read with reasonable accuracy just by looking at the lengths of blocks in streams of compressed packets.  Similarly, determining which of the top 1000, say, most popular web sites is being browsed in an encrypted session has been shown to be pretty easy - again based on the lengths of the messages being exchanged.

I've never seen anyone claiming to do this, but it should be possible to look at message exchanges on things like Signal and see if anyone is repeating message streams that appeared in the clear elsewhere.  Or one could correlate speakers who are repeating each other, even when the messages themselves can't be read.  This could be significant risk for people in certain parts of the world.

The techniques would probably work even better today because of improvements in machine learning, which is ideally suited to pick up such correlations.

These attacks work even if the underlying cipher has very strong security properties.  In fact, they even work against one-time pads.  The problem is in defining what semantic properties of the underlying plaintext are relevant.  If you look at a single message, an ideal cryptosystem makes the attackers equivocation about the plaintext equal with or without access to the ciphertext.  But in a situation where multiple messages are exchanged in a fixed sequence, you need to look not just at the individual messages, but at the entire stream.  And a system that leaks plaintext lengths fails if the distribution of plaintext lengths carries significant semantic information - as is true for speech, web sites, and perhaps encrypted message exchanges.

The fix for this is of course well known:  Padding.  If you know the entropy of the information carried by the message lengths, you could even compute the minimum amount of equivocation the padding has to introduce to deny an attacker any advantage.

I don't know of any published system that's attacked this problem, though the literature is now so large that there's probably something out there.

I'm pretty sure, however, that NSA - because of its long-standing interest in stream ciphers, which can leak bit-level lengths and thus much more information than the block-level lengths of most of today's systems; not to mention a focus on traffic analysis in general - has done significant work in this area.

                                                        -- Jerry