[Cryptography] Compression before encryption?

Fri Jan 9 17:06:21 EST 2015

On Jan 9, 2015, at 2:22 PM, Roland C. Dowdeswell <elric at imrryr.org> wrote:
>> I have come across the recommendation to "compress before you encrypt", on the grounds that this makes plaintext recognition through frequency analysis much harder.
> You need to be careful as compression can expose certain kinds of
> "chosen plaintext" attacks.  Basically, if you can insert chosen
> plaintext early in the compressed stream then it affects the size
> of the resultant compressed stream in predictable ways that give
> you insight into what the rest of the stream contains.
Many others have made this comment.  But it's by no means the whole story.  Encrypted voice channels have been successfully attacked by looking at the lengths of encrypted packets.  Certain patterns of lengths are associated with particular sequences of phonemes.  This lets you read off the likely sound sequences even if you can't recover the details of the sounds actually being transmitted.

The real problem here is that the underlying voice encoders are, in a sense, too good.  They are designed to use as little channel capacity as possible, so they have short encodings for common combinations of sounds, which they can then transmit much faster than the time required to actually utter those sounds.  So there are gaps in the transmitted bitstream, which turn into gaps in the encrypted bitstream, which allows the lengths to be readily determined - and those are enough.

This is not a known-plaintext attack, but it's part of the broad class of probable-plaintext attacks - a very successful member of that class.  But it's also essentially a side-channel attack:  If you use the traditional model of encryption, where a vector of plaintext bits is mapped to a vector of cyphertext bits, there's no place to represent the timing information - so the system can look completely secure.

Note that there are attacks of a similar sort against interactive login sessions and such.  They tend not to have quite as much information available, and are not quite as successful as the voice attacks.

One way to look at the lesson here is that the mathematics makes sense if the only information available to the attacker is the encrypted stream of bits, and any departure from that - when the departure is correlated to the plaintext - provides a potential attack.  For voice, what you probably want to do is send data at a fixed rate, regardless of the sound being encrypted.  A simple A/D converter, with no fancy speech processing, gives you that.  The fancier your vocoder, the more you'll end up leaking - unless you backfill with random bits until you can get an essentially constant rate.
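
To make the "backfill with random bits" idea concrete, here is a minimal Python sketch.  It is only an illustration under my own assumptions:  the 64-byte frame size, the 2-byte length prefix, and the names pad_frame and unpad_frame are all hypothetical and not taken from any real vocoder or protocol.  Every frame handed to the cipher has the same size, so the encrypted lengths no longer depend on what was said.

```python
import os
import struct

FRAME_SIZE = 64  # hypothetical fixed frame size in bytes (an assumption)

def pad_frame(payload: bytes, frame_size: int = FRAME_SIZE) -> bytes:
    """Build a frame of exactly frame_size bytes:
    2-byte big-endian payload length, the payload, then random fill."""
    if len(payload) > frame_size - 2:
        raise ValueError("payload too large for one frame")
    fill = os.urandom(frame_size - 2 - len(payload))
    return struct.pack(">H", len(payload)) + payload + fill

def unpad_frame(frame: bytes) -> bytes:
    """Recover the original payload from a padded frame."""
    (n,) = struct.unpack(">H", frame[:2])
    return frame[2:2 + n]

# A short and a long encoder output produce frames of identical size.
short_frame = pad_frame(b"\x01\x02")
long_frame = pad_frame(b"\x07" * 40)
```

The padded frames would then be encrypted and sent at a fixed rate.  Padding hides the length only;  the frames must also leave on a fixed schedule, or the timing leaks the same information.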

Applying these arguments in other situations is complicated.  A single, simple bulk transfer of data might not be affected by compression - but if you imagine a situation where I transfer a file of the same underlying length each day, but compress before encryption, I'm leaking some semantic information, because the length of my transmission says something about the compressibility of the plaintext.
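
A small Python sketch of the daily-file case, using zlib from the standard library.  The two sample plaintexts and the name compressed_len are made up for illustration.  No cipher is included;  the assumption is a cipher that preserves length (a stream cipher or CTR mode), or adds only a fixed overhead, so the ciphertext length an eavesdropper sees tracks the compressed length.

```python
import os
import zlib

def compressed_len(plaintext: bytes) -> int:
    """Length after compression; with a length-preserving cipher this is
    also (up to fixed overhead) the length of the ciphertext on the wire."""
    return len(zlib.compress(plaintext))

# Two hypothetical daily files with the same underlying length.
quiet_day = b"status: nothing to report\n" * 400   # highly repetitive
busy_day = os.urandom(len(quiet_day))              # essentially incompressible

quiet_len = compressed_len(quiet_day)
busy_len = compressed_len(busy_day)

print(len(quiet_day), len(busy_day))  # identical plaintext lengths
print(quiet_len, busy_len)            # very different lengths on the wire
```

Without compression, both days would produce transmissions of the same size, and the observer would learn nothing from the length.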

                                                        -- Jerry