[Cryptography] Reading encrypted generative AI chats

Mon Mar 18 14:31:43 EDT 2024

On 3/17/2024 3:55 PM, Kent Borg wrote:
> On 3/16/24 06:22, Jerry Leichter wrote:
>> https://arstechnica.com/security/2024/03/hackers-can-read-private-ai-assistant-chats-even-though-theyre-encrypted/  describes an attack that reads - actually, guesses with good accuracy - the responses of generative AI programs, even though they are sent through a TLS connection.
>>
>> […]
>>
>> The attack is described (probably even by the authors) as a 
>> side-channel attack.  This is *wrong*, and it's exactly the kind of 
>> thinking that leads us to overlook these attacks repeatedly.  In the 
>> ideal world of math, an encryption algorithm simply maps strings to 
>> strings.  That's probably a reasonable description for encrypted data 
>> at rest, but it's dead wrong for what's probably the majority of 
>> encryption usage today:  Encryption of streams of data in segments 
>> delivered over time.  The lengths of those segments aren't "side 
>> channel" information - they are part of the data being transmitted.  
>> Those lengths, in many protocols, may even appear explicitly inside 
>> the data being transmitted.
>>
>> Attacks based on characterizing sequences of lengths should be seen as 
>> akin to dictionary attacks.  Just as we expect cryptosystems today to 
>> resist dictionary attacks (by adding randomness to the encryption thus 
>> avoiding encrypting the same data repeatedly), we should expect them 
>> to resist attacks against segment lengths.
> 
> Hasn't this been called "traffic analysis", since WW II?
> 
> A few years ago I put a lot of work (and thought) into building a 
> prototype of a system with end-to-end encryption and I considered this 
> question. It was a system that reported camera data, from cameras that 
> are usually idle. I certainly considered—and immediately discarded—the 
> idea of sending a continuous stream of data as impractical. But I 
> didn't ignore the issue completely. I aggregated and padded content out 
> to various coarse fixed size boundaries before encrypting and sending. I 
> do admit that I did not fuzz my padding to variable boundaries. (I 
> think, this was a long time ago.) But I also never got past a 
> demonstration prototype.
> 
> An observer of my system would certainly be able to /easily/ measure 
> activity by watching my data flows, but so could an observer measuring 
> electricity usage, water or natural gas usage, lights visible around 
> edges of window shades, pizza deliveries, etc. I considered my 
> vulnerability a known vulnerability, knowingly chosen, something to be 
> disclosed. But my padding was coarse enough that traffic analysis would 
> reveal nothing specific enough to be considered a picture.

Traffic fingerprinting is a real tool. Classifiers are trained with machine 
learning on traffic patterns such as the size and timing of messages. 
It is possible to recognize which web site is accessed, or which video is 
being watched, even if the traffic is encrypted. See for example 
"Network-Based Website Fingerprinting" 
(https://www.ietf.org/archive/id/draft-irtf-pearg-website-fingerprinting-01.html), 
or this Cisco blog post about building TLS fingerprinting into their network 
monitoring products 
(https://blogs.cisco.com/security/tls-fingerprinting-in-the-real-world).
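
To make the length leak concrete, here is a minimal sketch in Python. It 
performs no real encryption. It assumes that each streamed token goes out 
in its own record and that the cipher adds a fixed per-record overhead; 
the 16-byte figure and the function names are hypothetical, chosen only 
for illustration.

```python
# Illustration only: no real encryption is performed.
# Assumption: one token per record, and a fixed per-record overhead
# (16 bytes here is a hypothetical value, not taken from any protocol).
RECORD_OVERHEAD = 16


def record_sizes(tokens):
    """Sizes an on-path observer would see, one per encrypted record."""
    return [len(t.encode("utf-8")) + RECORD_OVERHEAD for t in tokens]


def inferred_token_lengths(sizes):
    """What the observer recovers without holding any key."""
    return [s - RECORD_OVERHEAD for s in sizes]


tokens = ["The", " attack", " reads", " lengths"]
sizes = record_sizes(tokens)
print(sizes)                          # sizes visible on the wire
print(inferred_token_lengths(sizes))  # token lengths, recovered exactly
```

Under those assumptions the sequence of token lengths is visible to anyone 
on the path, and that sequence is the kind of input a fingerprinting 
classifier works from.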

Defense is hard. The first step is indeed to avoid providing too much 
information through packet lengths, by standardizing to just a few 
lengths. The next step is to inject chaff to try to break analyses of 
packet timing. This is useful, but you need very large amounts of chaff 
to break modern machine-learned algorithms, and such large amounts are 
not practical.
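
The first step, standardizing to a few lengths, can be sketched as 
follows. This is only an illustration under stated assumptions: the bucket 
sizes, the 2-byte length prefix and the function names are hypothetical, 
and the padding has to be applied before encryption so that the ciphertext 
length reveals only the bucket.

```python
# Hypothetical bucket sizes; a real system would pick them from its own
# traffic profile.
BUCKETS = (256, 1024, 4096)
PREFIX_LEN = 2  # bytes used to record the true payload length


def pad_to_bucket(payload: bytes, buckets=BUCKETS) -> bytes:
    """Pad payload (plus length prefix) up to the smallest bucket that fits."""
    framed_len = len(payload) + PREFIX_LEN
    for size in sorted(buckets):
        if framed_len <= size:
            prefix = len(payload).to_bytes(PREFIX_LEN, "big")
            return prefix + payload + b"\x00" * (size - framed_len)
    raise ValueError("payload exceeds largest bucket; split it first")


def unpad(padded: bytes) -> bytes:
    """Recover the original payload on the receiving side, after decryption."""
    n = int.from_bytes(padded[:PREFIX_LEN], "big")
    return padded[PREFIX_LEN:PREFIX_LEN + n]


msg = pad_to_bucket(b"camera idle")
print(len(msg))    # one of the bucket sizes, whatever the payload length
print(unpad(msg))  # original payload
```

With this scheme an observer learns only which of the three buckets a 
message fell into. It does nothing about timing, which is where the chaff 
problem starts.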

I wish there were more research on this subject. We probably need new 
tools. For example, multipath transport protocols could allow splitting 
the traffic on multiple paths, so that none of them exposes something 
that can be recognized. Similar obfuscation could be achieved by sending 
segments of traffic through onion routing. But overall this is an area 
where the defense of privacy is not winning.

-- Christian Huitema