[Cryptography] Reading encrypted generative AI chats
Jon Callas
jon at callas.org
Mon Mar 18 15:14:41 EDT 2024
> On Mar 16, 2024, at 06:22, Jerry Leichter <leichter at lrw.com> wrote:
>
> We've seen this before! Well over a decade ago, there were demonstrated attacks against encrypted, compressed voice, based on exactly the corresponding properties. A similar attack does really well at determining which sites someone is browsing even if all of them are https sites, or even through a VPN in some cases: Pages on the Web today are constructed from many components and link to other pages on the same site. The lengths of page and page-segment downloads are highly descriptive of the actual sites being browsed.
It's even more than that. There's long-standing research on de-multiplexing VPNs, and it's relatively easy to do with just packet timing information and some simple math. As it turns out, it's even easier to do this on constant-traffic VPNs, because the better you are at keeping traffic constant, the more the small leaks stand out from the noise. To Kent Borg, who hit return before I did -- yeah, you're absolutely right, this is traffic analysis. It was invented in WWI, and the general mechanisms are the same today, far more than that era's cryptology applies today.
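To make that concrete, here's a toy sketch -- not any real tool or published attack, and the 20 ms tick and 2 ms threshold below are made-up parameters -- of why the constant-rate case is easier: once the baseline is a fixed schedule, jitter against that schedule is the whole signal.

    # Toy traffic-analysis sketch (illustrative only, not a real attack tool).
    # Assumption: a "constant-rate" tunnel emits one packet every PERIOD
    # seconds, and we observe packet timestamps on the wire.  Deviations from
    # the fixed schedule (jitter introduced when real payload has to be
    # fetched, queued, or flushed) are the small leaks that stand out once
    # the baseline is constant.

    PERIOD = 0.020  # nominal 20 ms tick; an assumed parameter

    def schedule_residuals(timestamps):
        """Return (index, residual) for each packet, measured against an
        ideal constant-rate schedule anchored at the first packet."""
        t0 = timestamps[0]
        out = []
        for i, t in enumerate(timestamps):
            ideal = t0 + i * PERIOD
            out.append((i, t - ideal))
        return out

    def suspicious_ticks(timestamps, threshold=0.002):
        """Ticks whose residual exceeds the threshold -- i.e., where the
        tunnel visibly reacted to something other than its own clock."""
        return [i for i, r in schedule_residuals(timestamps) if abs(r) > threshold]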
This specific mechanism raises the same issue as peering through redactions. If the redaction is done at the word level, you can guess which words fit which sizes. Mitigations include using fixed-width fonts, blacking out entire sentences or paragraphs, and so on. These are all analogues of the same issue.
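As a toy illustration of the guessing (the per-character widths below are invented, not any real font's metrics):

    # Toy sketch of "peering through redactions" (illustrative only).
    # Assumption: we know the font's per-character advance widths (the table
    # below is invented for the example) and the pixel width of the black
    # box.  With a proportional font, few candidate words fit the measured
    # width; a fixed-width font or whole-sentence redaction removes that
    # signal.

    CHAR_WIDTH = {c: 7 for c in "abcdefghijklmnopqrstuvwxyz"}  # invented widths
    CHAR_WIDTH.update({'i': 3, 'l': 3, 'j': 4, 'm': 11, 'w': 11, 'f': 4, 't': 5})

    def rendered_width(word):
        return sum(CHAR_WIDTH[c] for c in word.lower())

    def candidates(redaction_width, dictionary, slop=1):
        """Dictionary words whose rendered width matches the redaction,
        give or take `slop` pixels."""
        return [w for w in dictionary
                if abs(rendered_width(w) - redaction_width) <= slop]

    # Prints ['treaty'] with these invented widths.
    print(candidates(38, ["cat", "missile", "treaty", "budget", "memo"]))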
>
> The attack is described (probably even by the authors) as a side-channel attack. This is *wrong*, and it's exactly the kind of thinking that leads us to overlook these attacks repeatedly. In the ideal world of math, an encryption algorithm simply maps strings to strings. That's probably a reasonable description for encrypted data at rest, but it's dead wrong for what's probably the majority of encryption usage today: Encryption of streams of data in segments delivered over time. The lengths of those segments aren't "side channel" information - they are part of the data being transmitted. Those lengths, in many protocols, may even appear explicitly inside the data being transmitted.
[...]
I guess?
You are certainly not wrong, it's just that sometimes it's really hard to deal with these things. I think I see two major cases here, and many minor ones.
The major ones are layering and the problem of metadata.
A lot of what we're seeing here is a layering issue. They should bundle up the tokens in a reply and send the whole reply back. It's possible, though, that what they want is high responsiveness, and they think it's cool to have the chatbot emulate a person who is typing by sending small blobs. (Remember line-oriented vs. character-oriented command lines? LAT or 3270, same issues.) I would not be surprised if the proximate, underlying issue is that someone used some web APIs that say they'll make it all secure, and the developers decided not to roll their own crypto, 'cause that's bad, right?
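Here's a minimal sketch of both fixes -- the function names and the 128-byte bucket are hypothetical, not anyone's actual API: either buffer the whole reply, or pad every streamed chunk to a fixed size so ciphertext lengths stop tracking token lengths.

    # Minimal sketch of the layering fix (hypothetical names, not any
    # particular vendor's API).

    BUCKET = 128  # bytes per padded chunk; an assumed parameter

    def pad_chunk(text):
        """Pad a streamed chunk to a fixed-size bucket before encryption.
        The receiver strips the trailing NULs after decryption."""
        data = text.encode("utf-8")
        if len(data) > BUCKET:
            raise ValueError("chunk larger than bucket; split it upstream")
        return data + b"\x00" * (BUCKET - len(data))

    def send_reply(tokens, send_encrypted):
        """Option 1: give up per-token streaming; send the reply as one blob."""
        send_encrypted("".join(tokens).encode("utf-8"))

    def stream_reply(tokens, send_encrypted):
        """Option 2: keep streaming, but every wire message is the same size."""
        for tok in tokens:
            send_encrypted(pad_chunk(tok))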
A lot of what's in here is also the problem of metadata. What even is metadata as opposed to data? Even in restricted subcases, this is really hard. Consider phone calls, where there are different types of legal requests -- metadata, also called "non-content" (like the phone number), and content (the actual call) -- each with different standards for the request. It's easier to ask for metadata than data.
There's a wonderful paper by Bellovin, Blaze, Landau, and Pell on this issue. It covers phone calls only, trying to distinguish content from non-content. Their paper is 99 pages, going into all the issues, and part of what they say is that there are plenty of places where non-content is shipped as content and content is shipped as non-content. The title of the paper? "It's Too Complicated: How the Internet Upends Katz, Smith, and Electronic Surveillance Law."
Thus, I'm nodding along with you while observing that it's indeed complicated even if not too complicated.
As another anecdote, we once built an encrypting file server, and one of its features was that it enabled sysadmins to do full backups, restores, and so on without revealing metadata, including file names and file sizes. In our description of it, we talked about side channels and so on, and noted that while there was a lot we could do, timing observations and the like could often undo whatever we did, and that this was not fixable because of layering. Example: there's a megabyte file written out in 4K chunks. If it's written out sequentially, as a stream, then by timing you can put the apparently-anonymous chunks back together into a file by looking at the creation times on the chunks. Yeah, there are plenty of cases where other side effects blur this, but an informed, interested observer can learn more than one might think they could.
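A toy sketch of that observation, assuming the observer sees only opaque chunk IDs and creation times (the 50 ms gap threshold is an invented parameter):

    # Toy sketch of the timing observation (illustrative only).
    # Assumption: the encrypted store hides names and contents, but an
    # observer can see each 4K chunk's creation time.  Chunks written
    # back-to-back group into runs, and a run of ~256 chunks is very likely
    # one ~1 MB file written sequentially.

    def group_into_runs(chunks, max_gap=0.050):
        """chunks: list of (chunk_id, creation_time).  Returns lists of
        chunk IDs whose creation times form a tight sequential run."""
        if not chunks:
            return []
        chunks = sorted(chunks, key=lambda c: c[1])
        runs, current = [], [chunks[0][0]]
        for (_, prev_t), (cid, t) in zip(chunks, chunks[1:]):
            if t - prev_t <= max_gap:
                current.append(cid)
            else:
                runs.append(current)
                current = [cid]
        runs.append(current)
        return runs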
Jon