[Cryptography] Apple Group FaceTime and end-to-end encryption

Wed Jun 6 06:51:45 EDT 2018

Apple's FaceTime uses end-to-end encryption, with Apple's servers unable to decrypt the message streams.  FaceTime has also so far supported only one-to-one connections, for which it's been criticized.  The explanation I had heard - and have given - is that effective group communication requires fairly sophisticated *mixing* of the data streams from all the participants in a group connection.  At the least, you need to balance the different audio streams to a consistent volume level, and typically you want to do some kind of focus, in which the participant providing the loudest audio is emphasized somewhat over the others.  You also want to suppress noise if possible.  This requires quite a bit of computation, and has traditionally been the domain of the server.
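To make the mixing steps concrete, here is a minimal sketch of what a mixer might do with one frame of decoded audio per participant: gate out near-silent streams, scale the rest to a common level, and boost the loudest one.  It is only an illustration of the general idea, not how FaceTime or any particular conferencing server works; the function names and the values of `target_rms`, `focus_gain` and `noise_floor` are assumptions chosen for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of one frame of samples (floats in [-1, 1])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_frames(frames, target_rms=0.1, focus_gain=1.5, noise_floor=0.005):
    """Mix one equal-length frame per participant into a single frame.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    if not frames:
        return []
    levels = [rms(f) for f in frames]

    # Crude noise suppression: ignore streams below the noise floor.
    active = [i for i, level in enumerate(levels) if level >= noise_floor]
    mixed = [0.0] * len(frames[0])
    if not active:
        return mixed

    # "Focus": the loudest active stream gets an extra boost.
    loudest = max(active, key=lambda i: levels[i])

    for i in active:
        gain = target_rms / levels[i]  # balance to a consistent level
        if i == loudest:
            gain *= focus_gain
        for n, sample in enumerate(frames[i]):
            mixed[n] += gain * sample

    # Clip so the sum stays in the valid sample range.
    return [max(-1.0, min(1.0, s)) for s in mixed]


if __name__ == "__main__":
    loud = [0.5, -0.5, 0.5, -0.5]
    quiet = [0.05, -0.05, 0.05, -0.05]
    hiss = [0.001, -0.001, 0.001, -0.001]
    print(mix_frames([loud, quiet, hiss]))
```

Note that every step operates on decoded samples, which is exactly why a server doing this needs the plaintext audio.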

Video is typically not mixed, but streams are selected and composited.

But ... to do this sort of stuff in the server, the server needs access to the actual audio and video, not just the encrypted streams.

Now Apple has announced that the next release of iOS will support group connections with up to 32 members.  I'm wondering how they plan to do that.

There are two components to the problem:

1.  Doing the necessary computations.  Recent iPhones certainly have the necessary compute power to handle mixing of 32 audio streams.  They also have GPUs that should be able to handle the compositing and other video processing.
2.  But ... you can only do computation *on data you actually receive*.  Sending all 31 audio streams to all phones at all times seems plausible.  Sending thirty-one full video streams seems unrealistic.
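A back-of-the-envelope calculation shows why audio and video fall on opposite sides of that line.  The bitrates below are purely assumed round numbers for illustration (they are not figures from Apple), and `downlink_kbps` is a hypothetical helper.

```python
# Assumed, illustrative per-stream bitrates in kbit/s (not Apple's figures).
AUDIO_KBPS = 32
VIDEO_FULL_KBPS = 1000


def downlink_kbps(participants, with_video):
    """Total rate one phone must receive if every other participant's
    stream is forwarded to it unmodified."""
    others = participants - 1
    per_stream = AUDIO_KBPS + (VIDEO_FULL_KBPS if with_video else 0)
    return others * per_stream


if __name__ == "__main__":
    print("audio only:", downlink_kbps(32, with_video=False), "kbit/s")
    print("audio + full video:", downlink_kbps(32, with_video=True), "kbit/s")
```

Under these assumptions, 31 audio streams come to roughly 1 Mbit/s, while 31 full video streams come to over 30 Mbit/s per phone, before any decoding cost is considered.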

I can imagine various tricks in which phones send servers some basic metadata - e.g., average loudness - along with both a full and a low-data-rate video stream.  The server could then pick the loudest audio stream and forward high-data-rate video for that participant and low-data-rate video for the rest.  That does, of course, leak some metadata to the server.  (Then again, given that the audio almost certainly uses some kind of delta encoding, the rate of the encrypted audio is probably already leaking information about the rate of change of the audio, and louder audio is typically more changeable, so it could probably already be identified.)
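A minimal sketch of that kind of selection, assuming each phone reports a cleartext loudness value alongside its encrypted streams.  The `StreamInfo` type and `plan_forwarding` function are hypothetical names for illustration; this is speculation about one possible design, not a description of what Apple does.

```python
from dataclasses import dataclass


@dataclass
class StreamInfo:
    sender: str
    loudness: float  # self-reported average loudness (cleartext metadata)


def plan_forwarding(streams, receiver):
    """For one receiver, choose which encrypted video variant of each other
    sender to forward: "high" for the loudest sender, "low" for the rest.

    The server looks only at the loudness metadata, never at media content.
    """
    others = [s for s in streams if s.sender != receiver]
    if not others:
        return {}
    focus = max(others, key=lambda s: s.loudness).sender
    return {s.sender: ("high" if s.sender == focus else "low") for s in others}


if __name__ == "__main__":
    call = [
        StreamInfo("alice", 0.2),
        StreamInfo("bob", 0.7),
        StreamInfo("carol", 0.1),
    ]
    print(plan_forwarding(call, "alice"))
```

The server here acts purely as a forwarder that picks among opaque encrypted streams; the only thing it learns beyond traffic patterns is the loudness value each phone chooses to report.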

Anyone know anything yet about how Apple is doing this?  (One would think they would have patented the techniques needed....)
                                                        -- Jerry