[Cryptography] SW requirements to block timing side-channel attacks ??

Jerry Leichter leichter at lrw.com
Mon Jul 17 10:31:06 EDT 2017


> Ok, suppose we take a cue from the design of some microcodes, which have a "time duration" field to specify how many minor clock cycles this microcode instruction should take to allow all the buses to "settle" before moving on.
> 
> We can incorporate such a *duration* parameter in each call to a crypto routine; the crypto routine has to process this parameter *first*, so as not to allow the contents of any other parameter to affect overall timing.  The CPU's real-time clock is read, and the parameter is added to the value of the real-time clock to set an alarm clock to indicate when the crypto routine is to return.  The crypto routine is executed, and a barrier synchronization then waits for the previously computed alarm clock time to arrive before the crypto routine is allowed to return....
And right here the idea breaks down: it assumes that whatever you use to waste time until the deadline, should you finish early, will be externally indistinguishable from the actual cryptographic computation.

I don't think people appreciate just how complex modern hardware is.  Here's something I only recently learned:  The last two (three?) generations of Intel chips have non-uniform memory access *on chip*.  When you look at the chip, you see some number of cores and some amount of memory, and they all look the same.  But on the chip, the cores are grouped into blocks, each with some fraction of the memory, and the blocks sit on an on-chip ring.  (Early versions used a uni-directional ring; later versions a bi-directional ring.)  So memory latency depends on whether you are accessing memory in the same block as the current core or memory in some other block - and if in some other block, how far away it is along the ring.  In fact, the difference (viewed at a high level) comes to over 30% (after caches are considered); if you look just at raw access times, the difference is much greater.  Also, there's a distributed cache-coherence algorithm *running inside the chip itself* - so you can probe what other cores have cached by watching cache hits and misses on your own.

Oh, and BTW, Skylake-SP replaces the ring with a mesh interconnect - whose detailed performance characteristics no one outside of Intel yet knows.

On top of all this, OSes run complex algorithms that try to optimize usage in the face of all this hardware complexity - e.g., trying to allocate memory in the same block as the core running the thread that allocated it (which can actually fail badly if the application allocates memory in one thread - e.g., a UI responder - and promptly passes it off to another).

Given this level of complexity, I think the notion of "really secure" cryptography running on general-purpose hardware is nonsense - if you define "really secure" as safe against all kinds of side-channel attacks by NSA and its friends - not to mention attacks through holes deliberately or accidentally introduced in the huge amounts of microcode that runs all this stuff.

Either you define your security needs at a lower level, or you build your own hardware from the ground up.  Design your system so that "red" data flows only through your blessed hardware; only allow general purpose hardware to see "black" data.
                                                        -- Jerry


