[Cryptography] RISC-V branch predicting

Arnold Reinhold agr at me.com
Wed Feb 7 18:54:27 EST 2018


On Feb 6, 2018, at 6:53 PM, John Levine <johnl at iecc.com> replied to my post:
> 
> In article <8FAA204A-D24D-48B4-AFD3-A7692C416846 at me.com> you write:
>> Given the apparently unbounded Spectre security risks presented by current branch predictors, maybe it is time to reconsider the
>> division of labor between clever CPU hardware and the software development tool chain. In most cases the programmer knows which
>> way branches are likely to go, optimizing compilers can make good guesses and profiling tools should be very effective if given
>> realistic data. I suspect that most execution time in software people use these days is spent in libraries that can be highly
>> tuned. Is there any research that says the additional gain from sophisticated hardware branch prediction is significant in
>> practice? Is it significant enough to justify the security problems? 
> 
> Back when VLIW was young, the Multiflow compiler could use runtime
> stats from previous runs when recompiling.  Since VLIW is all about
> speculative execution, it made a lot of difference.  But in practice
> people didn't like it, didn't want to do it, and voted with their feet
> to move to machines where you didn't have to do that.
> 
> It is my impression that it is hard enough to get adequate test cases
> to test the correctness of libraries.  If we were supposed to get test
> cases that weren't just minimal exercises of the code but realistic
> uses, how likely is it that people would really do that?  For
> libraries doing heavy computation with runs of hours or days, sure,
> but how about something like crypto libraries?  Maybe certificate
> handshakes and channel encryption take enough time to be worth
> profiling and optimizing, but who's going to collect the data and pay
> people to do it?

I have no doubt that people would rather let the hardware do the work if given the option. But if the goal is strong security, there may not be a choice.

As I understand it, test cases for correctness are hard because of the need to exercise all paths in the code to look for edge cases. Performance tests should be easier to acquire by sampling real-world usage. Indeed, I don’t see how performance-critical libraries can be developed in the first place without good test cases. Certainly problems like face recognition, speech-to-text, or machine learning are driven by standard corpora of test files. There is undoubtedly a Pareto distribution of libraries: only a fraction of them matter for noticeable performance, so libraries can be optimized over time, hopefully starting with the most impactful ones. Crypto libraries in particular are (or should be) carefully analyzed, not just for optimal performance but to minimize the risk of side-channel timing attacks.
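
To make the timing point concrete, here is a minimal sketch (in C++, with function names of my own invention, not taken from any particular library) of the branch-free, constant-time style that crypto libraries already favor, precisely so that execution time and branch behavior do not depend on secret data:

    #include <cstdint>
    #include <cstddef>

    // Select between a and b without a secret-dependent branch, so neither
    // the branch predictor nor the timing depends on the condition.
    static inline uint32_t ct_select(uint32_t condition, uint32_t a, uint32_t b) {
        // mask is all-ones if condition != 0, all-zeros otherwise
        uint32_t mask = static_cast<uint32_t>(0) - (condition != 0);
        return (a & mask) | (b & ~mask);
    }

    // Constant-time comparison of two equal-length buffers (e.g. MAC tags):
    // the loop always runs to the end, regardless of where a mismatch occurs.
    bool ct_equal(const uint8_t *x, const uint8_t *y, size_t n) {
        uint8_t diff = 0;
        for (size_t i = 0; i < n; ++i)
            diff |= static_cast<uint8_t>(x[i] ^ y[i]);
        return diff == 0;
    }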

> On Feb 7, 2018, at 12:43 PM, Nemo <nemo at self-evident.org> wrote:
> 
> Arnold Reinhold <agr at me.com> writes:
> 
>> In most cases the programmer knows which way branches are
>> likely to go, optimizing compilers can make good guesses and profiling
>> tools should be very effective if given realistic data.
> 
> Wrong on every count. For example:
> 
>    https://stackoverflow.com/q/11227809/
> 
> Granted, that is just a 20K-upvote StackOverflow question. But it is
> extremely common for branch patterns not to be determined until
> runtime. Also note that the performance delta in this example is 6x.

I read the StackOverflow thread and, with all due respect, I think it supports what I am saying. The 6x performance delta example involves summing some elements of a large array of random numbers based on a conditional test. The sum runs 6x faster if the array is sorted beforehand. The explanation given is that with the sorted array, the sum’s conditional test branches more predictably. It’s a very simple example, but if anything the unsorted case is more representative of real-world situations. The thread then goes on to discuss how several programming techniques, such as conditional moves and table lookup, can make the unsorted sum almost as fast as the sorted sum, and suggests that several compilers already make such optimizations. I think this supports my notion that most of the benefits of highly complex branch prediction hardware can be realized by careful programming along with an improved tool chain.
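
For concreteness, the two styles discussed in that thread look roughly like this (a C++ sketch of my own, not the exact code from the question):

    #include <cstdint>
    #include <vector>

    // Branchy version: on unsorted random data the condition is taken about
    // half the time and mispredicts heavily; on sorted data it becomes
    // predictable and the loop speeds up dramatically.
    int64_t sum_branchy(const std::vector<int>& data) {
        int64_t sum = 0;
        for (int v : data)
            if (v >= 128)
                sum += v;
        return sum;
    }

    // Branch-free version: the condition is turned into a mask, so the speed
    // no longer depends on the data's ordering. Several compilers will emit
    // a conditional move for the branchy loop and get much the same effect.
    int64_t sum_branchless(const std::vector<int>& data) {
        int64_t sum = 0;
        for (int v : data) {
            int64_t mask = -static_cast<int64_t>(v >= 128); // 0 or all-ones
            sum += v & mask;
        }
        return sum;
    }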

> Polymorphic method invocations (e.g. C++ "virtual functions") are
> similar. Where it is possible for the compiler to make a determination
> ("de-virtualization") or a good guess about the target, they already do
> so. But it is still very common for patterns to exist only at runtime.

I don’t pretend to understand the many pathologies possible in C++ code generation, but I do not see why instrumented tests on sampled user data can’t suss out most of them during program development.  
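
As a sketch of what I mean by an instrumented tool chain: profile-guided optimization in GCC and Clang (compile with -fprofile-generate, run on sampled user data, then recompile with -fprofile-use) feeds a training run back into the compiler, which can then speculatively devirtualize the hot calls. The class names below are purely illustrative:

    #include <memory>
    #include <vector>

    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };

    struct Circle : Shape {
        double r;
        explicit Circle(double r) : r(r) {}
        double area() const override { return 3.141592653589793 * r * r; }
    };

    struct Square : Shape {
        double s;
        explicit Square(double s) : s(s) {}
        double area() const override { return s * s; }
    };

    // The call target here exists only at runtime. If the profile shows that,
    // say, 95% of the shapes are Circles, the compiler can inline
    // Circle::area() behind a cheap type check and keep the virtual call only
    // as a fallback.
    double total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
        double total = 0.0;
        for (const auto& s : shapes)
            total += s->area();
        return total;
    }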

> Eliminating speculative execution would be a disaster for
> performance. It would also be stupid, because the real problem is not
> speculation per se, but covert channels between privilege domains
> (e.g. cache timing attacks).

I am by no means suggesting the elimination of speculative execution. I’m proposing that the RISC-V community skip creating the elaborate hardware mechanisms for predicting which way branches go and push that burden upstream to programmers and the development tool chain. The Spectre paper suggests that there is no magic bullet to eliminate the problem of exploitable side effects from the complex branch prediction mechanisms in modern processors. And even if a brilliant solution emerges, it could well be patented and unavailable to open hardware. Given the risk of state actors building in vulnerabilities, open source software on predictable open hardware seems an essential platform for cryptography. I’m just suggesting an alternative approach that lowers hardware development costs and makes program execution behavior much more predictable, while potentially getting most of the performance benefits. 
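
Existing compilers already accept exactly this kind of upstream hint. __builtin_expect in GCC and Clang (and the C++20 [[likely]]/[[unlikely]] attributes) lets the programmer, or a profiling pass, state which way a branch normally goes, and the compiler lays out the code accordingly. The example below is a made-up illustration, not code from any real library:

    #include <cstdio>

    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    // Hypothetical packet handler: the error check is marked unlikely, so the
    // compiler keeps the error path out of the hot, fall-through code layout.
    int process_packet(const unsigned char *pkt, int len) {
        if (UNLIKELY(pkt == nullptr || len <= 0)) {
            std::fprintf(stderr, "bad packet\n");
            return -1;
        }
        // Common path: straight-line code the fetch unit can stream through.
        return len;  // placeholder for the real work
    }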

Arnold Reinhold


