[Cryptography] letter versus spirit of the law ... UB delenda est

Jerry Leichter leichter at lrw.com
Sat Oct 24 09:39:31 EDT 2015


> 
> In the context of UB (i.e. Undefined Behavior) in the specification
> for the C language, let's consider the /spirit/ of the law.  As 
> a first example, let's consider signed integer overflow.  By all 
> accounts, when the specification was being drafted, a handful of 
> possibilities were considered.  These were based on the behavior
> of the hardware of the day...
And the *spirit* of the language was that C gave you more-or-less direct access to the fastest operation available on the machine; you could then proceed to build other semantics on top of it if you wished.

>  ...There is no advantage in getting the wrong answer quickly.
"The right answer" is what the Standard says you're supposed to deliver.  There is no other definition.

You can argue about whether that's a *useful* answer, but that's a different question.
>  Bypassing security 
>  checks is not really an optimization.  A machine that 
>  is not secure cannot be relied upon for /any/ purpose.
> 
>  Requiring people to use unsigned ints to the exclusion
>  of signed ints when writing reliable code is not really
>  an optimization ... and isn't sufficient anyway.  Using
>  unsigned ints gets rid of the UB with respect to overflow,
>  but other UB remains, including e.g. x<<y....
With the exception of shift, unsigned integers have completely defined behavior:  They provide arithmetic mod 2^n.  So according to your arguments, they are a safer choice.
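
To make the contrast concrete, a minimal sketch of my own (not anything from the Standard or the earlier message):  the unsigned expression is required to wrap mod 2^N, while the analogous signed expression is undefined behavior.

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int u = UINT_MAX;
        printf("%u\n", u + 1u);   /* defined: wraps mod 2^N, prints 0 */

        int s = INT_MAX;
        /* s + 1 would be signed overflow - undefined behavior; a
         * compiler may assume it never happens and optimize accordingly. */
        (void)s;
        return 0;
    }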

However, in my 40+ years of writing code, I've been burned more than once by unsigned overflow (e.g., in loop bounds), but I can't recall a single case of being burned by signed overflow.

If you stay far away from the maximum and minimum representable values, and avoid division, either signed or unsigned arithmetic is an accurate model of the ring of integers with ordering.  It fits your natural intuitions; things work as you expect.  The problem with the unsigned integers is that it's kind of hard to stay away from the minimum representable value of 0!
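
A sketch of how easy it is to hit that boundary - walking an array backwards with an unsigned index (the function is made up, but the shape of the bug is the classic one):

    #include <stddef.h>
    #include <stdio.h>

    void print_backwards(const int *a, size_t n)
    {
        /* The "obvious" loop, for (size_t i = n - 1; i >= 0; i--), never
         * terminates: i >= 0 is always true for an unsigned type, and
         * decrementing past 0 wraps around to SIZE_MAX.  (n - 1 itself
         * wraps when n == 0.)  One common safe idiom instead: */
        for (size_t i = n; i-- > 0; )
            printf("%d\n", a[i]);
    }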

When integers were commonly 16 (or even 8!) bits long, you really had to watch what you were doing even with signed values.  But as we went to 32- and now 64-bit integers, in the overwhelming majority of uses of integers - typically indexes into arrays, loop repetition counts, and counts of "things" (real-world objects, instances of objects in memory) - signed integer values will never stray anywhere near the boundaries, and programs will have the expected behavior.

Yes, there are plenty of attacks based on pushing integers out to their representational boundaries.  These are best dealt with at the edge, before they get deep into the code.  If you don't bounds-check your inputs, overflows are just one problem you're prone to.
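
A minimal sketch of what checking at the edge can look like (the record-count field and the MAX_RECORDS limit are made up for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_RECORDS 100000u   /* application-specific limit, made up here */

    /* Validate an untrusted count right where it enters the program,
     * before it reaches any arithmetic or allocation deeper inside. */
    static bool record_count_ok(uint32_t count_from_wire)
    {
        return count_from_wire >= 1u && count_from_wire <= MAX_RECORDS;
    }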

And, yes, if you're writing code that has to deal with possibly-hazardous inputs - which is a large fraction of code these days, perhaps even most code - you should have access to some basic tools, like safe range checkers.  And of course you should *use* them.
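
One such basic tool might be a checked multiply for computing allocation sizes - a sketch in portable C, not any particular library's API (checked_mul_size is a made-up name):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Returns false if a * b would overflow size_t; otherwise stores the
     * product in *out.  (GCC and Clang also provide __builtin_mul_overflow
     * for the same purpose.) */
    static bool checked_mul_size(size_t a, size_t b, size_t *out)
    {
        if (b != 0 && a > SIZE_MAX / b)
            return false;
        *out = a * b;
        return true;
    }

A caller rejects the input (or fails the allocation) when the check fails, instead of silently wrapping.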

Integers, unsigned or signed, with any semantics you can define for overflow, are not Z.  Floating point values are not R.  Whether exactly what happens when you push these approximations beyond the limits is fully specified or left undefined is less important, in actual practice, than the likelihood that your code will wander into those areas to begin with.  (And, yes, in security "likelihood" changes drastically because an attacker can manipulate it.)

Over the years, there have been different approaches to dealing with this reality:

- Google's programming standards for C++ forbid the use of unsigned integers, except in very limited special cases:  The view is that the original need for unsigned (to be able to represent larger integer values in a 16-bit word) has long disappeared, and what remain are the hazards near 0.
- Ada specified that integer overflow would throw an exception - and the Ariane 5 rocket crashed because exactly that happened, and there was nothing to handle the exception.  If you don't think about the edge cases, your programming language can't help you when you hit them.
- APL many years ago decided that the difference between integer and floating arithmetic (and, for that matter, booleans) was a pointless complication, so internally a variable may be represented as a bit (mainly relevant for arrays - APL programs often manipulate very long bit arrays as part of selection), an integer, or a float, and the system will convert as necessary.  (Division is always done on floats; if you want the result of an integer division, you truncate the result.)  APL also defines equality as "within epsilon" (for a suitable small epsilon) and carries this through consistently, so in fact (1/3)*3 prints 1.  You can easily construct oddball edge cases but for the most part this works well for computational purposes.  I imitated this semantics in a special-purpose language I designed years ago, and it worked well there, too.
- IEEE floating point adds Infinity and NaN to the represented values and has a well-defined semantics for them.  It works well in many cases, but not in all.  A classic example I've seen as the cause of real-world bugs is computing the inner product of a weight vector with a sequence of values, expecting that a 0 in the weight vector will cause the corresponding value to be ignored.  That fails as soon as a value can be infinite or NaN.  (See the sketch after this list.)
- Java chose to simply not support unsigned arithmetic.  Its signed arithmetic is defined to be 2's complement, and overflow/underflow behavior is defined simply as "keep the lowest 32 (64) bits of the result".  While completely specified, this has no reasonable mathematical semantics.  If you do integer arithmetic in Java, stay away from the representational boundaries.  Meanwhile, Java originally chose a super-strict, fully-defined interpretation of IEEE arithmetic, guaranteeing bit-by-bit equality on all implementations.  It turns out that no one much cares for bit-by-bit equality, but they do care about performance - and the original Java requirements killed performance on anything other than SPARC.  So that got dropped.
- Modula-3 had two built-in integer types, INTEGER and CARDINAL.  CARDINAL was *not* unsigned:  It was a subtype of INTEGER consisting of the non-negative values.  You could make your own range subtypes.  Exceeding the bounds of the subtype trapped.  "Unsigned" was available as a library implementing a class, which had its own operations (like ADD) - no possibility of confusion about what "+" meant; the class implemented arithmetic mod 2^n.  I don't recall if it provided division or even ordering.
- C originally allowed arbitrary algebraic re-ordering as long as it was "mathematically" correct, i.e., correct over the reals.  This caused grief for numeric programs written to stick carefully to FP, not real, semantics.  Eventually the standard was changed to limit the reordering of FP expressions.  This was explicitly not done for integer expressions because it was felt that too little code made use of such careful handling and significant optimization possibilities in real, everyday code would be lost.  And the fact is that's probably true.  Some of the optimizations being done today arguably have no benefits in any real code, so they get the bad side-effects without the good primary effects - but that's, as always, a "quality of implementation" issue.
- When Knuth developed TeX, he wanted the results to be the same on all machines - something that would not be guaranteed even though he was writing in Pascal, which was much more tightly specified than C.  So he wrote his own machine-independent fixed-point (scaled-integer) arithmetic routines, used for all computations that affect the final result.  FP is used only in computations that produce log messages and nothing else.
- LISPs have had infinite-precision integer and rational arithmetic for many years.  How much they are used in typical LISP programs is another question - but they are there.  (Even Java has an infinite-precision integer library - it's the right thing to use when computing monetary values.)  Of course, in each case, you pay significantly in performance.
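
Returning to the IEEE inner-product pitfall above: a small self-contained sketch (made-up values) of why a 0 weight does not ignore an infinite entry - 0.0 * Inf is NaN, and the NaN then propagates through the sum:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double w[] = { 1.0, 0.0, 1.0 };        /* weight 0 meant "ignore this slot" */
        double v[] = { 2.0, INFINITY, 3.0 };   /* but the "ignored" slot is Inf */
        double sum = 0.0;
        for (int i = 0; i < 3; i++)
            sum += w[i] * v[i];                /* 0.0 * INFINITY is NaN */
        printf("%f\n", sum);                   /* prints nan, not the hoped-for 5.0 */
        return 0;
    }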

Many problem spaces, many solutions.  No absolute answers.

                                                        -- Jerry


