[Cryptography] GCC bug 30475

Jerry Leichter leichter at lrw.com
Fri May 2 22:12:04 EDT 2014


On May 2, 2014, at 6:02 PM, Nemo <nemo at self-evident.org> wrote:
>> "Some" implementations?  This is the original x86 FPU semantics, the
>> ones present on every 32-bit x86 CPU ever built.  I haven't had any
>> reason to keep up, and I don't know if the most recent FP instruction
>> sets still do the same thing; I suspect they might very well.
> 
> Only if your compiler actually emits x87 instructions, which it probably
> will not unless you ask for a "long double" and/or target very old CPUs.
> 
> Modern x86/x86_64 processors have SSE (or AVX) instructions, which
> operate on 2 (or 4) 64-bit doubles at a time with very low latency and
> proper IEEE-754 behavior. Modern compilers tend to emit such
> instructions, which are both faster and avoid the "extra precision".
The C standards - except possibly C11 - were written at a time when FP on an x86 meant the old instruction sets.  "Very old" isn't quite as old as you make it out to be.  To get double precision FP, you need SSE2 - SSE was single precision only, and C FP arithmetic is usually double precision anyway.  SSE2 was introduced in Pentium 4 at the end of 2000.  (Even SSE only takes you back about a year from that.)  The older chips were the majority for quite some time after that, and you'd certainly want to support them for at least 10 years.  Not so long ago....
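
To make the difference concrete, here's a small sketch; the exact outcome depends on the compiler, optimization level, and which FP unit it targets (gcc's -mfpmath=387 vs. -mfpmath=sse, say):

#include <stdio.h>

/* With x87 code generation the quotient may be recomputed in an 80-bit
 * register while r, having been stored to memory, was rounded to 64 bits,
 * so the comparison below can report a mismatch.  With SSE2 every
 * operation rounds to double and the two values always agree. */
int main(void)
{
    volatile double x = 1.0, y = 3.0;   /* volatile: force real loads        */
    volatile double r = x / y;          /* stored, hence rounded to 64 bits  */

    if (r != x / y)                     /* may be recomputed at 80 bits      */
        printf("excess precision visible\n");
    else
        printf("consistent\n");
    return 0;
}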

I've never heard anyone complaining that the old FP instructions did not have "proper" IEEE-754 behavior.  IEEE-754 allows for a variety of approaches and the 80-bit long double type has been there as long as the spec; Intel didn't invent it.  I've also never heard a numerical analyst complain about the extra precision delivered by the old instruction set.  There was *plenty* of complaining by those familiar with these matters about Java's original attempts to force particular bit-level results on FP computations - it was those who wanted to use Java for numerical computations who wanted it to allow the use of the x86 FP hardware.
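
For the curious, C99's <float.h> makes both formats visible on a given compiler and target - FLT_EVAL_METHOD says whether double expressions are evaluated in a wider format:

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 2 means "evaluate in long double", 0 means "evaluate in the
     * declared type"; 53 and 64 are the double and x87 extended
     * significand widths. */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    printf("DBL_MANT_DIG    = %d\n", DBL_MANT_DIG);
    printf("LDBL_MANT_DIG   = %d\n", LDBL_MANT_DIG);
    return 0;
}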

> Still, C and C++ are arguably too under-specified for serious numerical
> work.
C is rarely used for numerical work because it has no advantages over FORTRAN, which is what numerical analysts grew up with.  (Alan Perlis:  "I don't know what programming language numerical analysts will be using in YYYY, but it will be called FORTRAN."  I think YYYY was 2000, which was well in the future when he first said it.)

C++, on the other hand, is widely used in certain numerical areas because one can build packages with types like arrays or polynomials that closely match the traditional syntax and semantics.  The one thing that held numerical computation back for a while was that compilers were given too much ability to do mathematically valid but numerically invalid transformations on FP code.  That was fixed by various modifications to the C Standard many years ago - mainly preventing the compiler from re-arranging the order implied by parentheses in FP computations.  The claim of "under-specification" is simply wrong.
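
A two-line illustration of the kind of transformation at issue - re-associating across parentheses is mathematically valid but numerically invalid, and a conforming compiler may not do it unless you ask for it (e.g. gcc's -ffast-math):

#include <stdio.h>

int main(void)
{
    double big = 1e16, small = 1.0;

    /* small is absorbed into big and lost: prints 0 */
    printf("%g\n", (small + big) - big);

    /* the parentheses preserve it: prints 1 */
    printf("%g\n", small + (big - big));
    return 0;
}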

>> Initially both x and y are 0
>> Thread 1	Thread 2
>> x = 1;          print x;
>> y = 1;          print y;
>> 
>> What can thread 2 print?  There are three obvious possibilities:
>> 
>> 0 0     - Thread 2 completes before thread 1
>> 1 1     - Thread 1 completes before thread 2
>> 1 0     - Thread 1 executes its first statement, then thread 2 completes
>> 
>> But 0 1 is impossible.
> 
> I believe you botched this example.
Ah, but that's because you think the column headers are x y; they are actually
y x!  :-)

You are, of course, correct - x=1 y=0 is the result of an interleaving in which Thread 2 runs between the two assignments in Thread 1; it's x=0 y=1 that appears to be impossible.  To expand a bit on why it might happen, even on a machine with a very simple and straightforward memory model:  If after these two assignments, y isn't used again for a while but x is used heavily, a compiler might decide to flush y to memory but keep x in a register.  For sequential code, this is perfectly correct - you could never tell the difference.  But a parallel thread *can* see the ordering of the writes to memory.  If a language doesn't explicitly support parallelism, a compiler for it will be *expected* to do these kinds of optimizations.  (In fact, if you're going to insist that *all* stores to memory occur in program order, you're going to seriously cripple any decent register allocation algorithm.)
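
A sketch of what that looks like in source, and of how C11's atomics rule it out - with plain int stores the compiler is free to keep x in a register (and the hardware to reorder the writes), while sequentially consistent atomic stores must become visible in program order:

#include <stdatomic.h>

int x_plain, y_plain;                 /* racy version                    */
atomic_int x_atomic, y_atomic;        /* C11 version                     */

void writer_plain(void)
{
    x_plain = 1;                      /* may sit in a register...        */
    y_plain = 1;                      /* ...while this store is flushed, */
}                                     /* so a reader can see y=1, x=0    */

void writer_atomic(void)
{
    atomic_store(&x_atomic, 1);       /* must be visible before...       */
    atomic_store(&y_atomic, 1);       /* ...this one                     */
}

int reader_atomic(void)               /* returns 1 only in a legal state */
{
    int y = atomic_load(&y_atomic);
    int x = atomic_load(&x_atomic);
    return !(y == 1 && x == 0);       /* y=1, x=0 can never be observed  */
}
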
                                                        -- Jerry


