Rijndael in Assembler for x86?

Eric Young eay at pobox.com
Wed Sep 19 05:42:42 EDT 2001


jamesd at echeque.com wrote:

>  Perry E. Metzger <perry at piermont.com> wrote:
> > >Because it is typically slower by many times than hand
> > >tuned assembler.
>
> On 14 Sep 2001, at 14:24, Ian Goldberg wrote:
> > Are you sure?  For general code, that certainly hasn't been
> > true in a long time; optimizing compilers nowadays can
> > often do *better* than hand-coded assembler.
>
> So say compiler writers.
>
> I have not found this to be true.  Perhaps it is true of some
> compilers and some people's assembler, and some code.

I've done quite a bit of assembler for crypto in the last few years,
and it very much depends on the CPU/compiler (obviously).  The only
platform where I have not been able to beat the compiler by 10% or more
for an algorithm is PA-RISC.  Either my C code is good enough to give
the compiler all the help it needs or I need to revisit the
architecture :-).
The biggest wins are for algorithms where the C compiler does not give
access to underlying primitives, i.e. 32*32->64 multiplies, or where
special tricks can be used due to data relationships that the compiler
cannot know about.
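
To make the multiply case concrete, here is a minimal sketch (my own
illustration, with made-up function names): the portable C form, which
the compiler may or may not map to one instruction, and a gcc
inline-asm form that pins down the single x86 MUL.

  #include <stdint.h>

  /* Portable form: the compiler may recognize the widening multiply
     and emit one MUL, or it may fall back to a generic 64x64 multiply
     routine. */
  static uint64_t mul32x32(uint32_t a, uint32_t b)
  {
      return (uint64_t)a * (uint64_t)b;
  }

  /* gcc inline-asm form: x86 MUL leaves the full 64-bit product in
     EDX:EAX, so we ask for it directly. */
  static uint64_t mul32x32_asm(uint32_t a, uint32_t b)
  {
      uint32_t lo, hi;
      __asm__("mull %3" : "=a"(lo), "=d"(hi) : "0"(a), "rm"(b) : "cc");
      return ((uint64_t)hi << 32) | lo;
  }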

For x86, there are too few registers and lots of black magic going on.
At least for the Pentium, VTune would reveal everything.  For the
PPro/P-II/P-III, depending on the compiler (gcc vs Visual C), the C
code could sometimes get within 30% of the ASM.
For the Pentium 4, ASM is good again.  It seems to be a very 'brittle'
CPU.  E.g. for SHA-1 (the numbers are relative but may or may not have
any relation to something in the real world :-)
                    | P4      | Athlon     | P2 Celeron
                    | 1.7 GHz | 1.4 GHz    | 333 MHz
                    | lnx gcc | cygwin gcc | lnx gcc
SHA1 586            |  78.594 | 135.937    |  26.038
SHA1 686            |  81.986 | 141.996    |  32.481
SHA1 786            | 135.419 | 137.804    |  29.106
SHA1 fast           |  47.864 |  83.846    |  20.828
SHA1 small          |  54.322 |  62.599    |  12.534

Notice how the different assembler versions are all around the same
speed for the P2 and Athlon.  Even the ratio between the C code
versions is similar.  But now look at the P4.  Special magic can be
used to make things very fast, and the 'small' C code version is
faster than the loop-unrolled version (trace cache thrashing?).
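
For reference, the compact style looks roughly like this (a sketch of
a textbook SHA-1 compression function, not the exact code benchmarked
above); the unrolled versions expand the 80-iteration loop into
straight-line code, which is exactly the sort of blow-up that can
thrash the P4's trace cache.

  #include <stdint.h>

  #define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

  /* Compact ("small"-style) SHA-1 compression: one rolled loop over
     all 80 rounds.  Unrolled versions expand this loop completely. */
  void sha1_compress(uint32_t state[5], const uint8_t block[64])
  {
      uint32_t w[80], a, b, c, d, e, f, k, tmp;
      int i;

      for (i = 0; i < 16; i++)                 /* big-endian load */
          w[i] = (uint32_t)block[4*i] << 24 | (uint32_t)block[4*i+1] << 16
               | (uint32_t)block[4*i+2] << 8 | (uint32_t)block[4*i+3];
      for (i = 16; i < 80; i++)                /* message schedule */
          w[i] = ROTL32(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1);

      a = state[0]; b = state[1]; c = state[2]; d = state[3]; e = state[4];

      for (i = 0; i < 80; i++) {
          if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5a827999; }
          else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ed9eba1; }
          else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8f1bbcdc; }
          else             { f = b ^ c ^ d;                   k = 0xca62c1d6; }
          tmp = ROTL32(a, 5) + f + e + k + w[i];
          e = d; d = c; c = ROTL32(b, 30); b = a; a = tmp;
      }

      state[0] += a; state[1] += b; state[2] += c;
      state[3] += d; state[4] += e;
  }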

For PA-RISC, I've done 1.1, 2.0 and 2.0W code, and for some algorithms
I cannot beat the optimizer.  For others, specifically bignum stuff,
the assembler is 2 to 4 times faster.  In this case all multiplies are
done in the FP unit and data has to be swapped between the CPU and FPU
via memory, so there are lots of opportunities to use 64-bit loads etc.
HP has good optimizers for a rather tricky architecture.
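
The bignum inner loop in question is the word-by-word
multiply-accumulate.  A minimal C sketch (the function name is mine):

  #include <stddef.h>
  #include <stdint.h>

  /* rp[0..n-1] += ap[0..n-1] * w, returning the final carry.  On
     PA-RISC the hand-written version keeps the 32*32->64 multiplies
     in the FPU (XMPYU) and moves operands between memory and the FP
     registers with 64-bit loads/stores. */
  uint32_t mul_add_words(uint32_t *rp, const uint32_t *ap, size_t n,
                         uint32_t w)
  {
      uint32_t carry = 0;
      uint64_t t;

      while (n--) {
          t = (uint64_t)*ap++ * w + *rp + carry;
          *rp++ = (uint32_t)t;
          carry = (uint32_t)(t >> 32);
      }
      return carry;
  }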

For SPARC, I've only done digests; a 30-40% speedup seems normal.  It
is a simple architecture, so it is simple for the compiler to do a
good job.

For ARM, the compilers are good and the CPUs are simple; I
consistently get only 20-40%.  The real win comes when trying to be
fast with small code size: it is possible to make things much faster
at the same reduced footprint.  XScale could be interesting since
there are now inter-instruction register dependencies; I've normally
worked on StrongARM, where there are none.

Itanium: amazing speedups with assembler, but hard to write.  It is a
vector processor if there ever was one.  Anything that can do two
64*64->128 multiplies per cycle, but with a 15-cycle latency and no
out-of-order execution, is going to be tricky.

For both ARM and PA-RISC I've been able to use instruction set
features to improve performance.  In theory the algorithms could be
coded in C, but it takes CPU architecture/compiler knowledge to write the code :-).
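
A small example of what that means: the standard C rotate idiom.  On
ARM the barrel shifter makes the rotate essentially free if the
compiler can fold it into an adjacent instruction, but you have to
know to write it this way (a sketch; valid for 0 < n < 32):

  #include <stdint.h>

  /* On ARM, a good compiler folds this idiom into the barrel shifter
     (a ROR applied to the second operand of the neighbouring
     instruction), so the rotate costs nothing.  Written another way,
     it may not be recognized. */
  static inline uint32_t rotl32(uint32_t x, unsigned n)
  {
      return (x << n) | (x >> (32 - n));   /* requires 0 < n < 32 */
  }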

eric




