Serpent close to AES speed thanks to SSE2

Thu Sep 10 04:49:01 EDT 2009

http://www.randombit.net/bitbashing/programming/serpent_in_simd.html

Wed, 09 Sep 2009

Speeding up Serpent: SIMD Edition

The Serpent block cipher was one of the 5 finalists in the AES competition,
and is widely thought to be the most secure of them due to its conservative
design. It was also considered the slowest candidate, which is one major
reason it did not win the AES contest. However, it turns out that on modern
machines one can use SIMD operations to implement Serpent at speeds quite
close to AES.

Serpent uses an interesting bitsliced design with eight 4 bit sboxes which
are computed in parallel using boolean operations on registers. Rather than
splitting up the 32 bit words into nibbles and passing them through table
lookups, a special instruction sequence is used which performs the same
operation using only instructions like AND, OR, XOR, and NOT. Typically these
are done using 32 bit register operations, but it was recently suggested that
SIMD operations such as those available in SSE2 or AltiVec could be used to
encrypt 4 blocks in parallel.

Most cipher modes, such as CBC and OFB, are iterative; after splitting the
plaintext into blocks, the input to the second block depends on the
previously computed ciphertexts. This data dependency means it is impossible
to use a block-parallel implementation in these modes. However some other
modes, including CTR, EAX, and XTS, do not exhibit this data dependency, and
allow for many blocks to be encrypted in parallel. So being able to compute
many encryptions in parallel is only useful for these modes. Fortunately,
CTR, EAX, and XTS are very useful, unpatented, and (in the case of CTR and
XTS) widely standardized modes.

Recently I implemented Serpent using SSE2 intrinsics in the botan
cryptography library. While not quite as fast as AES, it easily boosts
Serpents performance by a factor of over 2.5 on an Intel Core2 processor.

Up until now, botan has used a rather conventional block cipher interface
where only a single block of data (typically 64 or 128 bits) would be
encrypted at a time; processing multiple blocks required calling the function
multiple times, one for each block. However this completely hides any
parallelism from the block cipher implementation. So in the upcoming
development release (1.9.0), botan offers two new interfaces for block
ciphers:

   void encrypt_n(const byte in[], byte out[], u32bit blocks) const;

   void decrypt_n(const byte in[], byte out[], u32bit blocks) const;

which will process many blocks in a single call. In addition some mode
implementations (at this time, ECB and CTR) will batch their inputs into
larger groups. This will not only allow for parallel encryption using SIMD
techniques, it also improves instruction and data cache utilization for all
ciphers. Right now, the modes will batch 8 blocks of data together; it is
unclear if this is sufficient for the best performance, but in any case is
easy to modify by changing a macro value in build.h.

On a 2.4 GHz Intel Core2 with GNU C++ 4.3.3, I got these results:

Algorithm 	1.8.6 (MiB/s) 	1.9.0 (MiB/s) 	Speedup

Serpent/ECB 	42.1 	113.5 	2.7

Serpent/CTR 	39.7 	100.8 	2.5

AES-128/ECB 	112.7 	134.4 	1.2

AES-128/CTR 	99.1 	114.1 	1.15

The AES speedups nicely demonstrate that even without any explicit SIMD
operations, the improved cache utilization can make a pretty big difference.

I also experimented with performing 8 Serpent block operations in parallel,
by interleaving two 4-wide SIMD encryptions. This reduced the number of key
variable loads, as well as offering the processor much more in the way of
independent computations for hot hot superscalar action. On my Core2, this
pushed Serpent's performance north of 160 MiB/s in CTR mode, which is pretty
impressive considering that is right about the speed of OpenSSL's AES-128
implementation on the same platform. However this variant seems slower on
anything but a Core2; tests on an Opteron showed it to be somewhat slower
than 4-way SIMD, and it is highly likely that it would also be much slower on
32-bit x86 processors due to excessive register pressure.

AltiVec looks to be an even more promising platform for multiblock Serpent
encryption, as it includes native rotation instructions, which in SSE2 must
be emulated using two shifts and an OR. It is very likely the Cell processors
SIMD units could also implement Serpent in a SIMD mode. Considering the Cell
SPE contains 128 SIMD registers, it might even be feasible to implement a
variant suggested by Wei Dai of encrypting 128 blocks in parallel without
suffering an excessive number of register spills.

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com