Fast MAC algorithms?

John Gilmore gnu at toad.com
Thu Jul 23 22:24:41 EDT 2009


> >2) If you throw TCP processing in there, unless you are consistently going to
> >have packets on the order of at least 1000 bytes, your crypto algorithm is
> >almost _irrelevant_.

This is my experience, too.  And I would add "and lots of packets".
The only crypto "overhead" that really mattered in a real application
was the number of round-trip times it took to negotiate protocols and
keys.  Crypto's CPU time is very very seldom the limiting factor in
real end-user application performance.
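
A back-of-envelope sketch makes the point.  (The numbers below are
illustrative assumptions of mine -- a modest 2009-era software AES rate
and an ordinary WAN round trip -- not measurements from anything above.)

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only -- assumptions, not measurements. */
    double aes_bytes_per_sec = 50e6;   /* a modest software AES rate, ~50 MB/s */
    double packet_bytes      = 1000;   /* the packet size from the quoted message */
    double rtt_sec           = 0.05;   /* a 50 ms WAN round trip */
    double handshake_rtts    = 3;      /* a TLS-style protocol/key negotiation */

    double crypto_per_packet = packet_bytes / aes_bytes_per_sec;   /* ~20 us */
    double handshake_time    = handshake_rtts * rtt_sec;           /* 150 ms */

    printf("crypto per packet:  %.0f us\n", crypto_per_packet * 1e6);
    printf("handshake time:     %.0f ms\n", handshake_time * 1e3);
    printf("packets' worth of crypto in one handshake: %.0f\n",
           handshake_time / crypto_per_packet);
    return 0;
}

One handshake costs as much wall-clock time as encrypting thousands of
packets, which is why the round trips dominate and the cipher doesn't.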

> Could the lack of support for TCP offload in Linux have skewed these figures
> somewhat?  It could be that the caveat for the results isn't so much "this was
> done ten years ago" as "this was done with a TCP stack that ignores the
> hardware's advanced capabilities".

I have never seen a network card or chip whose "advanced capabilities"
included the ability to speed up TCP.  Most such "advanced" designs
actually ran slower than merely doing TCP in the Linux kernel using an
uncomplicated chip.  I saw a Patent Office procurement of Suns in the
'80s that demanded these slow "TCP offload" boards (I had to write the
bootstrap code for the project) even though the motherboard came with
an Ethernet chip and software stack that could run TCP *at wire speed*
all day and night -- for free.  The super whizzo board couldn't even
send back-to-back packets, as I recall.  Some government contractor
had added the "TCP offload" requirement, presumably to inflate the
price to which they were adding a percentage markup.

As a crypto-relevant aside, last year I looked at using the "crypto
offload" engine in the AMD Geode cpu chip to speed up Linux crypto
operations in the OLPC.  There was even a nice driver for it.
Summary: useless.  It had been designed by somebody who had no idea of
the architecture of modern software.  The crypto engine used DMA "for
speed", used physical rather than virtual addresses, and stored the
keys internally in its registers -- so it couldn't work with virtual
memory, and couldn't conveniently be shared between two different
processes.  It was SO much faster to do your crypto "by hand" in a
shared library in a user process than to cross into the kernel, copy
the data to be in contiguous memory locations (or manually translate
the addresses and lock down those pages into physical memory), copy
the keys and IVs into the accelerator, do the crypto, copy the results
back into virtual memory, and reschedule the user process.  In typical
applications (which don't always use the same key) you'd need to do
this dance once for every block encrypted or perhaps, if you were
lucky, once per packet.  Even kernel crypto wasn't worth doing
through the thing.  And the software libraries were not only faster,
they were also portable, running on anything, not just one obsolete
chip.
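
To make the dance concrete, here is a minimal C sketch of the per-block
sequence.  Every function name below is a hypothetical stand-in I made
up for illustration -- none of this is the real Geode or Linux driver
API:

#include <stddef.h>

/* Hypothetical driver entry points (stubs, so the sketch compiles);
 * made-up names, not the real Geode or Linux crypto interface. */
static unsigned long pin_and_translate(const void *vaddr, size_t len)
    { (void)vaddr; (void)len; return 0; }
static void unpin(unsigned long phys, size_t len)
    { (void)phys; (void)len; }
static void geode_load_key(const unsigned char *key, const unsigned char *iv)
    { (void)key; (void)iv; }
static void geode_dma_crypt(unsigned long phys, size_t len)
    { (void)phys; (void)len; }

/* One block through the offload engine: every step except step 4 is
 * overhead that a software implementation simply doesn't have. */
void encrypt_block_offload(const unsigned char *key, const unsigned char *iv,
                           unsigned char *buf, size_t len)
{
    /* 1. Cross into the kernel -- implied by talking to a driver at all. */
    /* 2. The engine DMAs from physical addresses, so lock the pages into
     *    physical memory and translate virtual -> physical (or copy the
     *    data into a contiguous bounce buffer). */
    unsigned long phys = pin_and_translate(buf, len);
    /* 3. Keys live in device registers, so reload the key and IV -- and
     *    again whenever another process has used the engine in between. */
    geode_load_key(key, iv);
    /* 4. The actual crypto, done in place via DMA. */
    geode_dma_crypt(phys, len);
    /* 5. Get the results back into the process's view of memory, unpin
     *    the pages, and reschedule the user process. */
    unpin(phys, len);
}

The shared-library version of the same operation is one function call on
data already sitting in the process's own memory.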

Hardware guys are just jerking off unless they spend a lot of time
with software guys AT THE DESIGN STAGE before they lay out a single
gate.  One stupid design decision can take away all the potential gain.
Every TCP offloader I've seen has had at least one.

	John



