[Cryptography] Computing the speed of a cipher or hash

Jon Callas jon at callas.org
Tue Aug 26 21:17:25 EDT 2025



> On Aug 25, 2025, at 14:40, Jerry Leichter <leichter at lrw.com> wrote:
> 
>> Speeds of ciphers and hash functions are expressed in cycles per byte. I can 
>> measure the time it takes to process many bytes with the @time macro, but how 
>> do I convert that to cycles? I can look at /proc/cpuinfo while the code is 
>> running, but it shows that some cores are running much faster than others, and 
>> the clock speed varies....
> The notion of "cycles" as some kind of invariant measure for a modern CPU is essentially meaningless.  Besides various internal complexities what with multiple units and caches and such, all modern CPU's are capable of adjusting their speeds in response to load and, particularly, temperature.  Many CPU's are able to run bursts at (sometimes significantly higher) clock speeds, but they can only keep that up for a little while before they get too hot and have to back off.  This would produce very different numbers for "small" vs. "large" inputs.

I have to disagree with you, some. It's not essentially meaningless, it's just complex.

The reasons it's not meaningless include your observation that modern CPUs have adjustable clocks, so the very notion of the "clock speed" of a CPU is an abstraction; there's no piece of quartz beating like a metronome, but a conductor who speeds up when the music is dramatic and slows down when the orchestra starts to sweat.

Thus, a meaningful benchmark can't have hertz in it at all, because there's no such thing as a fixed clock. You have to measure in clocks per byte, or something akin to it. Clocks per byte has the advantage that it's intuitive, even if not really precise.
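
As a concrete aside, here's a minimal sketch of how you might get a clocks-per-byte number yourself on x86-64, using the time-stamp counter via GCC/Clang's __rdtsc(). The process_block() routine is a made-up stand-in, not any real cipher, and keep in mind that on modern parts the TSC ticks at a fixed reference rate rather than the boosted core clock, which is exactly the imprecision we're talking about:

    /* A minimal sketch of measuring clocks per byte on x86-64 with the TSC.
     * Assumptions: GCC/Clang, an invariant-TSC CPU, and a stand-in
     * process_block() function that is NOT any particular cipher or hash. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <x86intrin.h>   /* __rdtsc() */

    #define BUF_LEN (1u << 20)   /* 1 MiB: large enough to smooth out noise */

    static uint8_t buf[BUF_LEN];

    /* Hypothetical placeholder for the primitive being measured. */
    static void process_block(uint8_t *p, size_t len)
    {
        uint64_t acc = 0x9e3779b97f4a7c15ULL;
        for (size_t i = 0; i < len; i++) {
            acc += p[i];
            acc = (acc << 13) | (acc >> 51);  /* 64-bit rotate */
            p[i] ^= (uint8_t)acc;
        }
    }

    int main(void)
    {
        memset(buf, 0xA5, sizeof buf);

        /* Warm the caches and let the clock governor settle. */
        process_block(buf, sizeof buf);

        uint64_t best = UINT64_MAX;
        for (int run = 0; run < 16; run++) {
            uint64_t t0 = __rdtsc();
            process_block(buf, sizeof buf);
            uint64_t t1 = __rdtsc();
            if (t1 - t0 < best)
                best = t1 - t0;
        }

        /* Note: the TSC ticks at a fixed reference rate, not the boosted
         * core clock, so this is only an approximation of core cycles. */
        printf("~%.2f clocks/byte (best of 16 runs)\n",
               (double)best / (double)sizeof buf);
        return 0;
    }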

> 
> As far as I can tell, any cycles/byte rating of code for a modern CPU is either (a) actually for some fictional CPU where we count one cycle per "simple" instruction, like XOR or addition, and avoid "complex" instructions, which you'd expect any modern cipher or hash function to do anyway; (b) based on some kind of measurement that relies on, e.g., /proc/info - which really doesn't mean anything except in the _exact_ scenario under which it was obtained.

Well, mostly it's an actual CPU, like an x64 AMD/Intel CPU or an ARM64 CPU. A real-world example: in the SHA-3 competition, NIST told the submitters that the target for their submissions was an Intel Core 2 Duo, and we all measured our algorithms with actual clocks on an actual CPU, even if that CPU is not being used by anyone today. Its successors are similar, and the ARM64 architecture took the importance of cryptographic algorithms into consideration.

Continuing with that, as it turns out the Core 2 Duo has micro-operations as well as actual clock cycles. It can execute three micro-ops per clock cycle, and ADD, XOR, and ROT are each a single micro-op. Note that this is a huge boon to any ARX algorithm, in a happy alignment of happy little trees.

So quite a few algorithms (Skein and Blake I know for certain) used this for a speedup. Skein specifically designed its MIX function, an ARX combo, to take a clock cycle. So, yeah, Skein, Blake, and some others were effectively tripling their speed in clocks per byte by designing for that CPU. Even algorithms not specifically designed that way took advantage of it because optimizers like to optimize. SHA-2 and other ARX-based S-P network algorithms became dominant because they got improvements in clocks per byte for free.
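
For the curious, the shape of that MIX step looks roughly like this in C: one 64-bit ADD, one ROT, one XOR. The rotation amount here is just an illustrative constant, not one of Skein's actual per-round rotation constants:

    #include <stdint.h>

    /* Rotate left; r is assumed to be in 1..63. */
    static inline uint64_t rotl64(uint64_t x, unsigned r)
    {
        return (x << r) | (x >> (64 - r));
    }

    /* MIX: y0 = x0 + x1;  y1 = (x1 <<< r) ^ y0 */
    static inline void mix(uint64_t *x0, uint64_t *x1, unsigned r)
    {
        *x0 += *x1;               /* ADD - one micro-op */
        *x1 = rotl64(*x1, r);     /* ROT - one micro-op */
        *x1 ^= *x0;               /* XOR - one micro-op */
    }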

There was also an unfortunate generation of Intel CPUs that had ROT take two micro-ops, just as content-addressable storage and heavy use of cryptographic functions were dropping on the world. I could write a couple hundred words on the details. The summary is "oops."

> 
> In practice, (a) above is reasonable if your purpose is to compare different algorithms and their implementations, because the model isn't that far off from the performance that modern machines can provide - as long as you don't take the number too literally.  (Of course, if you have something like an algorithm that relies on lookups in a large table, the _actual_ performance will likely be dominated by cache/memory access effects, which won't show up in any simple model.  But no one with any experience would suggest such an algorithm today.)

Yes, you can't take the number too literally, because you never know when there's going to be something in the hardware, like an ARX triplet taking four micro-ops instead of three. However, these days the two dominant architectures are designed the way they are because cryptography is important, and the up-and-coming ones, like RISC-V, know they have to make ARX work well.

At the same time, you have a point that clocks per byte are not always exact. However, they're calculated not on an abstract machine but on a real one, and that has indirect effects. When I was working on Skein, Skein-512 was very fast both because of the micro-op thing and because all the intermediate results could be held in registers; Skein-1024 did not fit in registers, so its clocks per byte were a bit worse. On a CPU with a larger register file, you can do more.

This has a real-world instantiation in the SHA-2 family. SHA-256 and SHA-512 are basically the same function, but designed for 32-bit and 64-bit registers respectively. On a 64-bit CPU, SHA-512 actually runs faster than SHA-256. Some Intel people designed a thing they called SHA-512/256, which is SHA-512 with a 256-bit output, for the added speed. They also generalized it to SHA-512/t, where t is the output size. See <https://eprint.iacr.org/2010/548.pdf> for details.
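
If you want to see this on your own machine, here's a rough sketch using OpenSSL's EVP interface (build with -lcrypto); on a 64-bit CPU without dedicated SHA instructions you should see SHA-512 come out ahead of SHA-256:

    /* Rough sketch comparing SHA-256 and SHA-512 throughput via OpenSSL. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <openssl/evp.h>

    #define BUF_LEN (1u << 24)   /* 16 MiB of dummy data */
    static unsigned char buf[BUF_LEN];

    static double mb_per_sec(const EVP_MD *md)
    {
        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int dlen = 0;
        const int passes = 8;
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();

        clock_t t0 = clock();
        for (int i = 0; i < passes; i++) {
            EVP_DigestInit_ex(ctx, md, NULL);
            EVP_DigestUpdate(ctx, buf, sizeof buf);
            EVP_DigestFinal_ex(ctx, digest, &dlen);
        }
        clock_t t1 = clock();

        EVP_MD_CTX_free(ctx);
        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        return ((double)passes * sizeof buf) / (secs * 1e6);
    }

    int main(void)
    {
        memset(buf, 0x5A, sizeof buf);
        printf("SHA-256: %.1f MB/s\n", mb_per_sec(EVP_sha256()));
        printf("SHA-512: %.1f MB/s\n", mb_per_sec(EVP_sha512()));
        return 0;
    }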

	Jon


