<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Jul 13, 2015 at 4:59 AM, Zooko Wilcox-OHearn <span dir="ltr"><<a href="mailto:zooko@leastauthority.com" target="_blank">zooko@leastauthority.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">So, I'd suggest that if full Blake2 is too CPU-intensive for this<br>

usage, then reduce the rounds of Blake2, but keep as many rounds as<br>

you can tolerate. But definitely not fewer than 3 rounds. Or 4. Maybe<br>

5. ☺ Hopefully that would make it close enough in performance to<br>

SipHash for this usage.<br>

<br>

Regards,<br>

<br>

Zooko<br>

<div class="HOEnZb"><div class="h5">_______________________________________________<br>

The cryptography mailing list<br>

<a href="mailto:cryptography@metzdowd.com">cryptography@metzdowd.com</a><br>

<a href="http://www.metzdowd.com/mailman/listinfo/cryptography" rel="noreferrer" target="_blank">http://www.metzdowd.com/mailman/listinfo/cryptography</a></div></div></blockquote></div><br></div><div class="gmail_extra">A possible solution to this speed problem is to generate a small number of full secure hashes, and then rapidly generate a large number of fast hashes from them.  It complicates verification a bit, but not much.  If two non-cryptographic hashes collide, you just need to verify how they were generated from the secure hashes.</div><div class="gmail_extra"><br></div><div class="gmail_extra">For example, a hash value can be indexed by an (x, y) pair, where x is the value used in counter-mode to generate the secure hash h = H(block-digest, x), and y is the counter-mode value used to generate the fast hash from it f = FH(h, y).  Verification requires computing FH(H(x1), y1) == FH(H(x2), y2) where x1 != x2 and H(x1) != H(x2).</div><div class="gmail_extra"><br></div><div class="gmail_extra">The security analysis is tricky.  If there is a mathematical solution to finding y1 and y2 to generate a collision given h1 and h2, then an attacker can use this and avoid the full brute-force collisions search.  For example, if FH(h, y) = h + y, then it becomes trivial to verify if there are any values of y1 and y2 in the legal range that lead to collisions for any given h1 and h2.  In this case, there are collisions iff abs(h1 - h2) <= maxY.  I've been using maxY == 32, and FH does xors, additions, and rotations to make it hard to find simple mathematical techniques to find FH collisions faster than computing all 64 FH values.  A single Blake2b hash round seems like a good candidate for FH, for example.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Assuming FH is reversible (which is true in the functions I've tried), one possible attack is to guess a colliding value C for FH given x1 and x2, and see if there are valid y1 and y2 to generate C.  If maxY is large, then this attack might be faster than computing all the FH values looking for collisions.  Since maxY is small (like 32), in a space of very many possible FH values (2^50 of them), a scheme where we have to guess C first wont work out.</div><div class="gmail_extra"><br></div><div class="gmail_extra">I think this is secure for small values of maxY, and an HF that resists algebraic analysis, such as ARX hashes.</div><div class="gmail_extra"><br></div><div class="gmail_extra">On the down-side, even though I can now generate hash data at the full bandwidth to my external DRAM, the hash-table lookups still heavily dominate runtime, even when the tables fit into L2 cache.  An FPGA with similar memory bandwidth should be able to eliminate this extra latency, and certainly we can do it with a custom ASIC.  It seems that setting random bits in L2 is a slow operation.  I think the fastest algorithm I've come up with so far is to do 2 rounds of radix-sort on the generated hashes, which has very good cache performance.  After that, the buckets of hashes to be checked should fit into L1 cache.  On my laptop, this agorithm can generate 1 GiB of hash data in 0.2 seconds, but one radix-sort rouund takes 0.6 seconds.  The two radix-sorts and the final L1 hash table lookups run around 1.5 seconds.  However, the simple algorithm of just storing hashes into a large external hash table takes only 2.7 seconds, so the speed-up is less than 2X.  This is not very impressive.  A custom circuit will beat this runtime by the difference in my bandwidth vs theirs.  I write 1 GiB twice, and read it twice, in 1.5 seconds.  With a better on-chip memory designed for this partial sorting, I would only have to read and write memory once.  An FPGA version with 25 GiB/s to memory should run 10X faster than me.  I can improve with 2 processes in parallel, possibly reducing the FPGA advantage to 5X-ish.</div><div class="gmail_extra"><br></div><div class="gmail_extra">I don't know why radix sort is so slow.  When using 1 bucket, I generate 1GiB of hashes in 0.21 seconds.  With 2, it increases to 0.4 seconds, which makes some sense.  With 256 buckets, it takes 0.6 seconds, and that required using Intel's non-temporal streaming intrinsics.  I do not understand why there is a decrease in speed when going from 2 buckets to 256, since I cache the data in each bucket in a 4 KiB cache, and only write then to main memory when full.  If I could get this to run at the full memory bandwidth, I'd be in good shape.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Bill<br></div></div>