[Cryptography] Are Momentum and Cuckoo Cycle PoW algorithms broken?
waywardgeek at gmail.com
Mon Jul 13 14:29:25 EDT 2015
On Mon, Jul 13, 2015 at 4:59 AM, Zooko Wilcox-OHearn <
zooko at leastauthority.com> wrote:
> So, I'd suggest that if full Blake2 is too CPU-intensive for this
> usage, then reduce the rounds of Blake2, but keep as many rounds as
> you can tolerate. But definitely not fewer than 3 rounds. Or 4. Maybe
> 5. ☺ Hopefully that would make it close enough in performance to
> SipHash for this usage.
> The cryptography mailing list
> cryptography at metzdowd.com
A possible solution to this speed problem is to generate a small number of
full secure hashes, and then rapidly generate a large number of fast hashes
from them. It complicates verification a bit, but not much. If two
non-cryptographic hashes collide, you just need to verify how they were
generated from the secure hashes.
For example, a hash value can be indexed by an (x, y) pair, where x is the
value used in counter-mode to generate the secure hash h = H(block-digest,
x), and y is the counter-mode value used to generate the fast hash from it
f = FH(h, y). Verification requires computing FH(H(x1), y1) == FH(H(x2),
y2) where x1 != x2 and H(x1) != H(x2).
The security analysis is tricky. If there is a mathematical solution to
finding y1 and y2 to generate a collision given h1 and h2, then an attacker
can use this and avoid the full brute-force collisions search. For
example, if FH(h, y) = h + y, then it becomes trivial to verify if there
are any values of y1 and y2 in the legal range that lead to collisions for
any given h1 and h2. In this case, there are collisions iff abs(h1 - h2)
<= maxY. I've been using maxY == 32, and FH does xors, additions, and
rotations to make it hard to find simple mathematical techniques to find FH
collisions faster than computing all 64 FH values. A single Blake2b hash
round seems like a good candidate for FH, for example.
Assuming FH is reversible (which is true in the functions I've tried), one
possible attack is to guess a colliding value C for FH given x1 and x2, and
see if there are valid y1 and y2 to generate C. If maxY is large, then
this attack might be faster than computing all the FH values looking for
collisions. Since maxY is small (like 32), in a space of very many
possible FH values (2^50 of them), a scheme where we have to guess C first
wont work out.
I think this is secure for small values of maxY, and an HF that resists
algebraic analysis, such as ARX hashes.
On the down-side, even though I can now generate hash data at the full
bandwidth to my external DRAM, the hash-table lookups still heavily
dominate runtime, even when the tables fit into L2 cache. An FPGA with
similar memory bandwidth should be able to eliminate this extra latency,
and certainly we can do it with a custom ASIC. It seems that setting
random bits in L2 is a slow operation. I think the fastest algorithm I've
come up with so far is to do 2 rounds of radix-sort on the generated
hashes, which has very good cache performance. After that, the buckets of
hashes to be checked should fit into L1 cache. On my laptop, this agorithm
can generate 1 GiB of hash data in 0.2 seconds, but one radix-sort rouund
takes 0.6 seconds. The two radix-sorts and the final L1 hash table lookups
run around 1.5 seconds. However, the simple algorithm of just storing
hashes into a large external hash table takes only 2.7 seconds, so the
speed-up is less than 2X. This is not very impressive. A custom circuit
will beat this runtime by the difference in my bandwidth vs theirs. I
write 1 GiB twice, and read it twice, in 1.5 seconds. With a better
on-chip memory designed for this partial sorting, I would only have to read
and write memory once. An FPGA version with 25 GiB/s to memory should run
10X faster than me. I can improve with 2 processes in parallel, possibly
reducing the FPGA advantage to 5X-ish.
I don't know why radix sort is so slow. When using 1 bucket, I generate
1GiB of hashes in 0.21 seconds. With 2, it increases to 0.4 seconds, which
makes some sense. With 256 buckets, it takes 0.6 seconds, and that
required using Intel's non-temporal streaming intrinsics. I do not
understand why there is a decrease in speed when going from 2 buckets to
256, since I cache the data in each bucket in a 4 KiB cache, and only write
then to main memory when full. If I could get this to run at the full
memory bandwidth, I'd be in good shape.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the cryptography