[Cryptography] Are Momentum and Cuckoo Cycle PoW algorithms broken?

Thu Jul 9 22:53:24 EDT 2015

I analyzed Momentum in more detail.  It is a very cool algorithm, and it is
not broken, but it is also not sequential-memory-hard.  However, it does
seem reasonably ASIC-resistant, more than I thought it would be.  Momentu's
ASIC defense seems to lie somewhere between BitCoin's SHA-256 based PoW,
and a large-memory Scrypt based PoW.  It's simplicity and rapid constant
time verification are a big plus.

I believe there will be an advantage for ASICs in that the external DRAM
can be eliminated because all the data can be split among many different
ASIC on-chip memories and that data can be shared while the ASICs do their
computations in parallel.  This is why Momentum is not
sequential-memory-hard.  The cost to do this is a special router between
the ASICs (one that operates as a barrel-shifter would work well).  One
that can only do barrel-shifter style routing would work well, and should
not significantly slow down computation.  However, this router could easily
cost more than the external DRAMs, so the savings may not work out in the
end.  However, it does prove that Momentum is not sequential-memory-hard,
which requires that at most no more than a constant speedup in runtime
occur when using no extra memory but more processing.

There are two other advantages that could be used in an ASIC
implementation, even without communication between the ASICs.  First,
Intel's SHA256 instructions are very slow compared to hardware.  An ASIC
will easily fill it's DRAM memory bandwidth, while even with 8 threads, we
can't even fill one bank worth of DRAM memory bandwidth using Intel/AMD
CPUs.  Second, some high-end ASICs have memory bandwidths exceeding 300
GiB/s, while my Intel CPU in my desktop is limited to about 25 GiB/s.
Since memory bandwidth is the liming factor in this case for the ASIC,
there seems to be a difference of up to 10X for improved hashing speed, and
another 10X for improved memory bandwidth, for a combined potential
advantage of 100X faster.  If a memory-hard algorithm like
Yscrypt/Lyra2/Argon2d/Scrypt were used, it would reduce this gain to maybe
10X, just the difference in memory bandwidth, and possibly less with
multiplication-chain compute-time hardening.

So, Momentum is better than I thought, but I think we should stop calling
it "memory-hard", and use an alternate term.  It is memory-intensive,
however.

As for Cuckoo Cycle, it inherits the memory-intensity of Momentum, adds
some TMTO resistance, and _might_ have some additional ASIC defense because
it finds cycles of length >2.  It is still important to keep the hash
function fast, and Cuckoo Cycle uses Siphash, so I'm please with that
choice.  The current dependence on DRAM cache-miss latency concerns me,
though.  I suspect I could design around it, as I do for Momentum below.
More analysis is needed, IMO.

Overall, the Momentum and Cuckoo Cycle authors seem to have done their
homework, and I found no glaring algorithm flaws.  The simplicity of
Momentum gives me some confidence in the algorithm.

----------------

Here I talk about the Momentum implementation in the BitShare miner code I
read.  It's flaws do not directly reflect on the Momentum algorithm, and if
this was meant purely as reference code, then it does it's job well.
However, it is slow, which seems odd for a miner.

If implementing Momentum, you must be very careful about relative sizes of
the input and output to the birthday hash function.  You may be vulnerable
to a parallel Pollard rho attack otherwise.  In particular, if the input
and output widths are equal (say 50 bits), then the memory requirement
almost completely goes away, and you can mount a massively parallel attack
with standard parallel Pollard Rho.  If the input bit width is about half
the output width, this seems to defeat Pollard Rho.  I am comfortable
saying that there seems to be no known way (at least that I can find after
some effort) to reduce the memory without increasing computations
proportionally in this case.  I am not ready to say it is impossible.
Clearly we cannot detect the collisions with reduced computations, but
there may still be decent memory reduction algorithms.

There are some issues with the implementation I read, though it was well
implemented, clear, and concise.  First, the SHA-512 hash fails to take
advantage of the new SHA-256 acceleration instructions.  It should be
replaced with SHA256, or possibly Siphash or maybe Blake2b.  As-is, an ASIC
attacker is getting a very significant bonus due to the slow hashing speed
of SHA-512.

Second, this implementation is needlessly slow, likely being cache-miss
penalty bound.  Perhaps it was only meant to be a reference minter.  A
faster implementation would have maybe 1024 buckets for storing hashes that
would be indexed by the high 10 bits of the birthday hash.  Worker threads
would generate hashes based on a fast hash rather than SHA512, and one
thread would gather the resulting hashes and append them to the buckets.
This would mostly eliminate L3 cache miss penalties during hash
generation.  A second pass would read the buckets one at a time (maybe with
parallel threads if the bandwidths allow), storing the hashes into a hash
table in L2 cache.  Again, this pass should incur very few L3 cache-miss
penalties, making me feel a lot better about ASIC defense, since we can at
least hope to limit the ASIC attack speed by I/O bandwidth.

I've read a few miner implementations now.  This one was nice.  They just
need a bit of a speed-monger to help out :-)

Bill
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.metzdowd.com/pipermail/cryptography/attachments/20150709/0367a53c/attachment.html>