<div dir="ltr"><div class="gmail_extra">A problem with Momentum as currently implemented is that it can be run with reduced memory, enabling <a href="http://da-data.blogspot.com/2014/01/gaining-momentum-duplicate-detection-in.html">efficient GPU attacks</a>.  I think we can get around this problem simply by changing the parameters it uses.  Here's what Momentum computes:</div><div class="gmail_extra"><br></div><div class="gmail_extra">for i, j in [0 .. 2^n-1]:</div><div class="gmail_extra">    if Hb(i) == Hb[j] and H(i || j) < threshold:</div><div class="gmail_extra">        report i, j</div><div class="gmail_extra"><br></div><div class="gmail_extra">In this case, Hb is specific to a block-chain digest and nonce, as is H.  In the current implementations, n is 2^26 and the size of Hb's output is 50 bits.  H's output is far larger, either 256 or 512 bits.  This leads to on average something like 3-ish collisions per nonce tried.</div><div class="gmail_extra"><br></div><div class="gmail_extra">If instead, we run with n == 26, and Hb's output also being 26 bits, then we get around 2^26 collisions.  Each needs to be tested to see if H(i || j) < threshold.  The obvious algorithm is to do this rapidly is:</div><div class="gmail_extra"><br></div><div class="gmail_extra">Initialize array M to 2^26 empty lists.</div><div class="gmail_extra">for i in [0 .. 2^n-1]:</div><div class="gmail_extra">    list = M[Hb(i)]</div><div class="gmail_extra">    for j in list:</div><div class="gmail_extra">        # i collides with j</div><div class="gmail_extra">        if H(i || j) < threshold:</div><div class="gmail_extra">            report i, j</div><div class="gmail_extra">        if H(j || i) < threshold:</div><div class="gmail_extra">            report j, i</div><div class="gmail_extra">    list.append(i)<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">This seems to defeat Bloom filters and similar memory reduction techniques.  While a parallel Pollard Rho algorithm would work, it's slow and reduces memory no more than the simple TMTO attack where we only store some small range of Hb(i) in the hash table to reduce it's size.</div><div class="gmail_extra"><br></div><div class="gmail_extra">I am not sure why this approach is not used in current implementations.  Is there some attack I'm not seeing?</div><div class="gmail_extra"><br></div><div class="gmail_extra">While this slight modification might enable Momentum to resist GPU attacks, it's runtime is still dominated by Hb and H calculations, which an ASIC will speed up dramatically.  I get around that by generating 32 non-cryptographic hash values for each cryptographic hash generated, getting a speed-up of over 16X.  However, modern Intel CPUs seem to be very weak at running buffered radix-sort.  An ASIC will run at full memory bandwidth, which might be far higher than a typical CPU's bandwidth.  For example, I can fill 1 GiB with unsorted birthday hashes in 0.2 seconds on my machine.  When I create 256 buffers of 4KiB each, and do a byte-based radix-sort pass, this delay jumps up to about 0.65 seconds.  This is with temporal streaming writes, so I'm not polluting L2 or L3 cache with my buffer writes.  An ASIC would see no such additional delay, and would likely have significantly higher bandwidth to memory as well.</div><div class="gmail_extra"><br></div><div class="gmail_extra">I'm think that ASICs will still win significantly vs CPUs on every implementation I've seen of Momentum.  The enhancements above should, if I'm not mistaken, at least stop GPUs.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Bill</div></div>