[Cryptography] Is this parallel DPL solver for elliptic curves known?

Tue Jul 7 21:03:54 EDT 2015

Sorry if this is just a matter of my relative ignorance of crypto...

The algorithms I've found with my poor google-fu solve DLP over elliptic
curves in time much worse than sqrt(q)/p, for a group of order q, running
on p processes.  The obvious parallel algorithm is to cut the search space
into p linear sections, and have p processors run the algorithm on it's
sub-interval.  The computations and memory for each process is then
sqrt(q/p), for a total of sqrt(p)*sqrt(q), far worse than the serial case.

If instead, we could do this:

1) first, compute sqrt(q) baby step values in parallel, with total memory
sqrt(q) and time sqrt(q)/p.  Store these to cheap disk storage
2) next, assign each node a range of the result space, so the first node is
assigned hash values from 1 to q/p, etc
3) In a linear pass over the disk data, sort it into p lists according to
it's assigned node
4) In p steps, using a barrel-shifter shaped router, route hashes to their
destination nodes.  First, transmit the node0 data, then node1, etc.  This
takes time proportional to q/p, and a barrel-shifter shaped router of depth
log(p) and width p.  It is efficiently implemented using FPGAs, such as
those from Achronix.
5) Now repeat steps 1-4 using the giant steps, rather than baby steps.  At
the end, each node must search for identical hashes in the data it has
locally.
6) Sort the data using a fast radix sort to find the discrete log

Latency of the routing network is low enough to be ignored.  The bandwidth
of the hard disks involved is the limiting factor.  The algorithm can be
sped up so that instead of writing to local disk and partially sorting the
data, we simply write to local RAM and transmit buffers at nearly
disk-write speed to their destination, by having the barrel-shifter router
cycle rapidly.  So, the first 2 passes have runtime O(sqrt(q)/p) each.
Radix sort of the resulting data is fast, but there will still be several
passes over the disk.  This is the dominant step of the algorithm.  For
example, if sorting 16-byte values using radix 256, it will take 16 passes.

If I did the math correctly, the overall runtime is a small factor,
O((sqrt(b/r) + 2)*sqrt(q)/p), where b is the bits in the hash, and r is the
radix of the sorting step.  The speedup and total memory reduction is a
factor of O(p^(3/2)/(sqrt(b/r) + 2)).

This can be used for DLP, meet-in-the-middle parallel attacks, and other
applications.

Thanks,
Bill
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.metzdowd.com/pipermail/cryptography/attachments/20150707/c9048008/attachment.html>