Bernstein's NFS machine

Nomen Nescio nobody at dizum.com
Sat Mar 2 15:20:26 EST 2002


More analysis of Dan Bernstein's factoring machine from
http://cr.yp.to/papers.html#nfscircuit.

The NFS algorithm has two phases.  The first searches for coefficient
pairs (a,b) from some interval which are relatively prime and for which
two associated values are smooth.  Smoothness is with respect to a
factor base consisting of all primes less than some bound.  (Actually
there may be two or more bounds used for different tests, but that
complication will be ignored here.)  Each success produces one relation,
which becomes one row of a matrix whose columns correspond to the primes
in the factor base.
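
To make the notion of a relation concrete, here is a toy sketch (not from
Bernstein's paper): a smooth value becomes a row of exponents mod 2 over
the factor base.

    def exponent_vector_mod2(value, factor_base):
        """Exponents mod 2 of value over factor_base, or None if not smooth."""
        row = []
        for p in factor_base:
            e = 0
            while value % p == 0:
                value //= p
                e += 1
            row.append(e % 2)
        return row if value == 1 else None

    factor_base = [2, 3, 5, 7, 11, 13]                        # primes below a toy bound B
    print(exponent_vector_mod2(2 * 2 * 3 * 13, factor_base))  # [0, 1, 0, 0, 0, 1]
    print(exponent_vector_mod2(3 * 17, factor_base))          # None: 17 exceeds the bound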

Once enough relations are collected (more rows than columns) the second
phase begins.  This involves looking for linear dependencies among the
rows of the matrix, where the arithmetic is over GF(2).  Once a dependency
is found, the corresponding rows can be combined to produce a value with
two essentially different square roots modulo n, which with good
probability leads to a factorization of n.
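
On a toy scale, finding such a dependency is just Gaussian elimination
over GF(2).  Here is a minimal sketch (real implementations use sparse
iterative methods such as block Lanczos or Wiedemann, but the goal is
the same):

    def find_dependency(rows):
        """Return indices of rows XORing to zero, or None if independent."""
        n_cols = len(rows[0])
        pivots = {}                              # leading column -> (row, member set)
        for i, original in enumerate(rows):
            row, members = list(original), {i}
            while True:
                col = next((c for c in range(n_cols) if row[c]), None)
                if col is None:                  # row cancelled completely
                    return sorted(members)
                if col not in pivots:
                    pivots[col] = (row, members) # keep as a new pivot row
                    break
                prow, pmembers = pivots[col]
                row = [a ^ b for a, b in zip(row, prow)]
                members ^= pmembers              # track which original rows are combined
        return None

    rows = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]     # row0 + row1 + row2 = 0 mod 2
    print(find_dependency(rows))                 # [0, 1, 2]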

Dan Bernstein proposes methods to significantly reduce the cost (defined
as space times time) of both phases.

L is defined as exp(log(n)^(1/3) * log(log(n))^(2/3)), where n is the
number to be factored and log is the natural logarithm.  It is a key
parameter for the work cost of the NFS class of algorithms.

    n      L
-------------
 2^512   2^33
 2^1024  2^45
 2^1536  2^54
 2^2048  2^61
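
These figures can be reproduced with a few lines of Python, rounding L
to the nearest power of two:

    import math

    def L_bits(n_bits):
        ln_n = n_bits * math.log(2)
        return (ln_n ** (1/3)) * (math.log(ln_n) ** (2/3)) / math.log(2)

    for n_bits in (512, 1024, 1536, 2048):
        print("n = 2^%d:  L ~ 2^%.0f" % (n_bits, L_bits(n_bits)))
    # prints roughly 2^33, 2^45, 2^54, 2^61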

Let E represent the bound on the (a,b) coefficient values which are
searched.  Then E^2 values will be checked for relative primality and
smoothness.  Let B represent the bound on the factor base.  A smooth
value is one all of whose prime factors are less than B.
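
As a minimal illustration of the definition, with trial division and a
toy bound far below anything realistic:

    def is_smooth(value, B):
        """True if every prime factor of value is less than B (trial division)."""
        for f in range(2, B):
            while value % f == 0:
                value //= f
        return value == 1

    print(is_smooth(2**5 * 3 * 7, 10))   # True:  factors 2, 3, 7 are all below 10
    print(is_smooth(2**5 * 11, 10))      # False: 11 is not below 10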

In previous work, sieving is used to test for smoothness, which is in
part where the Number Field Sieve algorithm gets its name.  Sieving
E^2 values over a factor base of size B takes roughly E^2 log log B
time and B space.  We will ignore the log log B factor and just say
E^2 time and B space, for a total work factor of B*E^2.
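
For concreteness, here is a toy sketch of sieving one block for smooth
values.  Real NFS code sieves polynomial values indexed by (a,b) and
works with approximate logarithms; this just sieves a small interval of
ordinary integers:

    import math

    def sieve_block(start, length, factor_base):
        logs = [0.0] * length
        for p in factor_base:
            first = (-start) % p                 # index of first multiple of p
            for i in range(first, length, p):
                x = start + i
                while x % p == 0:                # credit the full power of p
                    logs[i] += math.log(p)
                    x //= p
        # keep values whose sieved log accounts for (almost) all of log(value)
        return [start + i for i in range(length)
                if logs[i] >= math.log(start + i) - 0.5]

    print(sieve_block(1000, 100, [2, 3, 5, 7]))  # the 7-smooth numbers in [1000, 1100)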

With a factor base of size B, roughly B relations have to be found.
Then a sparse B by B matrix must be solved in the second phase.
Using conventional algorithms this solution takes roughly B^2 time on
a machine that has B space.  The total work factor is B^3.

For optimum balance, the work factor of these two phases is set equal,
giving B^3 = B*E^2, or B=E.  It is also necessary that E be large
enough to produce at least B relations.  This leads to the solution
E = B = L^0.961.  The corresponding work factor is this value cubed,
or L^2.883.  (Dan Bernstein gives a slightly lower figure due to an
improvement by Don Coppersmith.)

Bernstein improves on both of these phases.  He points out that for
sufficiently large n, it will be less work to determine smoothness by
simply trying to factor each value using the Elliptic Curve factoring
method (ECM).  This is the best factoring method known for finding
relatively small factors, as is required here.  Attempting to factor
E^2 values over a factor base of size B takes time proportional to E^2
times a per-value cost which is larger than in the sieve case, but
still negligible at the scale of these estimates.  So the time is
basically proportional to E^2, as with the sieve.

The savings comes in the space.  The sieve requires an array to hold
the primes in the factor base, and an array to represent the values
being sieved.  The most efficient approach is to set these two arrays
to roughly the same size, breaking the input into pieces of size B.
This does not slow the algorithm but allows it to run in space B as
assumed above.

In contrast, the ECM approach requires no large storage.  It does not
have to store the factor base or any representation of a large block of
numbers to be checked.  It simply runs the ECM factoring algorithm on
each value to be checked until it either finds a factor or times out.
If it finds a factor, it divides it out and repeats the process with
the residue.  Eventually the residue is small enough that the number is
known to be smooth, or the algorithm times out, indicating that the
value is (almost certainly) not smooth.  Storage requirements are
minimal.  Based on this, the ECM
approach takes time proportional to E^2 and space essentially a constant,
for a total work of E^2.  This is compared with B*E^2 for the sieve.
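
In outline, each ECM unit would run a loop like the following sketch.
Here ecm_find_factor is a hypothetical stand-in for an ECM routine tuned
to find factors below B; trial division is substituted only so the
sketch runs:

    def ecm_find_factor(value, bound):
        # Hypothetical stand-in: a real unit would run elliptic-curve trials
        # here.  Trial division is used only so the sketch is runnable.
        for f in range(2, bound):
            if value % f == 0:
                return f
        return None                              # "timed out": no small factor found

    def is_smooth_by_ecm(value, B):
        """True if all prime factors of value appear to be below B."""
        while value >= B:                        # a residue below B is trivially smooth
            f = ecm_find_factor(value, B)
            if f is None:                        # gave up: declare the value not smooth
                return False
            while value % f == 0:                # divide the found factor out fully
                value //= f
        return True

    print(is_smooth_by_ecm(2**10 * 3**5 * 97, 100))   # True
    print(is_smooth_by_ecm(2**10 * 101, 100))         # False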

Bernstein also improves on the second phase, finding a dependency among
the rows of the matrix.  It still requires space of size B to hold the
sparse B by B matrix.  But by making the matrix elements active
processing cells instead of dumb memory cells, he can accelerate the
necessary matrix operations.  The result is that finding the
dependencies takes time B^1.5 rather than B^2.  The total work is then
B^2.5.
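
The operation such a solver repeats is essentially a sparse
matrix-vector product over GF(2).  A minimal serial sketch of that inner
operation (not the proposed circuit, just the arithmetic it would carry
out):

    def sparse_matvec_gf2(rows, v):
        """Multiply a sparse GF(2) matrix by the 0/1 vector v."""
        result = []
        for cols in rows:                # each row is the list of columns holding a 1
            bit = 0
            for c in cols:
                bit ^= v[c]              # XOR together the selected entries of v
            result.append(bit)
        return result

    rows = [[0, 2], [1], [0, 1, 2]]             # toy 3x3 sparse matrix
    print(sparse_matvec_gf2(rows, [1, 1, 0]))   # [1, 1, 0]

On a conventional machine each such product costs roughly B operations
and roughly B of them are needed, which is where the B^2 time figure
above comes from.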

For optimality these two phases are balanced with E^2 = B^2.5 or
E=B^1.25.  Incorporating the requirement that E be big enough to
generate the necessary B relations, the solution is E=L^0.988 and
B=L^0.790.  The total work is E^2 or L^1.976.  This is compared to the
L^2.883 value above, a significant savings.
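
A quick check of the arithmetic behind these exponents:

    beta = 0.790                  # B = L^beta
    e_exp = 1.25 * beta           # E = B^1.25
    print(e_exp, 2 * e_exp)       # 0.9875 1.975, i.e. roughly L^0.988 and L^1.976
    print(3 * 0.961)              # 2.883, the conventional figure for comparison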

It is interesting to compare the approaches used to accelerate the
two phases.  In each case we are essentially replacing memory with
computational elements, but the kinds of computation being done are very
different.

The matrix solution phase requires a high degree of parallelism.
The matrix-storage elements need to be enhanced with high-speed local
interconnects and rudimentary processing capable of simple comparison
and exchange operations.  This would be an attractive candidate for a
custom IC where the basic processing cell would be small and replicated
many times across a chip.
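
The compare-and-exchange primitive is as simple as it sounds; a toy
one-dimensional sorting example, purely to illustrate the per-cell
operation:

    def odd_even_sort(a):
        """Sort using only neighbor compare-and-exchange steps."""
        a = list(a)
        for step in range(len(a)):
            for i in range(step % 2, len(a) - 1, 2):   # alternate even/odd pairs
                if a[i] > a[i + 1]:                    # compare with right neighbor...
                    a[i], a[i + 1] = a[i + 1], a[i]    # ...and exchange if out of order
        return a

    print(odd_even_sort([5, 1, 4, 2, 3]))              # [1, 2, 3, 4, 5]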

In the smoothness testing phase, on the other hand, the requirements for
the computing elements are much greater.  Each one needs to be capable
of running the ECM factoring algorithm on large values of hundreds of
bits in size.  This is a relatively complex algorithm and would probably
require what is essentially a CPU with some kind of firmware to hold
the program.  It would probably not have to be as large and complex as
modern CPUs, but would be much more than the simple augmented memory
cell needed in the matrix solution phase.

The smoothness testing phase has the advantage that the individual
tests can be run independently, without any need to interconnect the
processing units (other than for I/O).  There is more flexibility in
terms of time/space tradeoffs in this phase.  If we need to use fewer
units and take more time, that will not affect the total work involved.

(Possibly a good architecture for the smoothness phase would be similar
to the 1980s-era Connection Machine from Thinking Machines.  The CM-2
had up to 64K processors with up to 32K of memory per processor.  It used
a SIMD (single instruction multiple data) architecture which meant that
each processing node ran the same instruction at the same time.  However
there were condition flags which could be set to selectively disable
instructions on certain processors based on the result of earlier tests.
This allowed different nodes to behave differently depending on the data.)

The question remains at what point these improvements become effective.
In the first phase, the elementary step of the sieving machine is to add
two numbers together.  For the ECM machine it is to run the entire ECM
algorithm on a test value.  Both are treated as constant-time steps in
the asymptotic analysis, but obviously the ECM operation will be far
slower than the
sieving one.  Only by replacing passive sieve memory with active custom
ECM engines will the potential performance advantage of the ECM machine
be realized.

Note that the improvements to the two phases can be applied independently.
Either the speedup of the factoring or the speedup of the linear algebra
could be used while the other phase is done in the conventional way.
Because the machines for the two phases are very different in nature,
they probably have different crossover points at which the new
techniques become advantageous, so there are likely key sizes where
only one of the new techniques should be used.

A final point: the factoring phase gets a greater improvement (exponent
reduced from 3 to 2) than the linear algebra phase (exponent reduced
from 3 to 2.5).  This leads to the use of a smaller factor base than
in the conventional method.  A smaller factor base is helpful for the
linear algebra since it gives a smaller matrix, but it is harmful to
the smoothness test since fewer numbers will be smooth over the smaller
factor base.  The greater speedup of the smoothness phase accounts for
this shift in emphasis.

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at wasabisystems.com


