*OP1, MINT *OP2, MINT *RES)
     Set RES to the greatest common divisor of OP1 and OP2.

 -- Function: int mcmp (MINT *OP1, MINT *OP2)
     Compare OP1 and OP2.  Return a positive value if OP1 > OP2, zero
     if OP1 = OP2, and a negative value if OP1 < OP2.

 -- Function: void min (MINT *DEST)
     Input a decimal string from `stdin', and put the read integer in
     DEST.  SPC and TAB are allowed in the number string, and are
     ignored.

 -- Function: void mout (MINT *SRC)
     Output SRC to `stdout', as a decimal string.  Also output a
     newline.

 -- Function: char * mtox (MINT *OP)
     Convert OP to a hexadecimal string, and return a pointer to the
     string.  The returned string is allocated using the default
     memory allocation function, `malloc' by default.  It will be
     `strlen(str)+1' bytes, that being exactly enough for the string
     and null-terminator.

 -- Function: void mfree (MINT *OP)
     De-allocate the space used by OP.  *This function should only be
     passed a value returned by `itom' or `xtom'.*


File: gmp.info,  Node: Custom Allocation,  Next: Language Bindings,  Prev: BSD Compatible Functions,  Up: Top

14 Custom Allocation
********************

By default GMP uses `malloc', `realloc' and `free' for memory
allocation, and if they fail GMP prints a message to the standard
error output and terminates the program.

Alternate functions can be specified, to allocate memory in a
different way or to have a different error action on running out of
memory.  This feature is available in the Berkeley compatibility
library (*note BSD Compatible Functions::) as well as the main GMP
library.

 -- Function: void mp_set_memory_functions (
          void *(*ALLOC_FUNC_PTR) (size_t),
          void *(*REALLOC_FUNC_PTR) (void *, size_t, size_t),
          void (*FREE_FUNC_PTR) (void *, size_t))
     Replace the current allocation functions with those given by the
     arguments.  If an argument is `NULL', the corresponding default
     function is used.

     These functions will be used for all memory allocation done by
     GMP, apart from temporary space from `alloca' if that function
     is available and GMP is configured to use it (*note Build
     Options::).

     *Be sure to call `mp_set_memory_functions' only when there are
     no active GMP objects allocated using the previous memory
     functions!  Usually that means calling it before any other GMP
     function.*

The functions supplied should fit the following declarations:

 -- Function: void * allocate_function (size_t ALLOC_SIZE)
     Return a pointer to newly allocated space with at least
     ALLOC_SIZE bytes.

 -- Function: void * reallocate_function (void *PTR, size_t OLD_SIZE,
          size_t NEW_SIZE)
     Resize a previously allocated block PTR of OLD_SIZE bytes to be
     NEW_SIZE bytes.

     The block may be moved if necessary or if desired, and in that
     case the smaller of OLD_SIZE and NEW_SIZE bytes must be copied
     to the new location.  The return value is a pointer to the
     resized block, that being the new location if moved or just PTR
     if not.

     PTR is never `NULL', it's always a previously allocated block.
     NEW_SIZE may be bigger or smaller than OLD_SIZE.

 -- Function: void free_function (void *PTR, size_t SIZE)
     De-allocate the space pointed to by PTR.  PTR is never `NULL',
     it's always a previously allocated block of SIZE bytes.

A "byte" here means the unit used by the `sizeof' operator.

The OLD_SIZE parameters to REALLOCATE_FUNCTION and FREE_FUNCTION are
passed for convenience, but of course can be ignored if not needed.
The default functions using `malloc' and friends for instance don't
use them.

No error return is allowed from any of these functions; if they
return, then they must have performed the specified operation.
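For illustration, here is one way such functions might be supplied.
This is only a sketch, not part of GMP: the wrappers just check for
failure, and the chosen error action (a message to `stderr' followed
by `abort') is merely an example policy.

     /* Example custom allocation functions for GMP.  The signatures
        match the declarations above; the error handling is only
        illustrative. */
     #include <stdio.h>
     #include <stdlib.h>
     #include <gmp.h>

     static void *
     my_allocate (size_t alloc_size)
     {
       void *p = malloc (alloc_size);
       if (p == NULL)
         {
           fprintf (stderr, "cannot allocate %lu bytes\n",
                    (unsigned long) alloc_size);
           abort ();   /* no error return is allowed */
         }
       return p;
     }

     static void *
     my_reallocate (void *ptr, size_t old_size, size_t new_size)
     {
       void *p = realloc (ptr, new_size);   /* old_size not needed here */
       if (p == NULL)
         {
           fprintf (stderr, "cannot reallocate %lu bytes\n",
                    (unsigned long) new_size);
           abort ();
         }
       return p;
     }

     static void
     my_free (void *ptr, size_t size)
     {
       free (ptr);   /* size not needed by the C library free */
     }

     int
     main (void)
     {
       /* Install before creating any GMP objects, satisfying the
          requirement above that nothing is live when the functions
          change. */
       mp_set_memory_functions (my_allocate, my_reallocate, my_free);

       mpz_t x;
       mpz_init_set_ui (x, 1);
       mpz_mul_2exp (x, x, 10000);   /* allocates through my_allocate */
       mpz_clear (x);
       return 0;
     }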
In particular note that ALLOCATE_FUNCTION or REALLOCATE_FUNCTION
mustn't return `NULL'.

Getting a different fatal error action is a good use for custom
allocation functions, for example giving a graphical dialog rather
than the default print to `stderr'.  How much is possible when
genuinely out of memory is another question though.

There's currently no defined way for the allocation functions to
recover from an error such as out of memory; they must terminate
program execution.  A `longjmp' or throwing a C++ exception will have
undefined results.  This may change in the future.

GMP may use allocated blocks to hold pointers to other allocated
blocks.  This will limit the assumptions a conservative garbage
collection scheme can make.

Since the default GMP allocation uses `malloc' and friends, those
functions will be linked in even if the first thing a program does is
a call to `mp_set_memory_functions'.  It's necessary to change the
GMP sources if this is a problem.

 -- Function: void mp_get_memory_functions (
          void *(**ALLOC_FUNC_PTR) (size_t),
          void *(**REALLOC_FUNC_PTR) (void *, size_t, size_t),
          void (**FREE_FUNC_PTR) (void *, size_t))
     Get the current allocation functions, storing function pointers
     to the locations given by the arguments.  If an argument is
     `NULL', that function pointer is not stored.

     For example, to get just the current free function,

          void (*freefunc) (void *, size_t);

          mp_get_memory_functions (NULL, NULL, &freefunc);


File: gmp.info,  Node: Language Bindings,  Next: Algorithms,  Prev: Custom Allocation,  Up: Top

15 Language Bindings
********************

The following packages and projects offer access to GMP from
languages other than C, though perhaps with varying levels of
functionality and efficiency.

C++
   * GMP C++ class interface, *note C++ Class Interface::
     Straightforward interface, expression templates to eliminate
     temporaries.
   * ALP `http://www-sop.inria.fr/saga/logiciels/ALP/'
     Linear algebra and polynomials using templates.
   * Arithmos `http://www.win.ua.ac.be/~cant/arithmos/'
     Rationals with infinities and square roots.
   * CLN `http://www.ginac.de/CLN/'
     High level classes for arithmetic.
   * LiDIA `http://www.cdc.informatik.tu-darmstadt.de/TI/LiDIA/'
     A C++ library for computational number theory.
   * Linbox `http://www.linalg.org/'
     Sparse vectors and matrices.
   * NTL `http://www.shoup.net/ntl/'
     A C++ number theory library.

Fortran
   * Omni F77 `http://phase.hpcc.jp/Omni/home.html'
     Arbitrary precision floats.

Haskell
   * Glasgow Haskell Compiler `http://www.haskell.org/ghc/'

Java
   * Kaffe `http://www.kaffe.org/'
   * Kissme `http://kissme.sourceforge.net/'

Lisp
   * GNU Common Lisp `http://www.gnu.org/software/gcl/gcl.html'
   * Librep `http://librep.sourceforge.net/'
   * XEmacs (21.5.18 beta and up) `http://www.xemacs.org'
     Optional big integers, rationals and floats using GMP.

M4
   * GNU m4 betas `http://www.seindal.dk/rene/gnu/'
     Optionally provides an arbitrary precision `mpeval'.

ML
   * MLton compiler `http://mlton.org/'

Objective Caml
   * MLGMP `http://www.di.ens.fr/~monniaux/programmes.html.en'
   * Numerix `http://pauillac.inria.fr/~quercia/'
     Optionally using GMP.

Oz
   * Mozart `http://www.mozart-oz.org/'

Pascal
   * GNU Pascal Compiler `http://www.gnu-pascal.de/'
     GMP unit.
   * Numerix `http://pauillac.inria.fr/~quercia/'
     For Free Pascal, optionally using GMP.

Perl
   * GMP module, see `demos/perl' in the GMP sources (*note
     Demonstration Programs::).
   * Math::GMP `http://www.cpan.org/'
     Compatible with Math::BigInt, but not as many functions as the
     GMP module above.
   * Math::BigInt::GMP `http://www.cpan.org/'
     Plug Math::GMP into normal Math::BigInt operations.

Pike
   * mpz module in the standard distribution,
     `http://pike.ida.liu.se/'

Prolog
   * SWI Prolog `http://www.swi-prolog.org/'
     Arbitrary precision floats.

Python
   * mpz module in the standard distribution,
     `http://www.python.org/'
   * GMPY `http://gmpy.sourceforge.net/'

Scheme
   * GNU Guile (upcoming 1.8)
     `http://www.gnu.org/software/guile/guile.html'
   * RScheme `http://www.rscheme.org/'
   * STklos `http://www.stklos.org/'

Smalltalk
   * GNU Smalltalk
     `http://www.smalltalk.org/versions/GNUSmalltalk.html'

Other
   * Axiom `http://savannah.nongnu.org/projects/axiom'
     Computer algebra using GCL.
   * DrGenius `http://drgenius.seul.org/'
     Geometry system and mathematical programming language.
   * GiNaC `http://www.ginac.de/'
     C++ computer algebra using CLN.
   * GOO `http://www.googoogaga.org/'
     Dynamic object oriented language.
   * Maxima `http://www.ma.utexas.edu/users/wfs/maxima.html'
     Macsyma computer algebra using GCL.
   * Q `http://q-lang.sourceforge.net/'
     Equational programming system.
   * Regina `http://regina.sourceforge.net/'
     Topological calculator.
   * Yacas `http://www.xs4all.nl/~apinkus/yacas.html'
     Yet another computer algebra system.


File: gmp.info,  Node: Algorithms,  Next: Internals,  Prev: Language Bindings,  Up: Top

16 Algorithms
*************

This chapter is an introduction to some of the algorithms used for
various GMP operations.  The code is likely to be hard to understand
without knowing something about the algorithms.

Some GMP internals are mentioned, but applications that expect to be
compatible with future GMP releases should take care to use only the
documented functions.

* Menu:

* Multiplication Algorithms::
* Division Algorithms::
* Greatest Common Divisor Algorithms::
* Powering Algorithms::
* Root Extraction Algorithms::
* Radix Conversion Algorithms::
* Other Algorithms::
* Assembly Coding::


File: gmp.info,  Node: Multiplication Algorithms,  Next: Division Algorithms,  Prev: Algorithms,  Up: Algorithms

16.1 Multiplication
===================

NxN limb multiplications and squares are done using one of five
algorithms, as the size N increases.

     Algorithm   Threshold
     Basecase    (none)
     Karatsuba   `MUL_TOOM22_THRESHOLD'
     Toom-3      `MUL_TOOM33_THRESHOLD'
     Toom-4      `MUL_TOOM44_THRESHOLD'
     FFT         `MUL_FFT_THRESHOLD'

Similarly for squaring, with the `SQR' thresholds.

NxM multiplications of operands with different sizes above
`MUL_TOOM22_THRESHOLD' are currently done by special Toom-inspired
algorithms or directly with FFT, depending on operand size (*note
Unbalanced Multiplication::).

* Menu:

* Basecase Multiplication::
* Karatsuba Multiplication::
* Toom 3-Way Multiplication::
* Toom 4-Way Multiplication::
* FFT Multiplication::
* Other Multiplication::
* Unbalanced Multiplication::


File: gmp.info,  Node: Basecase Multiplication,  Next: Karatsuba Multiplication,  Prev: Multiplication Algorithms,  Up: Multiplication Algorithms

16.1.1 Basecase Multiplication
------------------------------

Basecase NxM multiplication is a straightforward rectangular set of
cross-products, the same as long multiplication done by hand and for
that reason sometimes known as the schoolbook or grammar school
method.  This is an O(N*M) algorithm.  See Knuth section 4.3.1
algorithm M (*note References::), and the
`mpn/generic/mul_basecase.c' code.

Assembly implementations of `mpn_mul_basecase' are essentially the
same as the generic C code, but have all the usual assembly tricks
and obscurities introduced for speed.
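As an illustration (a sketch, not the GMP code), the rectangular set
of cross-products might be written as follows in C, assuming a 64-bit
limb and a compiler providing `unsigned __int128' for the two-limb
products.

     /* Schoolbook NxM limb multiplication, least significant limb
        first: rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].
        rp must not overlap the inputs. */
     #include <assert.h>
     #include <stdint.h>

     typedef uint64_t limb;
     typedef unsigned __int128 dlimb;   /* holds a 1x1 limb product */

     static void
     mul_basecase (limb *rp, const limb *up, int un,
                   const limb *vp, int vn)
     {
       int i, j;
       for (i = 0; i < un + vn; i++)
         rp[i] = 0;
       for (i = 0; i < un; i++)
         {
           limb carry = 0;
           for (j = 0; j < vn; j++)
             {
               /* one cross-product, plus the limb already at this
                  position, plus the incoming carry; the total always
                  fits in two limbs */
               dlimb t = (dlimb) up[i] * vp[j] + rp[i + j] + carry;
               rp[i + j] = (limb) t;
               carry = (limb) (t >> 64);
             }
           rp[i + vn] = carry;
         }
     }

     int
     main (void)
     {
       /* (B+2)*(B+3) = B^2 + 5*B + 6, where B = 2^64 */
       limb u[2] = { 2, 1 }, v[2] = { 3, 1 }, r[4];
       mul_basecase (r, u, 2, v, 2);
       assert (r[0] == 6 && r[1] == 5 && r[2] == 1 && r[3] == 0);
       return 0;
     }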
A square can be done in roughly half the time of a multiply, by using
the fact that the cross products above and below the diagonal are the
same.  A triangle of products below the diagonal is formed, doubled
(left shift by one bit), and then the products on the diagonal added.
This can be seen in `mpn/generic/sqr_basecase.c'.  Again the assembly
implementations take essentially the same approach.

          u0  u1  u2  u3  u4
        +---+---+---+---+---+
     u0 | d |   |   |   |   |
        +---+---+---+---+---+
     u1 |   | d |   |   |   |
        +---+---+---+---+---+
     u2 |   |   | d |   |   |
        +---+---+---+---+---+
     u3 |   |   |   | d |   |
        +---+---+---+---+---+
     u4 |   |   |   |   | d |
        +---+---+---+---+---+

In practice squaring isn't a full 2x faster than multiplying; it's
usually around 1.5x.  Less than 1.5x probably indicates
`mpn_sqr_basecase' wants improving on that CPU.

On some CPUs `mpn_mul_basecase' can be faster than the generic C
`mpn_sqr_basecase' on some small sizes.  `SQR_BASECASE_THRESHOLD' is
the size at which to use `mpn_sqr_basecase'; this will be zero if
that routine should be used always.


File: gmp.info,  Node: Karatsuba Multiplication,  Next: Toom 3-Way Multiplication,  Prev: Basecase Multiplication,  Up: Multiplication Algorithms

16.1.2 Karatsuba Multiplication
-------------------------------

The Karatsuba multiplication algorithm is described in Knuth section
4.3.3 part A, and various other textbooks.  A brief description is
given here.

The inputs x and y are treated as each split into two parts of equal
length (or the most significant part one limb shorter if N is odd).

      high              low
     +----------+----------+
     |    x1    |    x0    |
     +----------+----------+

     +----------+----------+
     |    y1    |    y0    |
     +----------+----------+

Let b be the power of 2 where the split occurs, ie. if x0 is k limbs
(y0 the same) then b=2^(k*mp_bits_per_limb).  With that x=x1*b+x0 and
y=y1*b+y0, and the following holds,

     x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0

This formula means doing only three multiplies of (N/2)x(N/2) limbs,
whereas a basecase multiply of NxN limbs is equivalent to four
multiplies of (N/2)x(N/2).  The factors (b^2+b) etc represent the
positions where the three products must be added.

      high                              low
     +--------+--------+ +--------+--------+
     |      x1*y1      | |      x0*y0      |
     +--------+--------+ +--------+--------+
               +--------+--------+
         add   |      x1*y1      |
               +--------+--------+
               +--------+--------+
         add   |      x0*y0      |
               +--------+--------+
               +--------+--------+
         sub   | (x1-x0)*(y1-y0) |
               +--------+--------+

The term (x1-x0)*(y1-y0) is best calculated as an absolute value, and
the sign used to choose to add or subtract.  Notice the sum
high(x0*y0)+low(x1*y1) occurs twice, so it's possible to do 5*k limb
additions, rather than 6*k, but in GMP extra function call overheads
outweigh the saving.

Squaring is similar to multiplying, but with x=y the formula reduces
to an equivalent with three squares,

     x^2 = (b^2+b)*x1^2 - b*(x1-x0)^2 + (b+1)*x0^2

The final result is accumulated from those three squares the same way
as for the three multiplies above.  The middle term (x1-x0)^2 is now
always positive.

A similar formula for both multiplying and squaring can be
constructed with a middle term (x1+x0)*(y1+y0).  But those sums can
exceed k limbs, leading to more carry handling and additions than the
form above.

Karatsuba multiplication is asymptotically an O(N^1.585) algorithm,
the exponent being log(3)/log(2), representing 3 multiplies each 1/2
the size of the inputs.  This is a big improvement over the basecase
multiply at O(N^2) and the advantage soon overcomes the extra
additions Karatsuba performs.
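The identity can be checked numerically.  The sketch below (not GMP
code) does one level of the split on 64-bit operands with b=2^32,
handling the middle term as an absolute value plus a sign as
described, and assumes `unsigned __int128'.

     /* One level of the Karatsuba identity on 64-bit operands,
        with b = 2^32: three multiplies instead of four. */
     #include <assert.h>
     #include <stdint.h>

     typedef unsigned __int128 u128;

     int
     main (void)
     {
       uint64_t x = 0x123456789abcdef0ull, y = 0xfedcba9876543210ull;
       uint64_t x1 = x >> 32, x0 = (uint32_t) x;
       uint64_t y1 = y >> 32, y0 = (uint32_t) y;

       u128 p_hi = (u128) x1 * y1;    /* x1*y1 */
       u128 p_lo = (u128) x0 * y0;    /* x0*y0 */

       /* middle term as absolute value, sign tracked separately */
       uint64_t dx = x1 >= x0 ? x1 - x0 : x0 - x1;
       uint64_t dy = y1 >= y0 ? y1 - y0 : y0 - y1;
       int neg = (x1 >= x0) != (y1 >= y0);   /* sign of (x1-x0)*(y1-y0) */
       u128 p_mid = (u128) dx * dy;

       /* x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0 */
       u128 r = (p_hi << 64) + (p_hi << 32) + (p_lo << 32) + p_lo;
       r = neg ? r + (p_mid << 32) : r - (p_mid << 32);

       assert (r == (u128) x * y);
       return 0;
     }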
`MUL_TOOM22_THRESHOLD' can be as little as 10 limbs.  The `SQR'
threshold is usually about twice the `MUL'.

The basecase algorithm will take a time of the form M(N) = a*N^2 +
b*N + c and the Karatsuba algorithm K(N) = 3*M(N/2) + d*N + e, which
expands to K(N) = 3/4*a*N^2 + 3/2*b*N + 3*c + d*N + e.  The factor
3/4 for a means per-crossproduct speedups in the basecase code will
increase the threshold since they benefit M(N) more than K(N).  And
conversely the 3/2 for b means linear style speedups of b will
increase the threshold since they benefit K(N) more than M(N).  The
latter can be seen for instance when adding an optimized
`mpn_sqr_diagonal' to `mpn_sqr_basecase'.  Of course all speedups
reduce total time, and in that sense the algorithm thresholds are
merely of academic interest.


File: gmp.info,  Node: Toom 3-Way Multiplication,  Next: Toom 4-Way Multiplication,  Prev: Karatsuba Multiplication,  Up: Multiplication Algorithms

16.1.3 Toom 3-Way Multiplication
--------------------------------

The Karatsuba formula is the simplest case of a general approach to
splitting inputs that leads to both Toom and FFT algorithms.  A
description of Toom can be found in Knuth section 4.3.3, with an
example 3-way calculation after Theorem A.  The 3-way form used in
GMP is described here.

The operands are each considered split into 3 pieces of equal length
(or the most significant part 1 or 2 limbs shorter than the other
two).

      high                          low
     +----------+----------+----------+
     |    x2    |    x1    |    x0    |
     +----------+----------+----------+

     +----------+----------+----------+
     |    y2    |    y1    |    y0    |
     +----------+----------+----------+

These parts are treated as the coefficients of two polynomials

     X(t) = x2*t^2 + x1*t + x0
     Y(t) = y2*t^2 + y1*t + y0

Let b equal the power of 2 which is the size of the x0, x1, y0 and y1
pieces, ie. if they're k limbs each then b=2^(k*mp_bits_per_limb).
With this x=X(b) and y=Y(b).

Let a polynomial W(t)=X(t)*Y(t) and suppose its coefficients are

     W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0

The w[i] are going to be determined, and when they are they'll give
the final result using w=W(b), since x*y=X(b)*Y(b)=W(b).  The
coefficients will be roughly b^2 each, and the final W(b) will be an
addition like,

      high                                        low
     +-------+-------+
     |       w4      |
     +-------+-------+
            +--------+-------+
            |       w3       |
            +--------+-------+
                    +--------+-------+
                    |       w2       |
                    +--------+-------+
                            +--------+-------+
                            |       w1       |
                            +--------+-------+
                                     +-------+-------+
                                     |       w0      |
                                     +-------+-------+

The w[i] coefficients could be formed by a simple set of cross
products, like w4=x2*y2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc,
but this would need all nine x[i]*y[j] for i,j=0,1,2, and would be
equivalent merely to a basecase multiply.  Instead the following
approach is used.

X(t) and Y(t) are evaluated and multiplied at 5 points, giving values
of W(t) at those points.  In GMP the following points are used,

     Point     Value
     t=0       x0 * y0, which gives w0 immediately
     t=1       (x2+x1+x0) * (y2+y1+y0)
     t=-1      (x2-x1+x0) * (y2-y1+y0)
     t=2       (4*x2+2*x1+x0) * (4*y2+2*y1+y0)
     t=inf     x2 * y2, which gives w4 immediately

At t=-1 the values can be negative and that's handled using the
absolute values and tracking the sign separately.  At t=inf the value
is actually X(t)*Y(t)/t^4 in the limit as t approaches infinity, but
it's much easier to think of as simply x2*y2 giving w4 immediately
(much like x0*y0 at t=0 gives w0 immediately).

Each of the points substituted into W(t)=w4*t^4+...+w0 gives a linear
combination of the w[i] coefficients, and the value of those
combinations has just been calculated.

     W(0)   =                               w0
     W(1)   =    w4 +   w3 +   w2 +   w1 +  w0
     W(-1)  =    w4 -   w3 +   w2 -   w1 +  w0
     W(2)   = 16*w4 + 8*w3 + 4*w2 + 2*w1 +  w0
     W(inf) =    w4

This is a set of five equations in five unknowns, and some elementary
linear algebra quickly isolates each w[i].  This involves adding or
subtracting one W(t) value from another, and a couple of divisions by
powers of 2 and one division by 3, the latter using the special
`mpn_divexact_by3' (*note Exact Division::).

The conversion of W(t) values to the coefficients is interpolation.
A polynomial of degree 4 like W(t) is uniquely determined by values
known at 5 different points.  The points are arbitrary and can be
chosen to make the linear equations come out with a convenient set of
steps for quickly isolating the w[i].
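As a numeric illustration of the evaluate and interpolate stages, the
following sketch works with small single-word "pieces" instead of
k-limb numbers.  The particular sequence of interpolation steps is
just one convenient possibility, not necessarily the exact sequence
GMP uses; note the divisions by 2 are exact and there is a single
exact division by 3.

     /* Toom-3 on small integer "pieces": evaluate X(t)*Y(t) at the
        five points above, interpolate w0..w4, and compare with the
        direct cross-products. */
     #include <assert.h>
     #include <stdint.h>

     int
     main (void)
     {
       int64_t x0 = 5, x1 = 7, x2 = 3;   /* X(t) = 3t^2 + 7t + 5 */
       int64_t y0 = 2, y1 = 8, y2 = 4;   /* Y(t) = 4t^2 + 8t + 2 */

       int64_t W0   = x0 * y0;                                  /* t=0   */
       int64_t W1   = (x2 + x1 + x0) * (y2 + y1 + y0);          /* t=1   */
       int64_t Wm1  = (x2 - x1 + x0) * (y2 - y1 + y0);          /* t=-1  */
       int64_t W2   = (4*x2 + 2*x1 + x0) * (4*y2 + 2*y1 + y0);  /* t=2   */
       int64_t Winf = x2 * y2;                                  /* t=inf */

       int64_t w0 = W0;
       int64_t w4 = Winf;
       int64_t w2 = (W1 + Wm1) / 2 - w0 - w4;       /* exact /2 */
       int64_t t  = (W2 - w0 - 4*w2 - 16*w4) / 2;   /* = 4*w3 + w1 */
       int64_t u  = (W1 - Wm1) / 2;                 /* = w3 + w1 */
       int64_t w3 = (t - u) / 3;                    /* the /3 step */
       int64_t w1 = u - w3;

       assert (w0 == x0*y0);
       assert (w1 == x1*y0 + x0*y1);
       assert (w2 == x2*y0 + x1*y1 + x0*y2);
       assert (w3 == x2*y1 + x1*y2);
       assert (w4 == x2*y2);
       return 0;
     }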
Squaring follows the same procedure as multiplication, but there's
only one X(t) and it's evaluated at the 5 points, and those values
squared to give values of W(t).  The interpolation is then identical,
and in fact the same `toom3_interpolate' subroutine is used for both
squaring and multiplying.

Toom-3 is asymptotically O(N^1.465), the exponent being
log(5)/log(3), representing 5 recursive multiplies of 1/3 the
original size each.  This is an improvement over Karatsuba at
O(N^1.585), though Toom does more work in the evaluation and
interpolation and so it only realizes its advantage above a certain
size.

Near the crossover between Toom-3 and Karatsuba there's generally a
range of sizes where the difference between the two is small.
`MUL_TOOM33_THRESHOLD' is a somewhat arbitrary point in that range
and successive runs of the tune program can give different values due
to small variations in measuring.  A graph of time versus size for
the two shows the effect, see `tune/README'.

At the fairly small sizes where the Toom-3 thresholds occur it's
worth remembering that the asymptotic behaviour for Karatsuba and
Toom-3 can't be expected to make accurate predictions, due of course
to the big influence of all sorts of overheads, and the fact that
only a few recursions of each are being performed.  Even at large
sizes there's a good chance machine dependent effects like cache
architecture will mean actual performance deviates from what might be
predicted.

The formula given for the Karatsuba algorithm (*note Karatsuba
Multiplication::) has an equivalent for Toom-3 involving only five
multiplies, but this would be complicated and unenlightening.

An alternate view of Toom-3 can be found in Zuras (*note
References::), using a vector to represent the x and y splits and a
matrix multiplication for the evaluation and interpolation stages.
The matrix inverses are not meant to be actually used, and they have
elements with values much greater than in fact arise in the
interpolation steps.  The diagram shown for the 3-way is attractive,
but again doesn't have to be implemented that way and for example
with a bit of rearrangement just one division by 6 can be done.


File: gmp.info,  Node: Toom 4-Way Multiplication,  Next: FFT Multiplication,  Prev: Toom 3-Way Multiplication,  Up: Multiplication Algorithms

16.1.4 Toom 4-Way Multiplication
--------------------------------

Karatsuba and Toom-3 split the operands into 2 and 3 coefficients,
respectively.  Toom-4 analogously splits the operands into 4
coefficients.  Using the notation from the section on Toom-3
multiplication, we form two polynomials:

     X(t) = x3*t^3 + x2*t^2 + x1*t + x0
     Y(t) = y3*t^3 + y2*t^2 + y1*t + y0

X(t) and Y(t) are evaluated and multiplied at 7 points, giving values
of W(t) at those points.
In GMP the following points are used,

     Point     Value
     t=0       x0 * y0, which gives w0 immediately
     t=1/2     (x3+2*x2+4*x1+8*x0) * (y3+2*y2+4*y1+8*y0)
     t=-1/2    (-x3+2*x2-4*x1+8*x0) * (-y3+2*y2-4*y1+8*y0)
     t=1       (x3+x2+x1+x0) * (y3+y2+y1+y0)
     t=-1      (-x3+x2-x1+x0) * (-y3+y2-y1+y0)
     t=2       (8*x3+4*x2+2*x1+x0) * (8*y3+4*y2+2*y1+y0)
     t=inf     x3 * y3, which gives w6 immediately

The number of additions and subtractions for Toom-4 is much larger
than for Toom-3.  But several subexpressions occur multiple times;
for example x2+x0 occurs for both t=1 and t=-1.

Toom-4 is asymptotically O(N^1.404), the exponent being
log(7)/log(4), representing 7 recursive multiplies of 1/4 the
original size each.


File: gmp.info,  Node: FFT Multiplication,  Next: Other Multiplication,  Prev: Toom 4-Way Multiplication,  Up: Multiplication Algorithms

16.1.5 FFT Multiplication
-------------------------

At large to very large sizes a Fermat style FFT multiplication is
used, following Scho"nhage and Strassen (*note References::).
Descriptions of FFTs in various forms can be found in many textbooks,
for instance Knuth section 4.3.3 part C or Lipson chapter IX.  A
brief description of the form used in GMP is given here.

The multiplication done is x*y mod 2^N+1, for a given N.  A full
product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x
and y with high zero limbs.  The modular product is the native form
for the algorithm, so padding to get a full product is unavoidable.

The algorithm follows a split, evaluate, pointwise multiply,
interpolate and combine similar to that described above for Karatsuba
and Toom-3.  A k parameter controls the split, with an FFT-k
splitting into 2^k pieces of M=N/2^k bits each.  N must be a multiple
of (2^k)*mp_bits_per_limb so the split falls on limb boundaries,
avoiding bit shifts in the split and combine stages.

The evaluations, pointwise multiplications, and interpolation are all
done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of 2^k
and of `mp_bits_per_limb'.  The results of interpolation will be the
following negacyclic convolution of the input pieces, and the choice
of N' ensures these sums aren't truncated.

             ---
             \         b
     w[n] =  /     (-1) * x[i]*y[j]
             ---
         i+j==b*2^k+n
            b=0,1

The points used for the evaluation are g^i for i=0 to 2^k-1 where
g=2^(2N'/2^k).  g is a 2^k'th root of unity mod 2^N'+1, which
produces necessary cancellations at the interpolation stage, and it's
also a power of 2 so the fast Fourier transforms used for the
evaluation and interpolation do only shifts, adds and negations.

The pointwise multiplications are done modulo 2^N'+1 and either
recurse into a further FFT or use a plain multiplication (Toom-3,
Karatsuba or basecase), whichever is optimal at the size N'.  The
interpolation is an inverse fast Fourier transform.  The resulting
set of sums of x[i]*y[j] are added at appropriate offsets to give the
final result.

Squaring is the same, but x is the only input so it's one transform
at the evaluate stage and the pointwise multiplies are squares.  The
interpolation is the same.

For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm, the
exponent representing 2^k recursed modular multiplies each 1/2^(k-1)
the size of the original.  Each successive k is an asymptotic
improvement, but overheads mean each is only faster at bigger and
bigger sizes.  In the code, `MUL_FFT_TABLE' and `SQR_FFT_TABLE' are
the thresholds where each k is used.  Each new k effectively swaps
some multiplying for some shifts, adds and overheads.
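The negacyclic identity can be checked directly at toy sizes.  The
sketch below is not the GMP FFT; it simply computes the convolution
from its definition, for a made-up k=2 split (4 pieces of M=8 bits,
N=32), and verifies that the recombined w[n] give x*y mod 2^N+1.

     /* Check of the negacyclic convolution for a toy k=2 split:
        N=32 bits, 4 pieces of M=8 bits, all sums in 64-bit ints. */
     #include <assert.h>
     #include <stdint.h>

     int
     main (void)
     {
       const int k = 2, pieces = 1 << k;                 /* 2^k = 4 */
       const int M = 8;                                  /* N / 2^k */
       const int64_t modulus = ((int64_t) 1 << 32) + 1;  /* 2^N+1 */

       uint32_t x = 0x12345678, y = 0x9abcdef0;
       int64_t xp[4], yp[4], w[4];
       int i, j, n;
       for (i = 0; i < pieces; i++)
         {
           xp[i] = (x >> (M * i)) & 0xff;
           yp[i] = (y >> (M * i)) & 0xff;
           w[i] = 0;
         }

       /* w[n] = sum of (-1)^b * x[i]*y[j] over i+j == b*2^k + n */
       for (i = 0; i < pieces; i++)
         for (j = 0; j < pieces; j++)
           {
             int b = (i + j) / pieces;   /* 0 or 1 */
             n = (i + j) % pieces;
             w[n] += (b ? -1 : 1) * xp[i] * yp[j];
           }

       /* the sum of w[n]*2^(M*n) should equal x*y mod 2^N+1 */
       int64_t sum = 0;
       for (n = 0; n < pieces; n++)
         sum += w[n] * ((int64_t) 1 << (M * n));
       int64_t lhs = ((sum % modulus) + modulus) % modulus;
       int64_t rhs = (int64_t) (((uint64_t) x * y) % (uint64_t) modulus);
       assert (lhs == rhs);
       return 0;
     }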
A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply
plus a subtraction, so an FFT and Toom-3 etc can be compared
directly.  A k=4 FFT at O(N^1.333) can be expected to be the first
faster than Toom-3 at O(N^1.465).  In practice this is what's found,
with `MUL_FFT_MODF_THRESHOLD' and `SQR_FFT_MODF_THRESHOLD' being
between 300 and 1000 limbs, depending on the CPU.  So far it's been
found that only very large FFTs recurse into pointwise multiplies
above these sizes.

When an FFT is to give a full product, the change of N to 2N doesn't
alter the theoretical complexity for a given k, but for the purposes
of considering where an FFT might be first used it can be assumed
that the FFT is recursing into a normal multiply and that on that
basis it's doing 2^k recursed multiplies each 1/2^(k-2) the size of
the inputs, making it O(N^(k/(k-2))).  This would mean k=7 at
O(N^1.4) would be the first FFT faster than Toom-3.  In practice
`MUL_FFT_THRESHOLD' and `SQR_FFT_THRESHOLD' have been found to be in
the k=8 range, somewhere between 3000 and 10000 limbs.

The way N is split into 2^k pieces and then 2M+k+3 is rounded up to a
multiple of 2^k and `mp_bits_per_limb' means that when
2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits.
The +k+3 means some values of N just under such a multiple will be
rounded to the next.  The complexity calculations above assume that a
favourable size is used, meaning one which isn't padded through
rounding, and it's also assumed that the extra +k+3 bits are
negligible at typical FFT sizes.

The practical effect of the 2^(2k-1) constraint is to introduce a
step-effect into measured speeds.  For example k=8 will round N up to
a multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb
groups of sizes for which `mpn_mul_n' runs at the same speed.  Or for
k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc.  In
practice it's been found each k is used at quite small multiples of
its size constraint and so the step effect is quite noticeable in a
time versus size graph.

The threshold determinations currently measure at the mid-points of
size steps, but this is sub-optimal since at the start of a new step
it can happen that it's better to go back to the previous k for a
while.  Something more sophisticated for `MUL_FFT_TABLE' and
`SQR_FFT_TABLE' will be needed.


File: gmp.info,  Node: Other Multiplication,  Next: Unbalanced Multiplication,  Prev: FFT Multiplication,  Up: Multiplication Algorithms

16.1.6 Other Multiplication
---------------------------

The Toom algorithms described above (*note Toom 3-Way
Multiplication::, *note Toom 4-Way Multiplication::) generalize to
splitting into an arbitrary number of pieces, as per Knuth section
4.3.3 algorithm C.  This is not currently used.  The notes here are
merely for interest.

In general a split into r+1 pieces is made, and evaluations and
pointwise multiplications done at 2*r+1 points.  A 4-way split does 7
pointwise multiplies, 5-way does 9, etc.  Asymptotically an (r+1)-way
algorithm is O(N^(log(2*r+1)/log(r+1))).  Only the pointwise
multiplications count towards big-O complexity, but the time spent in
the evaluate and interpolate stages grows with r and has a
significant practical impact, with the asymptotic advantage of each r
realized only at bigger and bigger sizes.  The overheads grow as
O(N*r), whereas in an r=2^k FFT they grow only as O(N*log(r)).
Knuth algorithm C evaluates at points 0,1,2,...,2*r, but exercise 4
uses -r,...,0,...,r and the latter saves some small multiplies in the
evaluate stage (or rather trades them for additions), and has a
further saving of nearly half the interpolate steps.  The idea is to
separate odd and even final coefficients and then perform algorithm C
steps C7 and C8 on them separately.  The divisors at step C7 become
j^2 and the multipliers at C8 become 2*t*j-j^2.

Splitting odd and even parts through positive and negative points can
be thought of as using -1 as a square root of unity.  If a 4th root
of unity was available then a further split and speedup would be
possible, but no such root exists for plain integers.  Going to
complex integers with i=sqrt(-1) doesn't help, essentially because in
Cartesian form it takes three real multiplies to do a complex
multiply.  The existence of 2^k'th roots of unity in a suitable ring
or field lets the fast Fourier transform keep splitting and get to
O(N*log(r)).

Floating point FFTs use complex numbers approximating Nth roots of
unity.  Some processors have special support for such FFTs.  But
these are not used in GMP since it's very difficult to guarantee an
exact result (to some number of bits).  An occasional difference of 1
in the last bit might not matter to a typical signal processing
algorithm, but is of course of vital importance to GMP.


File: gmp.info,  Node: Unbalanced Multiplication,  Prev: Other Multiplication,  Up: Multiplication Algorithms

16.1.7 Unbalanced Multiplication
--------------------------------

Multiplication of operands with different sizes, both below
`MUL_TOOM22_THRESHOLD', is done with plain schoolbook multiplication
(*note Basecase Multiplication::).

For really large operands, we invoke FFT directly.

For operands between these sizes, we use Toom inspired algorithms
suggested by Alberto Zanoni and Marco Bodrato.  The idea is to split
the operands into polynomials of different degree.  GMP currently
splits the smaller operand into 2 coefficients, i.e., a polynomial of
degree 1, but the larger operand can be split into 2, 3, or 4
coefficients, i.e., a polynomial of degree 1 to 3.


File: gmp.info,  Node: Division Algorithms,  Next: Greatest Common Divisor Algorithms,  Prev: Multiplication Algorithms,  Up: Algorithms

16.2 Division Algorithms
========================

* Menu:

* Single Limb Division::
* Basecase Division::
* Divide and Conquer Division::
* Block-Wise Barrett Division::
* Exact Division::
* Exact Remainder::
* Small Quotient Division::


File: gmp.info,  Node: Single Limb Division,  Next: Basecase Division,  Prev: Division Algorithms,  Up: Division Algorithms

16.2.1 Single Limb Division
---------------------------

Nx1 division is implemented using repeated 2x1 divisions from high to
low, either with a hardware divide instruction or a multiplication by
inverse, whichever is best on a given CPU.

The multiply by inverse follows "Improved division by invariant
integers" by Mo"ller and Granlund (*note References::) and is
implemented as `udiv_qrnnd_preinv' in `gmp-impl.h'.  The idea is to
have a fixed-point approximation to 1/d (see `invert_limb') and then
multiply by the high limb (plus one bit) of the dividend to get a
quotient q.  With d normalized (high bit set), q is no more than 1
too small.  Subtracting q*d from the dividend gives a remainder, and
reveals whether q or q-1 is correct.  The result is a division done
with two multiplications and four or five arithmetic operations.
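A sketch of the idea in C follows, assuming a 64-bit limb and a
compiler providing `unsigned __int128'.  It uses the 2/1 division
with a precomputed reciprocal from the Mo"ller and Granlund paper;
GMP's actual `udiv_qrnnd_preinv' is organized a little differently in
detail.

     /* 2x1 division by a normalized invariant divisor using a
        precomputed fixed-point reciprocal. */
     #include <assert.h>
     #include <stdint.h>

     typedef unsigned __int128 u128;

     /* v = floor((2^128 - 1) / d) - 2^64, for d with high bit set */
     static uint64_t
     reciprocal (uint64_t d)
     {
       return (uint64_t) (~(u128) 0 / d - ((u128) 1 << 64));
     }

     /* divide u1*2^64 + u0 by d, requiring u1 < d and d normalized */
     static uint64_t
     div2by1 (uint64_t u1, uint64_t u0, uint64_t d, uint64_t v,
              uint64_t *rem)
     {
       u128 q = (u128) v * u1 + (((u128) u1 << 64) | u0);
       uint64_t q1 = (uint64_t) (q >> 64) + 1;   /* candidate quotient */
       uint64_t q0 = (uint64_t) q;
       uint64_t r = u0 - q1 * d;                 /* candidate remainder */
       if (r > q0)          /* q1 was one too big */
         {
           q1--;
           r += d;
         }
       if (r >= d)          /* unlikely: q1 was one too small */
         {
           q1++;
           r -= d;
         }
       *rem = r;
       return q1;
     }

     int
     main (void)
     {
       uint64_t d = 0x8000000000000001ull;   /* high bit set */
       uint64_t v = reciprocal (d);
       uint64_t u1 = 0x1234567890abcdefull, u0 = 0xfedcba0987654321ull;
       uint64_t r, q = div2by1 (u1, u0, d, v, &r);
       u128 u = ((u128) u1 << 64) | u0;
       assert (q == (uint64_t) (u / d));
       assert (r == (uint64_t) (u % d));
       return 0;
     }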
On CPUs with low latency multipliers this can be much faster than a hardware divide, though the cost of calculating the inverse at the start may mean it's only better on inputs bigger than say 4 or 5 limbs. When a divisor must be normalized, either for the generic C `__udiv_qrnnd_c' or the multiply by inverse, the division performed is actually a*2^k by d*2^k where a is the dividend and k is the power necessary to have the high bit of d*2^k set. The bit shifts for the dividend are usually accomplished "on the fly" meaning by extracting the appropriate bits at each step. Done this way the quotient limbs come out aligned ready to store. When only the remainder is wanted, an alternative is to take the dividend limbs unshifted and calculate r = a mod d*2^k followed by an extra final step r*2^k mod d*2^k. This can help on CPUs with poor bit shifts or few registers. The multiply by inverse can be done two limbs at a time. The calculation is basically the same, but the inverse is two limbs and the divisor treated as if padded with a low zero limb. This means more work, since the inverse will need a 2x2 multiply, but the four 1x1s to do that are independent and can therefore be done partly or wholly in parallel. Likewise for a 2x1 calculating q*d. The net effect is to process two limbs with roughly the same two multiplies worth of latency that one limb at a time gives. This extends to 3 or 4 limbs at a time, though the extra work to apply the inverse will almost certainly soon reach the limits of multiplier throughput. A similar approach in reverse can be taken to process just half a limb at a time if the divisor is only a half limb. In this case the 1x1 multiply for the inverse effectively becomes two (1/2)x1 for each limb, which can be a saving on CPUs with a fast half limb multiply, or in fact if the only multiply is a half limb, and especially if it's not pipelined.  File: gmp.info, Node: Basecase Division, Next: Divide and Conquer Division, Prev: Single Limb Division, Up: Division Algorithms 16.2.2 Basecase Division ------------------------ Basecase NxM division is like long division done by hand, but in base 2^mp_bits_per_limb. See Knuth section 4.3.1 algorithm D, and `mpn/generic/sb_divrem_mn.c'. Briefly stated, while the dividend remains larger than the divisor, a high quotient limb is formed and the Nx1 product q*d subtracted at the top end of the dividend. With a normalized divisor (most significant bit set), each quotient limb can be formed with a 2x1 division and a 1x1 multiplication plus some subtractions. The 2x1 division is by the high limb of the divisor and is done either with a hardware divide or a multiply by inverse (the same as in *Note Single Limb Division::) whichever is faster. Such a quotient is sometimes one too big, requiring an addback of the divisor, but that happens rarely. With Q=N-M being the number of quotient limbs, this is an O(Q*M) algorithm and will run at a speed similar to a basecase QxM multiplication, differing in fact only in the extra multiply and divide for each of the Q quotient limbs.  File: gmp.info, Node: Divide and Conquer Division, Next: Block-Wise Barrett Division, Prev: Basecase Division, Up: Division Algorithms 16.2.3 Divide and Conquer Division ---------------------------------- For divisors larger than `DC_DIV_QR_THRESHOLD', division is done by dividing. Or to be precise by a recursive divide and conquer algorithm based on work by Moenck and Borodin, Jebelean, and Burnikel and Ziegler (*note References::). 
The algorithm consists essentially of recognising that a 2NxN division can be done with the basecase division algorithm (*note Basecase Division::), but using N/2 limbs as a base, not just a single limb. This way the multiplications that arise are (N/2)x(N/2) and can take advantage of Karatsuba and higher multiplication algorithms (*note Multiplication Algorithms::). The two "digits" of the quotient are formed by recursive Nx(N/2) divisions. If the (N/2)x(N/2) multiplies are done with a basecase multiplication then the work is about the same as a basecase division, but with more function call overheads and with some subtractions separated from the multiplies. These overheads mean that it's only when N/2 is above `MUL_TOOM22_THRESHOLD' that divide and conquer is of use. `DC_DIV_QR_THRESHOLD' is based on the divisor size N, so it will be somewhere above twice `MUL_TOOM22_THRESHOLD', but how much above depends on the CPU. An optimized `mpn_mul_basecase' can lower `DC_DIV_QR_THRESHOLD' a little by offering a ready-made advantage over repeated `mpn_submul_1' calls. Divide and conquer is asymptotically O(M(N)*log(N)) where M(N) is the time for an NxN multiplication done with FFTs. The actual time is a sum over multiplications of the recursed sizes, as can be seen near the end of section 2.2 of Burnikel and Ziegler. For example, within the Toom-3 range, divide and conquer is 2.63*M(N). With higher algorithms the M(N) term improves and the multiplier tends to log(N). In practice, at moderate to large sizes, a 2NxN division is about 2 to 4 times slower than an NxN multiplication.  File: gmp.info, Node: Block-Wise Barrett Division, Next: Exact Division, Prev: Divide and Conquer Division, Up: Division Algorithms 16.2.4 Block-Wise Barrett Division ---------------------------------- For the largest divisions, a block-wise Barrett division algorithm is used. Here, the divisor is inverted to a precision determined by the relative size of the dividend and divisor. Blocks of quotient limbs are then generated by multiplying blocks from the dividend by the inverse. Our block-wise algorithm computes a smaller inverse than in the plain Barrett algorithm. For a 2n/n division, the inverse will be just ceil(n/2) limbs.  File: gmp.info, Node: Exact Division, Next: Exact Remainder, Prev: Block-Wise Barrett Division, Up: Division Algorithms 16.2.5 Exact Division --------------------- A so-called exact division is when the dividend is known to be an exact multiple of the divisor. Jebelean's exact division algorithm uses this knowledge to make some significant optimizations (*note References::). The idea can be illustrated in decimal for example with 368154 divided by 543. Because the low digit of the dividend is 4, the low digit of the quotient must be 8. This is arrived at from 4*7 mod 10, using the fact 7 is the modular inverse of 3 (the low digit of the divisor), since 3*7 == 1 mod 10. So 8*543=4344 can be subtracted from the dividend leaving 363810. Notice the low digit has become zero. The procedure is repeated at the second digit, with the next quotient digit 7 (7 == 1*7 mod 10), subtracting 7*543=3801, leaving 325800. And finally at the third digit with quotient digit 6 (8*7 mod 10), subtracting 6*543=3258 leaving 0. So the quotient is 678. Notice however that the multiplies and subtractions don't need to extend past the low three digits of the dividend, since that's enough to determine the three quotient digits. For the last quotient digit no subtraction is needed at all. 
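The same scheme works in binary with a limb in place of a digit.  The
sketch below is not the GMP code (`mpn_divexact_1'); it assumes a
64-bit limb, an odd divisor, and `unsigned __int128' for the high
half of products, with the modular inverse obtained by a Newton
iteration in the style of `binvert_limb' (described below).

     /* Exact division of an n-limb number by a single odd limb,
        working low to high with a modular inverse.  The dividend
        must be an exact multiple of d. */
     #include <assert.h>
     #include <stdint.h>

     typedef uint64_t limb;
     typedef unsigned __int128 dlimb;

     /* inverse of odd d modulo 2^64, by Newton iteration */
     static limb
     binvert (limb d)
     {
       limb x = d;           /* correct to 3 bits, since d is odd */
       int i;
       for (i = 0; i < 5; i++)
         x *= 2 - d * x;     /* each step doubles the good bits */
       return x;
     }

     /* qp[0..n-1] = up[0..n-1] / d, low limb first, no remainder */
     static void
     divexact_1 (limb *qp, const limb *up, int n, limb d)
     {
       limb dinv = binvert (d);
       limb c = 0;           /* running borrow */
       int i;
       for (i = 0; i < n; i++)
         {
           limb s = up[i];
           int borrow = s < c;
           limb t = s - c;
           limb q = t * dinv;   /* q*d == t mod 2^64 */
           qp[i] = q;
           /* next borrow: high limb of q*d, plus the borrow out */
           c = (limb) (((dlimb) q * d) >> 64) + borrow;
         }
     }

     int
     main (void)
     {
       /* u = 3 * (2^64 + 12345), an exact multiple of 3 */
       limb u[2] = { 3 * 12345, 3 }, q[2];
       divexact_1 (q, u, 2, 3);
       assert (q[0] == 12345 && q[1] == 1);
       return 0;
     }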
On a 2NxN division like the decimal example above, only about half
the work of a normal basecase division is necessary.

For an NxM exact division producing Q=N-M quotient limbs, the saving
over a normal basecase division is in two parts.  Firstly, each of
the Q quotient limbs needs only one multiply, not a 2x1 divide and
multiply.  Secondly, the crossproducts are reduced when Q>M to
Q*M-M*(M+1)/2, or when Q<=M to Q*(Q-1)/2.  Notice the savings are
complementary.  If Q is big then many divisions are saved, or if Q is
small then the crossproducts reduce to a small number.

The modular inverse used is calculated efficiently by `binvert_limb'
in `gmp-impl.h'.  This does four multiplies for a 32-bit limb, or six
for a 64-bit limb.  `tune/modlinv.c' has some alternate
implementations that might suit processors better at bit twiddling
than multiplying.

The sub-quadratic exact division described by Jebelean in "Exact
Division with Karatsuba Complexity" is not currently implemented.  It
uses a rearrangement similar to the divide and conquer for normal
division (*note Divide and Conquer Division::), but operating from
low to high.  A further possibility not currently implemented is
"Bidirectional Exact Integer Division" by Krandick and Jebelean which
forms quotient limbs from both the high and low ends of the dividend,
and can halve once more the number of crossproducts needed in a 2NxN
division.

A special case exact division by 3 exists in `mpn_divexact_by3',
supporting Toom-3 multiplication and `mpq' canonicalizations.  It
forms quotient digits with a multiply by the modular inverse of 3
(which is `0xAA..AAB') and uses two comparisons to determine a borrow
for the next limb.  The multiplications don't need to be on the
dependent chain, as long as the effect of the borrows is applied,
which can help chips with pipelined multipliers.


File: gmp.info,  Node: Exact Remainder,  Next: Small Quotient Division,  Prev: Exact Division,  Up: Division Algorithms

16.2.6 Exact Remainder
----------------------

If the exact division algorithm is done with a full subtraction at
each stage and the dividend isn't a multiple of the divisor, then low
zero limbs are produced but with a remainder in the high limbs.  For
dividend a, divisor d, quotient q, and b = 2^mp_bits_per_limb, this
remainder r is of the form

     a = q*d + r*b^n

n represents the number of zero limbs produced by the subtractions,
that being the number of limbs produced for q.  r will be in the
range 0<=r<d.


File: gmp.info,  Node: Greatest Common Divisor Algorithms,  Next: Powering Algorithms,  Prev: Division Algorithms,  Up: Algorithms

16.3 Greatest Common Divisor
============================

* Menu:

* Binary GCD::
* Lehmer's Algorithm::
* Subquadratic GCD::
* Extended GCD::
* Jacobi Symbol::


File: gmp.info,  Node: Binary GCD,  Next: Lehmer's Algorithm,  Prev: Greatest Common Divisor Algorithms,  Up: Greatest Common Divisor Algorithms

16.3.1 Binary GCD
-----------------

At small sizes GMP uses an O(N^2) binary style GCD.  This is
described in many textbooks, for example Knuth section 4.5.2
algorithm B.  It simply consists of successively reducing odd
operands a and b using

     a,b = abs(a-b), min(a,b)
     strip factors of 2 from a

The Euclidean GCD algorithm, as per Knuth algorithms E and A,
repeatedly computes the quotient q = floor(a/b) and replaces a,b by
b, a - q*b.  The binary algorithm has so far been found to be faster
than the Euclidean algorithm everywhere.
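In code, a single-word form of the binary reduction looks like the
following sketch (using the GCC `__builtin_ctzll' bit scan to count
low zeros; this is an illustration, not GMP's `mpn_gcd_1').

     /* Binary GCD on single words, following the reduction above. */
     #include <assert.h>
     #include <stdint.h>

     static uint64_t
     binary_gcd (uint64_t a, uint64_t b)
     {
       int twos;
       if (a == 0) return b;
       if (b == 0) return a;
       twos = __builtin_ctzll (a | b);   /* common factors of 2 */
       a >>= __builtin_ctzll (a);        /* make a odd */
       do
         {
           b >>= __builtin_ctzll (b);    /* strip factors of 2 */
           /* a,b = min(a,b), abs(a-b); the difference is even,
              the minimum stays odd */
           if (a > b)
             {
               uint64_t t = a - b;
               a = b;
               b = t;
             }
           else
             b = b - a;
         }
       while (b != 0);
       return a << twos;
     }

     int
     main (void)
     {
       assert (binary_gcd (48, 18) == 6);
       assert (binary_gcd (0, 7) == 7);
       assert (binary_gcd (6 * 1000000007ull, 35 * 1000000007ull)
               == 1000000007ull);
       return 0;
     }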
One reason the binary method does well is that the implied quotient
at each step is usually small, so often only one or two subtractions
are needed to get the same effect as a division.  Quotients 1, 2 and
3 for example occur 67.7% of the time, see Knuth section 4.5.3
Theorem E.

When the implied quotient is large, meaning b is much smaller than a,
then a division is worthwhile.  This is the basis for the initial a
mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both
Nx1 and 1x1 cases).  But after that initial reduction, big quotients
occur too rarely to make it worth checking for them.  The final 1x1
GCD in `mpn_gcd_1' is done in the generic C code as described above.
For two N-bit operands, the algorithm takes about 0.68 iterations per
bit.

For optimum performance some attention needs to be paid to the way
the factors of 2 are stripped from a.

Firstly it may be noted that in twos complement the number of low
zero bits on a-b is the same as b-a, so counting or testing can begin
on a-b without waiting for abs(a-b) to be determined.

A loop stripping low zero bits tends not to branch predict well,
since the condition is data dependent.  But on average there's only a
few low zeros, so an option is to strip one or two bits
arithmetically then loop for more (as done for AMD K6).  Or use a
lookup table to get a count for several bits then loop for more (as
done for AMD K7).

An alternative approach is to keep just one of a or b odd and iterate

     a,b = abs(a-b), min(a,b)
     a = a/2 if even
     b = b/2 if even

This requires about 1.25 iterations per bit, but stripping of a
single bit at each step avoids any branching.  Repeating the bit
strip reduces to about 0.9 iterations per bit, which may be a
worthwhile tradeoff.

Generally with the above approaches a speed of perhaps 6 cycles per
bit can be achieved, which is still not terribly fast with for
instance a 64-bit GCD taking nearly 400 cycles.  It's this sort of
time which means it's not usually advantageous to combine a set of
divisibility tests into a GCD.

Currently, the binary algorithm is used for GCD only when N < 3.


File: gmp.info,  Node: Lehmer's Algorithm,  Next: Subquadratic GCD,  Prev: Binary GCD,  Up: Greatest Common Divisor Algorithms

16.3.2 Lehmer's algorithm
-------------------------

Lehmer's improvement of the Euclidean algorithms is based on the
observation that the initial part of the quotient sequence depends
only on the most significant parts of the inputs.  The variant of
Lehmer's algorithm used in GMP splits off the most significant two
limbs, as suggested, e.g., in "A Double-Digit Lehmer-Euclid
Algorithm" by Jebelean (*note References::).  The quotients of two
double-limb inputs are collected as a 2 by 2 matrix with single-limb
elements.  This is done by the function `mpn_hgcd2'.  The resulting
matrix is applied to the inputs using `mpn_mul_1' and `mpn_submul_1'.
Each iteration usually reduces the inputs by almost one limb.  In the
rare case of a large quotient, no progress can be made by examining
just the most significant two limbs, and the quotient is computed
using plain division.

The resulting algorithm is asymptotically O(N^2), just as the
Euclidean algorithm and the binary algorithm.  The quadratic part of
the work are the calls to `mpn_mul_1' and `mpn_submul_1'.  For small
sizes, the linear work is also significant.  There are roughly N
calls to the `mpn_hgcd2' function.  This function uses a couple of
important optimizations:

   * It uses the same relaxed notion of correctness as `mpn_hgcd'
     (see next section).
     This means that when called with the most significant two limbs
     of two large numbers, the returned matrix does not always
     correspond exactly to the initial quotient sequence for the two
     large numbers; the final quotient may sometimes be one off.

   * It takes advantage of the fact the quotients are usually small.
     The division operator is not used, since the corresponding
     assembler instruction is very slow on most architectures.  (This
     code could probably be improved further; it uses many branches
     that are unfriendly to prediction.)

   * It switches from double-limb calculations to single-limb
     calculations half-way through, when the input numbers have been
     reduced in size from two limbs to one and a half.


File: gmp.info,  Node: Subquadratic GCD,  Next: Extended GCD,  Prev: Lehmer's Algorithm,  Up: Greatest Common Divisor Algorithms

16.3.3 Subquadratic GCD
-----------------------

For inputs larger than `GCD_DC_THRESHOLD', GCD is computed via the
HGCD (Half GCD) function, as a generalization of Lehmer's algorithm.

Let the inputs a,b be of size N limbs each.  Put S = floor(N/2) + 1.
Then HGCD(a,b) returns a transformation matrix T with non-negative
elements, and reduced numbers (c;d) = T^-1 (a;b).  The reduced
numbers c,d must be larger than S limbs, while their difference
abs(c-d) must fit in S limbs.  The matrix elements will also be of
size roughly N/2.

The HGCD base case uses Lehmer's algorithm, but with the above stop
condition that returns reduced numbers and the corresponding
transformation matrix half-way through.  For inputs larger than
`HGCD_THRESHOLD', HGCD is computed recursively, using the divide and
conquer algorithm in "On Scho"nhage's algorithm and subquadratic
integer GCD computation" by Mo"ller (*note References::).  The
recursive algorithm consists of these main steps.

   * Call HGCD recursively, on the most significant N/2 limbs.  Apply
     the resulting matrix T_1 to the full numbers, reducing them to a
     size just above 3N/2.

   * Perform a small number of division or subtraction steps to
     reduce the numbers to size below 3N/2.  This is essential mainly
     for the unlikely case of large quotients.

   * Call HGCD recursively, on the most significant N/2 limbs of the
     reduced numbers.  Apply the resulting matrix T_2 to the full
     numbers, reducing them to a size just above N/2.

   * Compute T = T_1 T_2.

   * Perform a small number of division and subtraction steps to
     satisfy the requirements, and return.

GCD is then implemented as a loop around HGCD, similarly to Lehmer's
algorithm.  Where Lehmer repeatedly chops off the top two limbs,
calls `mpn_hgcd2', and applies the resulting matrix to the full
numbers, the subquadratic GCD chops off the most significant third of
the limbs (the proportion is a tuning parameter, and 1/3 seems to be
more efficient than, e.g., 1/2), calls `mpn_hgcd', and applies the
resulting matrix.  Once the input numbers are reduced to size below
`GCD_DC_THRESHOLD', Lehmer's algorithm is used for the rest of the
work.

The asymptotic running time of both HGCD and GCD is O(M(N)*log(N)),
where M(N) is the time for multiplying two N-limb numbers.


File: gmp.info,  Node: Extended GCD,  Next: Jacobi Symbol,  Prev: Subquadratic GCD,  Up: Greatest Common Divisor Algorithms

16.3.4 Extended GCD
-------------------

The extended GCD function, or GCDEXT, calculates gcd(a,b) and also
cofactors x and y satisfying a*x+b*y=gcd(a,b).  All the algorithms
used for plain GCD are extended to handle this case.  The binary
algorithm is used only for single-limb GCDEXT.
Lehmer's algorithm is used for sizes up to `GCDEXT_DC_THRESHOLD'.
Above this threshold, GCDEXT is implemented as a loop around HGCD,
but with more book-keeping to keep track of the cofactors.  This
gives the same asymptotic running time as for GCD and HGCD,
O(M(N)*log(N)).

One difference from plain GCD is that while the inputs a and b are
reduced as the algorithm proceeds, the cofactors x and y grow in
size.  This makes the tuning of the chopping-point more difficult.
The current code chops off the most significant half of the inputs
for the call to HGCD in the first iteration, and the most significant
two thirds for the remaining calls.  This strategy could surely be
improved.  Also the stop condition for the loop, where Lehmer's
algorithm is invoked once the inputs are reduced below
`GCDEXT_DC_THRESHOLD', could maybe be improved by taking into account
the current size of the cofactors.


File: gmp.info,  Node: Jacobi Symbol,  Prev: Extended GCD,  Up: Greatest Common Divisor Algorithms

16.3.5 Jacobi Symbol
--------------------

`mpz_jacobi' and `mpz_kronecker' are currently implemented with a
simple binary algorithm similar to that described for the GCDs (*note
Binary GCD::).  They're not very fast when both inputs are large.
Lehmer's multi-step improvement or a binary based multi-step
algorithm is likely to be better.

When one operand fits a single limb, and that includes
`mpz_kronecker_ui' and friends, an initial reduction is done with
either `mpn_mod_1' or `mpn_modexact_1_odd', followed by the binary
algorithm on a single limb.  The binary algorithm is well suited to a
single limb, and the whole calculation in this case is quite
efficient.

In all the routines sign changes for the result are accumulated using
some bit twiddling, avoiding table lookups or conditional jumps.

This is configure.info, produced by makeinfo version 4.8 from
./configure.texi.

INFO-DIR-SECTION GNU admin
START-INFO-DIR-ENTRY
* configure: (configure).	The GNU configure and build system
END-INFO-DIR-ENTRY

This file documents the GNU configure and build system.

Copyright (C) 1998 Cygnus Solutions.

Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.


File: configure.info,  Node: Top,  Next: Introduction,  Up: (dir)

GNU configure and build system
******************************

The GNU configure and build system.

* Menu:

* Introduction::		Introduction.
* Getting Started::		Getting Started.
* Files::			Files.
* Configuration Names::		Configuration Names.
* Cross Compilation Tools::	Cross Compilation Tools.
* Canadian Cross::		Canadian Cross.
* Cygnus Configure::		Cygnus Configure.
* Multilibs::			Multilibs.
* FAQ::				Frequently Asked Questions.
* Index::			Index.


File: configure.info,  Node: Introduction,  Next: Getting Started,  Prev: Top,  Up: Top

1 Introduction
**************

This document describes the GNU configure and build systems.  It
describes how autoconf, automake, libtool, and make fit together.
It also includes a discussion of the older Cygnus configure system.

This document does not describe in detail how to use each of the
tools; see the respective manuals for that.  Instead, it describes
which files the developer must write, which files are machine
generated and how they are generated, and where certain common
problems should be addressed.

This document draws on several sources, including the autoconf manual
by David MacKenzie (*note autoconf overview: (autoconf)Top.), the
automake manual by David MacKenzie and Tom Tromey (*note automake
overview: (automake)Top.), the libtool manual by Gordon Matzigkeit
(*note libtool overview: (libtool)Top.), and the Cygnus configure
manual by K. Richard Pixley.

* Menu:

* Goals::			Goals.
* Tools::			The tools.
* History::			History.
* Building::			Building.


File: configure.info,  Node: Goals,  Next: Tools,  Up: Introduction

1.1 Goals
=========

The GNU configure and build system has two main goals.

The first is to simplify the development of portable programs.  The
system permits the developer to concentrate on writing the program,
simplifying many details of portability across Unix and even Windows
systems, and permitting the developer to describe how to build the
program using simple rules rather than complex Makefiles.

The second is to simplify the building of programs distributed as
source code.  All programs are built using a simple, standardized,
two step process.  The program builder need not install any special
tools in order to build the program.


File: configure.info,  Node: Tools,  Next: History,  Prev: Goals,  Up: Introduction

1.2 Tools
=========

The GNU configure and build system is comprised of several different
tools.  Program developers must build and install all of these tools.
People who just want to build programs from distributed sources
normally do not need any special tools beyond a Unix shell, a make
program, and a C compiler.

autoconf
     provides a general portability framework, based on testing the
     features of the host system at build time.

automake
     a system for describing how to build a program, permitting the
     developer to write a simplified `Makefile'.

libtool
     a standardized approach to building shared libraries.

gettext
     provides a framework for translation of text messages into other
     languages; not really discussed in this document.

m4
     autoconf requires the GNU version of m4; the standard Unix m4
     does not suffice.

perl
     automake requires perl.


File: configure.info,  Node: History,  Next: Building,  Prev: Tools,  Up: Introduction

1.3 History
===========

This is a very brief and probably inaccurate history.

As the number of Unix variants increased during the 1980s, it became
harder to write programs which could run on all variants.  While it
was often possible to use `#ifdef' to identify particular systems,
developers frequently did not have access to every system, and the
characteristics of some systems changed from version to version.

By 1992, at least three different approaches had been developed:

   * The Metaconfig program, by Larry Wall, Harlan Stenn, and Raphael
     Manfredi.

   * The Cygnus configure script, by K. Richard Pixley, and the gcc
     configure script, by Richard Stallman.  These use essentially
     the same approach, and the developers communicated regularly.

   * The autoconf program, by David MacKenzie.

The Metaconfig program is still used for Perl and a few other
programs.  It is part of the Dist package.  I do not know if it is
being developed.
In 1994, David MacKenzie and others modified autoconf to incorporate all the features of Cygnus configure. Since then, there has been a slow but steady conversion of GNU programs from Cygnus configure to autoconf. gcc has been converted, eliminating the gcc configure script. GNU autoconf was regularly maintained until late 1996. As of this writing in June, 1998, it has no public maintainer. Most programs are built using the make program, which requires the developer to write Makefiles describing how to build the programs. Since most programs are built in pretty much the same way, this led to a lot of duplication. The X Window system is built using the imake tool, which uses a database of rules to eliminate the duplication. However, building a tool whic