ange Logs.
                                                              (line   6)
* character set:                         Character Set.       (line   6)
* command-line arguments, decoding:      Semantics.           (line  46)
* command-line interface:                Command-Line Interfaces.
                                                              (line   6)
* commenting:                            Comments.            (line   6)
* compatibility with C and POSIX standards: Compatibility.    (line   6)
* compiler warnings:                     Syntactic Conventions.
                                                              (line  10)
* conditional changes, and change logs:  Conditional Changes. (line   6)
* conditionals, comments for:            Comments.            (line  60)
* configure:                             Configuration.       (line   6)
* control-L:                             Formatting.          (line 118)
* conventions for makefiles:             Makefile Conventions.
                                                              (line   6)
* CORBA:                                 Graphical Interfaces.
                                                              (line  16)
* credits for manuals:                   Manual Credits.      (line   6)
* D-bus:                                 Graphical Interfaces.
                                                              (line  16)
* data types, and portability:           CPU Portability.     (line   6)
* declaration for system functions:      System Functions.    (line  21)
* DESTDIR:                               DESTDIR.             (line   6)
* documentation:                         Documentation.       (line   6)
* doschk:                                Names.               (line  38)
* downloading this manual:               Preface.             (line  14)
* encodings:                             Character Set.       (line   6)
* error messages:                        Semantics.           (line  19)
* error messages, formatting:            Errors.              (line   6)
* exec_prefix:                           Directory Variables. (line  36)
* expressions, splitting:                Formatting.          (line  81)
* FDL, GNU Free Documentation License:   GNU Free Documentation License.
                                                              (line   6)
* file usage:                            File Usage.          (line   6)
* file-name limitations:                 Names.               (line  38)
* formatting error messages:             Errors.              (line   6)
* formatting source code:                Formatting.          (line   6)
* formfeed:                              Formatting.          (line 118)
* function argument, declaring:          Syntactic Conventions.
                                                              (line   6)
* function prototypes:                   Standard C.          (line  17)
* getopt:                                Command-Line Interfaces.
                                                              (line   6)
* gettext:                               Internationalization.
                                                              (line   6)
* GNOME:                                 Graphical Interfaces.
                                                              (line  16)
* GNOME and Guile:                       Source Language.     (line  38)
* gnustandards project repository:       Preface.             (line  30)
* gnustandards-commit@gnu.org mailing list: Preface.          (line  24)
* graphical user interface:              Graphical Interfaces.
                                                              (line   6)
* grave accent:                          Quote Characters.    (line   6)
* GTK+:                                  Graphical Interfaces.
                                                              (line   6)
* Guile:                                 Source Language.     (line  38)
* implicit int:                          Syntactic Conventions.
                                                              (line   6)
* impossible conditions:                 Semantics.           (line  70)
* installations, staged:                 DESTDIR.             (line   6)
* interface styles:                      Graphical Interfaces.
                                                              (line   6)
* internationalization:                  Internationalization.
                                                              (line   6)
* keyboard interface:                    Graphical Interfaces.
                                                              (line  16)
* LDAP:                                  OID Allocations.     (line   6)
* left quote:                            Quote Characters.    (line   6)
* legal aspects:                         Legal Issues.        (line   6)
* legal papers:                          Contributions.       (line   6)
* libexecdir:                            Directory Variables. (line  67)
* libraries:                             Libraries.           (line   6)
* library functions, and portability:    System Functions.    (line   6)
* library interface:                     Graphical Interfaces.
                                                              (line  16)
* license for manuals:                   License for Manuals. (line   6)
* lint:                                  Syntactic Conventions.
                                                              (line 109)
* locale-specific quote characters:      Quote Characters.    (line   6)
* long option names:                     Option Table.        (line   6)
* long-named options:                    Command-Line Interfaces.
                                                              (line  12)
* makefile, conventions for:             Makefile Conventions.
                                                              (line   6)
* malloc return value:                   Semantics.           (line  25)
* man pages:                             Man Pages.           (line   6)
* manual structure:                      Manual Structure Details.
                                                              (line   6)
* memory allocation failure:             Semantics.           (line  25)
* memory usage:                          Memory Usage.        (line   6)
* message text, and internationalization: Internationalization.
                                                              (line  29)
* mmap:                                  Mmap.                (line   6)
* multiple variables in a line:          Syntactic Conventions.
                                                              (line  35)
* names of variables, functions, and files: Names.            (line   6)
* NEWS file:                             NEWS File.           (line   6)
* non-ASCII characters:                  Character Set.       (line   6)
* non-POSIX systems, and portability:    System Portability.  (line  32)
* non-standard extensions:               Using Extensions.    (line   6)
* NUL characters:                        Semantics.           (line  11)
* OID allocations for GNU:               OID Allocations.     (line   6)
* open brace:                            Formatting.          (line   6)
* optional features, configure-time:     Configuration.       (line 100)
* options for compatibility:             Compatibility.       (line  14)
* options, standard command-line:        Command-Line Interfaces.
                                                              (line  31)
* output device and program's behavior:  User Interfaces.     (line  13)
* packaging:                             Releases.            (line   6)
* PATH_INFO, specifying standard options as: Command-Line Interfaces.
                                                              (line  31)
* portability, and data types:           CPU Portability.     (line   6)
* portability, and library functions:    System Functions.    (line   6)
* portability, between system types:     System Portability.  (line   6)
* POSIX compatibility:                   Compatibility.       (line   6)
* POSIXLY_CORRECT, environment variable: Compatibility.       (line  21)
* post-installation commands:            Install Command Categories.
                                                              (line   6)
* pre-installation commands:             Install Command Categories.
                                                              (line   6)
* prefix:                                Directory Variables. (line  26)
* program configuration:                 Configuration.       (line   6)
* program design:                        Design Advice.       (line   6)
* program name and its behavior:         User Interfaces.     (line   6)
* program's canonical name:              --version.           (line  12)
* programming languages:                 Source Language.     (line   6)
* proprietary programs:                  Reading Non-Free Code.
                                                              (line   6)
* quote characters:                      Quote Characters.    (line   6)
* README file:                           Releases.            (line  21)
* references to non-free material:       References.          (line   6)
* releasing:                             Managing Releases.   (line   6)
* Savannah repository for gnustandards:  Preface.             (line  30)
* sbindir:                               Directory Variables. (line  60)
* signal handling:                       Semantics.           (line  59)
* SNMP:                                  OID Allocations.     (line   6)
* spaces before open-paren:              Formatting.          (line  75)
* staged installs:                       DESTDIR.             (line   6)
* standard command-line options:         Command-Line Interfaces.
                                                              (line  31)
* standards for makefiles:               Makefile Conventions.
                                                              (line   6)
* string library functions:              System Functions.    (line  55)
* syntactic conventions:                 Syntactic Conventions.
                                                              (line   6)
* table of long options:                 Option Table.        (line   6)
* temporary files:                       Semantics.           (line  84)
* temporary variables:                   Syntactic Conventions.
                                                              (line  23)
* texinfo.tex, in a distribution:        Releases.            (line  70)
* TMPDIR environment variable:           Semantics.           (line  84)
* trademarks:                            Trademarks.          (line   6)
* user interface styles:                 Graphical Interfaces.
                                                              (line   6)
* where to obtain standards.texi:        Preface.             (line  14)
* X.509:                                 OID Allocations.     (line   6)



Tag Table:
Node: Top814
Node: Preface2089
Node: Legal Issues4802
Node: Reading Non-Free Code5272
Node: Contributions7002
Node: Trademarks9240
Node: Design Advice10875
Node: Source Language11467
Node: Compatibility13593
Node: Using Extensions15221
Node: Standard C16797
Node: Conditional Compilation19200
Node: Program Behavior20598
Node: Non-GNU Standards21714
Node: Semantics23995
Node: Libraries28715
Node: Errors29960
Node: User Interfaces32453
Node: Graphical Interfaces34058
Node: Command-Line Interfaces35242
Node: --version37274
Node: --help43011
Node: Option Table43884
Node: OID Allocations58839
Node: Memory Usage60636
Node: File Usage61672
Node: Writing C62422
Node: Formatting63394
Node: Comments67683
Node: Syntactic Conventions71235
Node: Names74697
Node: System Portability76909
Node: CPU Portability79800
Node: System Functions83701
Node: Internationalization88898
Node: Character Set92892
Node: Quote Characters93705
Node: Mmap95225
Node: Documentation95933
Node: GNU Manuals97039
Node: Doc Strings and Manuals102777
Node: Manual Structure Details104330
Node: License for Manuals105748
Node: Manual Credits106722
Node: Printed Manuals107115
Node: NEWS File107801
Node: Change Logs108479
Node: Change Log Concepts109233
Node: Style of Change Logs111336
Node: Simple Changes113836
Node: Conditional Changes115278
Node: Indicating the Part Changed116700
Node: Man Pages117227
Node: Reading other Manuals119433
Node: Managing Releases120224
Node: Configuration121005
Node: Makefile Conventions129670
Node: Makefile Basics130552
Node: Utilities in Makefiles133726
Node: Command Variables135871
Node: DESTDIR139093
Node: Directory Variables141242
Node: Standard Targets155735
Ref: Standard Targets-Footnote-1169250
Node: Install Command Categories169350
Node: Releases173883
Node: References177888
Node: GNU Free Documentation License183735
Node: Index208902

End Tag Table
                                                                                                                                                                                                                                   This is ../../gmp/doc/gmp.info, produced by makeinfo version 4.8 from
../../gmp/doc/gmp.texi.

   This manual describes how to install and use the GNU multiple
precision arithmetic library, version 5.0.1.

   Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 Free
Software Foundation, Inc.

   Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License, Version
1.3 or any later version published by the Free Software Foundation;
with no Invariant Sections, with the Front-Cover Texts being "A GNU
Manual", and with the Back-Cover Texts being "You have freedom to copy
and modify this GNU Manual, like GNU software".  A copy of the license
is included in *Note GNU Free Documentation License::.

INFO-DIR-SECTION GNU libraries
START-INFO-DIR-ENTRY
* gmp: (gmp).                   GNU Multiple Precision Arithmetic Library.
END-INFO-DIR-ENTRY


File: gmp.info,  Node: Powering Algorithms,  Next: Root Extraction Algorithms,  Prev: Greatest Common Divisor Algorithms,  Up: Algorithms

16.4 Powering Algorithms
========================

* Menu:

* Normal Powering Algorithm::
* Modular Powering Algorithm::


File: gmp.info,  Node: Normal Powering Algorithm,  Next: Modular Powering Algorithm,  Prev: Powering Algorithms,  Up: Powering Algorithms

16.4.1 Normal Powering
----------------------

Normal `mpz' or `mpf' powering uses a simple binary algorithm,
successively squaring and then multiplying by the base when a 1 bit is
seen in the exponent, as per Knuth section 4.6.3.  The "left to right"
variant described there is used rather than algorithm A, since it's
just as easy and can be done with somewhat less temporary memory.


File: gmp.info,  Node: Modular Powering Algorithm,  Prev: Normal Powering Algorithm,  Up: Powering Algorithms

16.4.2 Modular Powering
-----------------------

Modular powering is implemented using a 2^k-ary sliding window
algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85
(*note References::).  k is chosen according to the size of the
exponent.  Larger exponents use larger values of k, the choice being
made to minimize the average number of multiplications that must
supplement the squaring.

   The modular multiplies and squares use either a simple division or
the REDC method by Montgomery (*note References::).  REDC is a little
faster, essentially saving N single limb divisions in a fashion similar
to an exact remainder (*note Exact Remainder::).


File: gmp.info,  Node: Root Extraction Algorithms,  Next: Radix Conversion Algorithms,  Prev: Powering Algorithms,  Up: Algorithms

16.5 Root Extraction Algorithms
===============================

* Menu:

* Square Root Algorithm::
* Nth Root Algorithm::
* Perfect Square Algorithm::
* Perfect Power Algorithm::


File: gmp.info,  Node: Square Root Algorithm,  Next: Nth Root Algorithm,  Prev: Root Extraction Algorithms,  Up: Root Extraction Algorithms

16.5.1 Square Root
------------------

Square roots are taken using the "Karatsuba Square Root" algorithm by
Paul Zimmermann (*note References::).

   An input n is split into four parts of k bits each, so with b=2^k we
have n = a3*b^3 + a2*b^2 + a1*b + a0.  Part a3 must be "normalized" so
that either the high or second highest bit is set.  In GMP, k is kept
on a limb boundary and the input is left shifted (by an even number of
bits) to normalize.

   The square root of the high two parts is taken, by recursive
application of the algorithm (bottoming out in a one-limb Newton's
method),

     s1,r1 = sqrtrem (a3*b + a2)

   This is an approximation to the desired root and is extended by a
division to give s,r,

     q,u = divrem (r1*b + a1, 2*s1)
     s = s1*b + q
     r = u*b + a0 - q^2

   The normalization requirement on a3 means at this point s is either
correct or 1 too big.  r is negative in the latter case, so

     if r < 0 then
       r = r + 2*s - 1
       s = s - 1

   The algorithm is expressed in a divide and conquer form, but as
noted in the paper it can also be viewed as a discrete variant of
Newton's method, or as a variation on the schoolboy method (no longer
taught) for square roots two digits at a time.

   If the remainder r is not required then usually only a few high limbs
of r and u need to be calculated to determine whether an adjustment to
s is required.  This optimization is not currently implemented.

   In the Karatsuba multiplication range this algorithm is
O(1.5*M(N/2)), where M(n) is the time to multiply two numbers of n
limbs.  In the FFT multiplication range this grows to a bound of
O(6*M(N/2)).  In practice a factor of about 1.5 to 1.8 is found in the
Karatsuba and Toom-3 ranges, growing to 2 or 3 in the FFT range.

   The algorithm does all its calculations in integers and the resulting
`mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'.  The extended
precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.


File: gmp.info,  Node: Nth Root Algorithm,  Next: Perfect Square Algorithm,  Prev: Square Root Algorithm,  Up: Root Extraction Algorithms

16.5.2 Nth Root
---------------

Integer Nth roots are taken using Newton's method with the following
iteration, where A is the input and n is the root to be taken.

              1         A
     a[i+1] = - * ( --------- + (n-1)*a[i] )
              n     a[i]^(n-1)

   The initial approximation a[1] is generated bitwise by successively
powering a trial root with or without new 1 bits, aiming to be just
above the true root.  The iteration converges quadratically when
started from a good approximation.  When n is large more initial bits
are needed to get good convergence.  The current implementation is not
particularly well optimized.


File: gmp.info,  Node: Perfect Square Algorithm,  Next: Perfect Power Algorithm,  Prev: Nth Root Algorithm,  Up: Root Extraction Algorithms

16.5.3 Perfect Square
---------------------

A significant fraction of non-squares can be quickly identified by
checking whether the input is a quadratic residue modulo small integers.

   `mpz_perfect_square_p' first tests the input mod 256, which means
just examining the low byte.  Only 44 different values occur for
squares mod 256, so 82.8% of inputs can be immediately identified as
non-squares.

   On a 32-bit system similar tests are done mod 9, 5, 7, 13 and 17,
for a total 99.25% of inputs identified as non-squares.  On a 64-bit
system 97 is tested too, for a total 99.62%.

   These moduli are chosen because they're factors of 2^24-1 (or 2^48-1
for 64-bits), and such a remainder can be quickly taken just using
additions (see `mpn_mod_34lsub1').

   When nails are in use moduli are instead selected by the `gen-psqr.c'
program and applied with an `mpn_mod_1'.  The same 2^24-1 or 2^48-1
could be done with nails using some extra bit shifts, but this is not
currently implemented.

   In any case each modulus is applied to the `mpn_mod_34lsub1' or
`mpn_mod_1' remainder and a table lookup identifies non-squares.  By
using a "modexact" style calculation, and suitably permuted tables,
just one multiply each is required, see the code for details.  Moduli
are also combined to save operations, so long as the lookup tables
don't become too big.  `gen-psqr.c' does all the pre-calculations.

   A square root must still be taken for any value that passes these
tests, to verify it's really a square and not one of the small fraction
of non-squares that get through (ie. a pseudo-square to all the tested
bases).

   Clearly more residue tests could be done, `mpz_perfect_square_p' only
uses a compact and efficient set.  Big inputs would probably benefit
from more residue testing, small inputs might be better off with less.
The assumed distribution of squares versus non-squares in the input
would affect such considerations.


File: gmp.info,  Node: Perfect Power Algorithm,  Prev: Perfect Square Algorithm,  Up: Root Extraction Algorithms

16.5.4 Perfect Power
--------------------

Detecting perfect powers is required by some factorization algorithms.
Currently `mpz_perfect_power_p' is implemented using repeated Nth root
extractions, though naturally only prime roots need to be considered.
(*Note Nth Root Algorithm::.)

   If a prime divisor p with multiplicity e can be found, then only
roots which are divisors of e need to be considered, much reducing the
work necessary.  To this end divisibility by a set of small primes is
checked.


File: gmp.info,  Node: Radix Conversion Algorithms,  Next: Other Algorithms,  Prev: Root Extraction Algorithms,  Up: Algorithms

16.6 Radix Conversion
=====================

Radix conversions are less important than other algorithms.  A program
dominated by conversions should probably use a different data
representation.

* Menu:

* Binary to Radix::
* Radix to Binary::


File: gmp.info,  Node: Binary to Radix,  Next: Radix to Binary,  Prev: Radix Conversion Algorithms,  Up: Radix Conversion Algorithms

16.6.1 Binary to Radix
----------------------

Conversions from binary to a power-of-2 radix use a simple and fast
O(N) bit extraction algorithm.

   Conversions from binary to other radices use one of two algorithms.
Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method.
Repeated divisions by b^n are made, where b is the radix and n is the
biggest power that fits in a limb.  But instead of simply using the
remainder r from such divisions, an extra divide step is done to give a
fractional limb representing r/b^n.  The digits of r can then be
extracted using multiplications by b rather than divisions.  Special
case code is provided for decimal, allowing multiplications by 10 to
optimize to shifts and adds.

   Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
used.  For an input t, powers b^(n*2^i) of the radix are calculated,
until a power between t and sqrt(t) is reached.  t is then divided by
that largest power, giving a quotient which is the digits above that
power, and a remainder which is those below.  These two parts are in
turn divided by the second highest power, and so on recursively.  When
a piece has been divided down to less than `GET_STR_DC_THRESHOLD'
limbs, the basecase algorithm described above is used.

   The advantage of this algorithm is that big divisions can make use
of the sub-quadratic divide and conquer division (*note Divide and
Conquer Division::), and big divisions tend to have less overheads than
lots of separate single limb divisions anyway.  But in any case the
cost of calculating the powers b^(n*2^i) must first be overcome.

   `GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent
the same basic thing, the point where it becomes worth doing a big
division to cut the input in half.  `GET_STR_PRECOMPUTE_THRESHOLD'
includes the cost of calculating the radix power required, whereas
`GET_STR_DC_THRESHOLD' assumes that's already available, which is the
case when recursing.

   Since the base case produces digits from least to most significant
but they want to be stored from most to least, it's necessary to
calculate in advance how many digits there will be, or at least be sure
not to underestimate that.  For GMP the number of input bits is
multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up.
The result is either correct or one too big.

   Examining some of the high bits of the input could increase the
chance of getting the exact number of digits, but an exact result every
time would not be practical, since in general the difference between
numbers 100... and 99... is only in the last few bits and the work to
identify 99...  might well be almost as much as a full conversion.

   `mpf_get_str' doesn't currently use the algorithm described here, it
multiplies or divides by a power of b to move the radix point to the
just above the highest non-zero digit (or at worst one above that
location), then multiplies by b^n to bring out digits.  This is O(N^2)
and is certainly not optimal.

   The r/b^n scheme described above for using multiplications to bring
out digits might be useful for more than a single limb.  Some brief
experiments with it on the base case when recursing didn't give a
noticeable improvement, but perhaps that was only due                !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~    �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �                                                                                                                                                                                                                                                                                                                                      to the
implementation.  Something similar would work for the sub-quadratic
divisions too, though there would be the cost of calculating a bigger
radix power.

   Another possible improvement for the sub-quadratic part would be to
arrange for radix powers that balanced the sizes of quotient and
remainder produced, ie. the highest power would be an b^(n*k)
approximately equal to sqrt(t), not restricted to a 2^i factor.  That
ought to smooth out a graph of times against sizes, but may or may not
be a net speedup.


File: gmp.info,  Node: Radix to Binary,  Prev: Binary to Radix,  Up: Radix Conversion Algorithms

16.6.2 Radix to Binary
----------------------

*This section needs to be rewritten, it currently describes the
algorithms used before GMP 4.3.*

   Conversions from a power-of-2 radix into binary use a simple and fast
O(N) bitwise concatenation algorithm.

   Conversions from other radices use one of two algorithms.  Sizes
below `SET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method.  Groups
of n digits are converted to limbs, where n is the biggest power of the
base b which will fit in a limb, then those groups are accumulated into
the result by multiplying by b^n and adding.  This saves
multi-precision operations, as per Knuth section 4.4 part E (*note
References::).  Some special case code is provided for decimal, giving
the compiler a chance to optimize multiplications by 10.

   Above `SET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is
used.  First groups of n digits are converted into limbs.  Then adjacent
limbs are combined into limb pairs with x*b^n+y, where x and y are the
limbs.  Adjacent limb pairs are combined into quads similarly with
x*b^(2n)+y.  This continues until a single block remains, that being
the result.

   The advantage of this method is that the multiplications for each x
are big blocks, allowing Karatsuba and higher algorithms to be used.
But the cost of calculating the powers b^(n*2^i) must be overcome.
`SET_STR_PRECOMPUTE_THRESHOLD' usually ends up quite big, around 5000
digits, and on some processors much bigger still.

   `SET_STR_PRECOMPUTE_THRESHOLD' is based on the input digits (and
tuned for decimal), though it might be better based on a limb count, so
as to be independent of the base.  But that sort of count isn't used by
the base case and so would need some sort of initial calculation or
estimate.

   The main reason `SET_STR_PRECOMPUTE_THRESHOLD' is so much bigger
than the corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that
`mpn_mul_1' is much faster than `mpn_divrem_1' (often by a factor of 5,
or more).


File: gmp.info,  Node: Other Algorithms,  Next: Assembly Coding,  Prev: Radix Conversion Algorithms,  Up: Algorithms

16.7 Other Algorithms
=====================

* Menu:

* Prime Testing Algorithm::
* Factorial Algorithm::
* Binomial Coefficients Algorithm::
* Fibonacci Numbers Algorithm::
* Lucas Numbers Algorithm::
* Random Number Algorithms::


File: gmp.info,  Node: Prime Testing Algorithm,  Next: Factorial Algorithm,  Prev: Other Algorithms,  Up: Other Algorithms

16.7.1 Prime Testing
--------------------

The primality testing in `mpz_probab_prime_p' (*note Number Theoretic
Functions::) first does some trial division by small factors and then
uses the Miller-Rabin probabilistic primality testing algorithm, as
described in Knuth section 4.5.4 algorithm P (*note References::).

   For an odd input n, and with n = q*2^k+1 where q is odd, this
algorithm selects a random base x and tests whether x^q mod n is 1 or
-1, or an x^(q*2^j) mod n is 1, for 1<=j<=k.  If so then n is probably
prime, if not then n is definitely composite.

   Any prime n will pass the test, but some composites do too.  Such
composites are known as strong pseudoprimes to base x.  No n is a
strong pseudoprime to more than 1/4 of all bases (see Knuth exercise
22), hence with x chosen at random there's no more than a 1/4 chance a
"probable prime" will in fact be composite.

   In fact strong pseudoprimes are quite rare, making the test much more
powerful than this analysis would suggest, but 1/4 is all that's proven
for an arbitrary n.


File: gmp.info,  Node: Factorial Algorithm,  Next: Binomial Coefficients Algorithm,  Prev: Prime Testing Algorithm,  Up: Other Algorithms

16.7.2 Factorial
----------------

Factorials are calculated by a combination of removal of twos,
powering, and binary splitting.  The procedure can be best illustrated
with an example,

     23! = 1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23

has factors of two removed,

     23! = 2^19.1.1.3.1.5.3.7.1.9.5.11.3.13.7.15.1.17.9.19.5.21.11.23

and the resulting terms collected up according to their multiplicity,

     23! = 2^19.(3.5)^3.(7.9.11)^2.(13.15.17.19.21.23)

   Each sequence such as 13.15.17.19.21.23 is evaluated by splitting
into every second term, as for instance (13.17.21).(15.19.23), and the
same recursively on each half.  This is implemented iteratively using
some bit twiddling.

   Such splitting is more efficient than repeated Nx1 multiplies since
it forms big multiplies, allowing Karatsuba and higher algorithms to be
used.  And even below the Karatsuba threshold a big block of work can
be more efficient for the basecase algorithm.

   Splitting into subsequences of every second term keeps the resulting
products more nearly equal in size than would the simpler approach of
say taking the first half and second half of the sequence.  Nearly
equal products are more efficient for the current multiply
implementation.


File: gmp.info,  Node: Binomial Coefficients Algorithm,  Next: Fibonacci Numbers Algorithm,  Prev: Factorial Algorithm,  Up: Other Algorithms

16.7.3 Binomial Coefficients
----------------------------

Binomial coefficients C(n,k) are calculated by first arranging k <= n/2
using C(n,k) = C(n,n-k) if necessary, and then evaluating the following
product simply from i=2 to i=k.

                           k  (n-k+i)
     C(n,k) =  (n-k+1) * prod -------
                          i=2    i

   It's easy to show that each denominator i will divide the product so
far, so the exact division algorithm is used (*note Exact Division::).

   The numerators n-k+i and denominators i are first accumulated into
as many fit a limb, to save multi-precision operations, though for
`mpz_bin_ui' this applies only to the divisors, since n is an `mpz_t'
and n-k+i in general won't fit in a limb at all.


File: gmp.info,  Node: Fibonacci Numbers Algorithm,  Next: Lucas Numbers Algorithm,  Prev: Binomial Coefficients Algorithm,  Up: Other Algorithms

16.7.4 Fibonacci Numbers
------------------------

The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed for
calculating isolated F[n] or F[n],F[n-1] values efficiently.

   For small n, a table of single limb values in `__gmp_fib_table' is
used.  On a 32-bit limb this goes up to F[47], or on a 64-bit limb up
to F[93].  For convenience the table starts at F[-1].

   Beyond the table, values are generated with a binary powering
algorithm, calculating a pair F[n] and F[n-1] working from high to low
across the bits of n.  The formulas used are

     F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k
     F[2k-1] =   F[k]^2 + F[k-1]^2

     F[2k] = F[2k+1] - F[2k-1]

   At each step, k is the high b bits of n.  If the next bit of n is 0
then F[2k],F[2k-1] is used, or if it's a 1 then F[2k+1],F[2k] is used,
and the process repeated until all bits of n are incorporated.  Notice
these formulas require just two squares per bit of n.

   It'd be possible to handle the first few n above the single limb
table with simple additions, using the defining Fibonacci recurrence
F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to
be faster for only about 10 or 20 values of n, and including a block of
code for just those doesn't seem worthwhile.  If they really mattered
it'd be better to extend the data table.

   Using a table avoids lots of calculations on small numbers, and
makes small n go fast.  A bigger table would make more small n go fast,
it's just a question of balancing size against desired speed.  For GMP
the code is kept compact, with the emphasis primarily on a good
powering algorithm.

   `mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only
interested in F[n].  In this case the last step of the algorithm can
become one multiply instead of two squares.  One of the following two
formulas is used, according as n is odd or even.

     F[2k]   = F[k]*(F[k]+2F[k-1])

     F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k

   F[2k+1] here is the same as above, just rearranged to be a multiply.
For interest, the 2*(-1)^k term both here and above can be applied
just to the low limb of the calculation, without a carry or borrow into
further limbs, which saves some code size.  See comments with
`mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.


File: gmp.info,  Node: Lucas Numbers Algorithm,  Next: Random Number Algorithms,  Prev: Fibonacci Numbers Algorithm,  Up: Other Algorithms

16.7.5 Lucas Numbers
--------------------

`mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of
Fibonacci numbers with the following simple formulas.

     L[k]   =   F[k] + 2*F[k-1]
     L[k-1] = 2*F[k] -   F[k-1]

   `mpz_lucnum_ui' is only interested in L[n], and some work can be
saved.  Trailing zero bits on n can be handled with a single square
each.

     L[2k] = L[k]^2 - 2*(-1)^k

   And the lowest 1 bit can be handled with one multiply of a pair of
Fibonacci numbers, similar to what `mpz_fib_ui' does.

     L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k


File: gmp.info,  Node: Random Number Algorithms,  Prev: Lucas Numbers Algorithm,  Up: Other Algorithms

16.7.6 Random Numbers
---------------------

For the `urandomb' functions, random numbers are generated simply by
concatenating bits produced by the generator.  As long as the generator
has good randomness properties this will produce well-distributed N bit
numbers.

   For the `urandomm' functions, random numbers in a range 0<=R<N are
generated by taking values R of ceil(log2(N)) bits each until one
satisfies R<N.  This will normally require only one or two attempts,
but the attempts are limited in case the generator is somehow
degenerate and produces only 1 bits or similar.

   The Mersenne Twister generator is by Matsumoto and Nishimura (*note
References::).  It has a non-repeating period of 2^19937-1, which is a
Mersenne prime, hence the name of the generator.  The state is 624
words of 32-bits each, which is iterated with one XOR and shift for each
32-bit word generated, making the algorithm very fast.  Randomness
properties are also very good and this is the default algorithm used by
GMP.

   Linear congruential generators are described in many text books, for
instance Knuth volume 2 (*note References::).  With a modulus M and
parameters A and C, a integer state S is iterated by the formula S <-
A*S+C mod M.  At each step the new state is a linear function of the
previous, mod M, hence the name of the generator.

   In GMP only moduli of the form 2^N are supported, and the current
implementation is not as well optimized as it could be.  Overheads are
significant when N is small, and when N is large clearly the multiply
at each step will become slow.  This is not a big concern, since the
Mersenne Twister generator is better in every respect and is therefore
recommended for all normal applications.

   For both generators the current state can be deduced by observing
enough output and applying some linear algebra (over GF(2) in the case
of the Mersenne Twister).  This generally means raw output is
unsuitable for cryptographic applications without further hashing or
the like.


File: gmp.info,  Node: Assembly Coding,  Prev: Other Algorithms,  Up: Algorithms

16.8 Assembly Coding
====================

The assembly subroutines in GMP are the most significant source of
speed at small to moderate sizes.  At larger sizes algorithm selection
becomes more important, but of course speedups in low level routines
will still speed up everything proportionally.

   Carry handling and widening multiplies that are important for GMP
can't be easily expressed in C.  GCC `asm' blocks help a lot and are
provided in `longlong.h', but hand coding low level routines invariably
offers a speedup over generic C by a factor of anything from 2 to 10.

* Menu:

* Assembly Code Organisation::
* Assembly Basics::
* Assembly Carry Propagation::
* Assembly Cache Handling::
* Assembly Functional Units::
* Assembly Floating Point::
* Assembly SIMD Instructions::
* Assembly Software Pipelining::
* Assembly Loop Unrolling::
* Assembly Writing Guide::


File: gmp.info,  Node: Assembly Code Organisation,  Next: Assembly Basics,  Prev: Assembly Coding,  Up: Assembly Coding

16.8.1 Code Organisation
------------------------

The various `mpn' subdirectories contain machine-dependent code, written
in C or assembly.  The `mpn/generic' subdirectory contains default code,
used when there's no machine-specific version of a particular file.

   Each `mpn' subdirectory is for an ISA family.  Generally 32-bit and
64-bit variants in a family cannot share code and have separate
directories.  Within a family further subdirectories may exist for CPU
variants.

   In each directory a `nails' subdirectory may exist, holding code with
nails support for that CPU variant.  A `NAILS_SUPPORT' directive in each
file indicates the nails values the code handles.  Nails code only
exists where it's faster, or promises to be faster, than plain code.
There's no effort put into nails if they're not going to enhance a
given CPU.


File: gmp.info,  Node: Assembly Basics,  Next: Assembly Carry Propagation,  Prev: Assembly Code Organisation,  Up: Assembly Coding

16.8.2 Assembly Basics
----------------------

`mpn_addmul_1' and `mpn_submul_1' are the most important routines for
overall GMP performance.  All multiplications and divisions come down to
repeated calls to these.  `mpn_add_n', `mpn_sub_n', `mpn_lshift' and
`mpn_rshift' are next most important.

   On some CPUs assembly versions of the internal functions
`mpn_mul_basecase' and `mpn_sqr_basecase' give significant speedups,
mainly through avoiding function call overheads.  They can also
potentially make better use of a wide superscalar processor, as can
bigger primitives like `mpn_addmul_2' or `mpn_addmul_4'.

   The restrictions on overlaps between sources and destinations (*note
Low-level Functions::) are designed to facilitate a variety of
implementations.  For example, knowing `mpn_add_n' won't have partly
overlapping sources and destination means reading can be done far ahead
of writing on superscalar processors, and loops can be vectorized on a
vector processor, depending on the carry handling.


File: gmp.info,  Node: Assembly Carry Propagation,  Next: Assembly Cache Handling,  Prev: Assembly Basics,  Up: Assembly Coding

16.8.3 Carry Propagation
------------------------

The problem that presents most challenges in GMP is propagating carries
from one limb to the next.  In functions like `mpn_addmul_1' and
`mpn_add_n', carries are the only dependencies between limb operations.

   On processors with carry flags, a straightforward CISC style `adc' is
generally best.  AMD K6 `mpn_addmul_1' however is an example of an
unusual set of circumstances where a branch works out better.

   On RISC processors generally an add and compare for overflow is
used.  This sort of thing can be seen in `mpn/generic/aors_n.c'.  Some
carry propagation schemes require 4 instructions, meaning at least 4
cycles per limb, but other schemes may use just 1 or 2.  On wide
superscalar processors performance may be completely determined by the
number of dependent instructions between carry-in and carry-out for
each limb.

   On vector processors good use can be made of the fact that a carry
bit only very rarely propagates more than one limb.  When adding a
single bit to a limb, there's only a carry out if that limb was
`0xFF...FF' which on random data will be only 1 in 2^mp_bits_per_limb.
`mpn/cray/add_n.c' is an example of this, it adds all limbs in
parallel, adds one set of carry bits in parallel and then only rarely
needs to fall through to a loop propagating further carries.

   On the x86s, GCC (as of version 2.95.2) doesn't generate
particularly good code for the RISC style idioms that are necessary to
handle carry bits in C.  Often conditional jumps are generated where
`adc' or `sbb' forms would be better.  And so unfortunately almost any
loop involving carry bits needs to be coded in assembly for best
results.


File: gmp.info,  Node: Assembly Cache Handling,  Next: Assembly Functional Units,  Prev: Assembly Carry Propagation,  Up: Assembly Coding

16.8.4 Cache Handling
---------------------

GMP aims to perform well both on operands that fit entirely in L1 cache
and those which don't.

   Basic routines like `mpn_add_n' or `mpn_lshift' are often used on
large operands, so L2 and main memory performance is important for them.
`mpn_mul_1' and `mpn_addmul_1' are mostly used for multiply and square
basecases, so L1 performance matters most for them, unless assembly
versions of `mpn_mul_basecase' and `mpn_sqr_basecase' exist, in which
case the remaining uses are mostly for larger operands.

   For L2 or main memory operands, memory access times will almost
certainly be more than the calculation time.  The aim therefore is to
maximize memory throughput, by starting a load of the next cache line
while processing the contents of the previous one.  Clearly this is
only possible if the chip has a lock-up free cache or some sort of
prefetch instruction.  Most current chips have both these features.

   Prefetching sources combines well with loop unrolling, since a
prefetch can be initiated once per unrolled loop (or more than once if
the loop covers more than one cache line).

   On CPUs without write-allocate caches, prefetching destinations will
ensure individual stores don't go further down the cache hierarchy,
limiting bandwidth.  Of course for calculations which are slow anyway,
like `mpn_divrem_1', write-throughs might be fine.

   The distance ahead to prefetch will be determined by memory latency
versus throughput.  The aim of course is to have data arriving
continuously, at peak throughput.  Some CPUs have limits on the number
of fetches or prefetches in progress.

   If a special prefetch instruction doesn't exist then a plain load
can be used, but in that case care must be taken not to attempt to read
past the end of an operand, since that might produce a segmentation
violation.

   Some CPUs or systems have hardware that detects sequential memory
accesses and initiates suitable cache movements automatically, making
life easy.


File: gmp.info,  Node: Assembly Functional Units,  Next: Assembly Floating Point,  Prev: Assembly Cache Handling,  Up: Assembly Coding

16.8.5 Functional Units
-----------------------

When choosing an approach for an assembly loop, consideration is given
to what operations can execute simultaneously and what throughput can
thereby be achieved.  In some cases an algorithm can be tweaked to
accommodate available resources.

   Loop control will generally require a counter and pointer updates,
costing as much as 5 instructions, plus any delays a branch introduces.
CPU addressing modes might reduce pointer updates, perhaps by allowing
just one updating pointer and others expressed as offsets from it, or
on CISC chips with all addressing done with the loop counter as a
scaled index.

   The final loop control cost can be amortised by processing several
limbs in each iteration (*note Assembly Loop Unrolling::).  This at
least ensures loop control isn't a big fraction the work done.

   Memory throughput is always a limit.  If perhaps only one load or
one store can be done per cycle then 3 cycles/limb will the top speed
for "binary" operations like `mpn_add_n', and any code achieving that
is optimal.

   Integer resources can be freed up by having the loop counter in a
float register, or by pressing the float units into use for some
multiplying, perhaps doing every second limb on the float side (*note
Assembly Floating Point::).

   Float resources can be freed up by doing carry propagation on the
integer side, or even by doing integer to float conversions in integers
using bit twiddling.


File: gmp.info,  Node: Assembly Floating Point,  Next: Assembly SIMD Instructions,  Prev: Assembly Functional Units,  Up: Assembly Coding

16.8.6 Floating Point
---------------------

Floating point arithmetic is used in GMP for multiplications on CPUs
with poor integer multipliers.  It's mostly useful for `mpn_mul_1',
`mpn_addmul_1' and `mpn_submul_1' on 64-bit machines, and
`mpn_mul_basecase' on both 32-bit and 64-bit machines.

   With IEEE 53-bit double precision floats, integer multiplications
producing up to 53 bits will give exact results.  Breaking a 64x64
multiplication into eight 16x32->48 bit pieces is convenient.  With
some care though six 21x32->53 bit products can be used, if one of the
lower two 21-bit pieces also uses the sign bit.

   For the `mpn_mul_1' family of functions on a 64-bit machine, the
invariant single limb is split at the start, into 3 or 4 pieces.
Inside the loop, the bignum operand is split into 32-bit pieces.  Fast
conversion of these unsigned 32-bit pieces to floating point is highly
machine-dependent.  In some cases, reading the data into the integer
unit, zero-extending to 64-bits, then transferring to the floating
point unit back via memory is the only option.

   Converting partial products back to 64-bit limbs is usually best
done as a signed conversion.  Since all values are smaller than 2^53,
signed and unsigned are the same, but most processors lack unsigned
conversions.


   Here is a diagram showing 16x32 bit products for an `mpn_mul_1' or
`mpn_addmul_1' with a 64-bit limb.  The single limb operand V is split
into four 16-bit parts.  The multi-limb operand U is split in the loop
into two 32-bit parts.

                     +---+---+---+---+
                     |v48|v32|v16|v00|    V operand
                     +---+---+---+---+

                     +-------+---+---+
                 x   |  u32  |  u00  |    U operand (one limb)
                     +---------------+

     ---------------------------------

                         +-----------+
                         | u00 x v00 |    p00    48-bit products
                         +-----------+
                     +-----------+
                     | u00 x v16 |        p16
                     +-----------+
                 +-----------+
                 | u00 x v32 |            p32
                 +-----------+
             +-----------+
             | u00 x v48 |                p48
             +-----------+
                 +-----------+
                 | u32 x v00 |            r32
                 +-----------+
             +-----------+
             | u32 x v16 |                r48
             +-----------+
         +-----------+
         | u32 x v32 |                    r64
         +-----------+
     +-----------+
     | u32 x v48 |                        r80
     +-----------+

   p32 and r32 can be summed using floating-point addition, and
likewise p48 and r48.  p00 and p16 can be summed with r64 and r80 from
the previous iteration.

   For each loop then, four 49-bit quantities are transferred to the
integer unit, aligned as follows,

     |-----64bits----|-----64bits----|
                        +------------+
                        | p00 + r64' |    i00
                        +------------+
                    +------------+
                    | p16 + r80' |        i16
                    +------------+
                +------------+
                | p32 + r32  |            i32
                +------------+
            +------------+
            | p48 + r48  |                i48
            +------------+

   The challenge then is to sum these efficiently and add in a carry
limb, generating a low 64-bit result limb and a high 33-bit carry limb
(i48 extends 33 bits into the high half).


File: gmp.info,  Node: Assembly SIMD Instructions,  Next: Assembly Software Pipelining,  Prev: Assembly Floating Point,  Up: Assembly Coding

16.8.7 SIMD Instructions
------------------------

The single-instruction multiple-data support in current microprocessors
is aimed at signal processing algorithms where each data point can be
treated more or less independently.  There's generally not much support
for propagating the sort of carries that arise in GMP.

   SIMD multiplications of say four 16x16 bit multiplies only do as much
work as one 32x32 from GMP's point of view, and need some shifts and
adds besides.  But of course if say the SIMD form is fully pipelined
and uses less instruction decoding then it may still be worthwhile.

   On the x86 chips, MMX has so far found a use in `mpn_rshift' and
`mpn_lshift', and is used in a special case for 16-bit multipliers in
the P55 `mpn_mul_1'.  SSE2 is used for Pentium 4 `mpn_mul_1',
`mpn_addmul_1', and `mpn_submul_1'.


File: gmp.info,  Node: Assembly Software Pipelining,  Next: Assembly Loop Unrolling,  Prev: Assembly SIMD Instructions,  Up: Assembly Coding

16.8.8 Software Pipelining
--------------------------

Software pipelining consists of scheduling instructions around the
branch point in a loop.  For example a loop might issue a load not for
use in the present iteration but the next, thereby allowing extra
cycles for the data to arrive from memory.

   Naturally this is wanted only when doing things like loads or
multiplies that take several cycles to complete, and only where a CPU
has multiple functional units so that other work can be done in the
meantime.

   A pipeline with several stages will have a data value in progress at
each stage and each loop iteration moves them along one stage.  This is
like juggling.

   If the latency of some instruction is greater than the loop time
then it will be necessary to unroll, so one register has a result ready
to use while another (or multiple others) are still in progress.
(*note Assembly Loop Unrolling::).


File: gmp.info,  Node: Assembly Loop Unrolling,  Next: Assembly Writing Guide,  Prev: Assembly Software Pipelining,  Up: Assembly Coding

16.8.9 Loop Unrolling
---------------------

Loop unrolling consists of replicating code so that several limbs are
processed in each loop.  At a minimum this reduces loop overheads by a
corresponding factor, but it can also allow better register usage, for
example alternately using one register combination and then another.
Judicious use of `m4' macros can help avoid lots of duplication in the
source code.

   Any amount of unrolling can be handled with a loop counter that's
decremented by N each time, stopping when the remaining count is less
than the further N the loop will process.  Or by subtracting N at the
start, the termination condition becomes when the counter C is less
than 0 (and the count of remaining limbs is C+N).

   Alternately for a power of 2 unroll the loop count and remainder can
be established with a shift and mask.  This is convenient if also
making a computed jump into the middle of a large loop.

   The limbs not a multiple of the unrolling can be handled in various
ways, for example

   * A simple loop at the end (or the start) to process the excess.
     Care will be wanted that it isn't too much slower than the
     unrolled part.

   * A set of binary tests, for example after an 8-limb unrolling, test
     for 4 more limbs to process, then a further 2 more or not, and
     finally 1 more or not.  This will probably take more code space
     than a simple loop.

   * A `switch' statement, providing separate code for each possible
     excess, for example an 8-limb unrolling would have separate code
     for 0 remaining, 1 remaining, etc, up to 7 remaining.  This might
     take a lot of code, but may be the best way to optimize all cases
     in combination with a deep pipelined loop.

   * A computed jump into the middle of the loop, thus making the first
     iteration handle the excess.  This should make times smoothly
     increase with size, which is attractive, but setups for the jump
     and adjustments for pointers can be tricky and could become quite
     difficult in combination with deep pipelining.


File: gmp.info,  Node: Assembly Writing Guide,  Prev: Assembly Loop Unrolling,  Up: Assembly Coding

16.8.10 Writing Guide
---------------------

This is a guide to writing software pipelined loops for processing limb
vectors in assembly.

   First determine the algorithm and which instructions are needed.
Code it without unrolling or scheduling, to make sure it works.  On a
3-operand CPU try to write each new value to a new register, this will
greatly simplify later steps.

   Then note for each instruction the functional unit and/or issue port
requirements.  If an instruction can use either of two units, like U0
or U1 then make a category "U0/U1".  Count the total using each unit
(or combined unit), and count all instructions.

   Figure out from those counts the best possible loop time.  The goal
will be to find a perfect schedule where instruction latencies are
completely hidden.  The total instruction count might be the limiting
factor, or perhaps a particular functional unit.  It might be possible
to tweak the instructions to help the limiting factor.

   Suppose the loop time is N, then make N issue buckets, with the
final loop branch at the end of the last.  Now fill the buckets with
dummy instructions using the functional units desired.  Run this to
make sure the intended speed is reached.

   Now replace the dummy instructions with the real instructions from
the slow but correct loop you started with.  The first will typically
be a load instruction.  Then the instruction using that value is placed
in a bucket an appropriate distance down.  Run the loop again, to check
it still runs at target speed.

   Keep placing instructions, frequently measuring the loop.  After a
few you will need to wrap around from the last bucket back to the top
of the loop.  If you used the new-register for new-value strategy above
then there will be no register conflicts.  If not then take care not to
clobber something already in use.  Changing registers at this time is
very error prone.

   The loop will overlap two or more of the original loop iterations,
and the computation of one vector element result will be started in one
iteration of the new loop, and completed one or several iterations
later.

   The final step is to create feed-in and wind-down code for the loop.
A good way to do this is to make a copy (or copies) of the loop at the
start and delete those instructions which don't have valid antecedents,
and at the end replicate and delete those whose results are unwanted
(including any further loads).

   The loop will have a minimum number of limbs loaded and processed,
so the feed-in code must test if the request size is smaller and skip
either to a suitable part of the wind-down or to special code for small
sizes.


File: gmp.info,  Node: Internals,  Next: Contributors,  Prev: Algorithms,  Up: Top

17 Internals
************

*This chapter is provided only for informational purposes and the
various internals described here may change in future GMP releases.
Applications expecting to be compatible with future releases should use
only the documented interfaces described in previous chapters.*

* Menu:

* Integer Internals::
* Rational Internals::
* Float Internals::
* Raw Output Internals::
* C++ Interface Internals::


File: gmp.info,  Node: Integer Internals,  Next: Rational Internals,  Prev: Internals,  Up: Internals

17.1 Integer Internals
======================

`mpz_t' variables represent integers using sign and magnitude, in space
dynamically allocated and reallocated.  The fields are as follows.

`_mp_size'
     The number of limbs, or the negative of that when representing a
     negative integer.  Zero is represented by `_mp_size' set to zero,
     in which case the `_mp_d' data is unused.

`_mp_d'
     A pointer to an array of limbs which is the magnitude.  These are
     stored "little endian" as per the `mpn' functions, so `_mp_d[0]'
     is the least significant limb and `_mp_d[ABS(_mp_size)-1]' is the
     most significant.  Whenever `_mp_size' is non-zero, the most
     significant limb is non-zero.

     Currently there's always at least one limb allocated, so for
     instance `mpz_set_ui' never needs to reallocate, and `mpz_get_ui'
     can fetch `_mp_d[0]' unconditionally (though its value is then
     only wanted if `_mp_size' is non-zero).

`_mp_alloc'
     `_mp_alloc' is the number of limbs currently allocated at `_mp_d',
     and naturally `_mp_alloc >= ABS(_mp_size)'.  When an `mpz' routine
     is about to (or might be about to) increase `_mp_size', it checks
     `_mp_alloc' to see whether there's enough space, and reallocates
     if not.  `MPZ_REALLOC' is generally used for this.

   The various bitwise logical functions like `mpz_and' behave as if
negative values were twos complement.  But sign and magnitude is always
used internally, and necessary adjustments are made during the
calculations.  Sometimes this isn't pretty, but sign and magnitude are
best for other routines.

   Some internal temporary variables are setup with `MPZ_TMP_INIT' and
these have `_mp_d' space obtained from `TMP_ALLOC' rather than the
memory allocation functions.  Care is taken to ensure that these are
big enough that no reallocation is necessary (since it would have
unpredictable consequences).

   `_mp_size' and `_mp_alloc' are `int', although `mp_size_t' is
usually a `long'.  This is done to make the fields just 32 bits on some
64 bits systems, thereby saving a few bytes of data space but still
providing plenty of range.


File: gmp.info,  Node: Rational Internals,  Next: Float Internals,  Prev: Integer Internals,  Up: Internals

17.2 Rational Internals
=======================

`mpq_t' variables represent rationals using an `mpz_t' numerator and
denominator (*note Integer Internals::).

   The canonical form adopted is denominator positive (and non-zero),
no common factors between numerator and denominator, and zero uniquely
represented as 0/1.

   It's believed that casting out common factors at each stage of a
calculation is best in general.  A GCD is an O(N^2) operation so it's
better to do a few small ones immediately than to delay and have to do
a big one later.  Knowing the numerator and denominator have no common
factors can be used for example in `mpq_mul' to make only two cross
GCDs necessary, not four.

   This general approach to common factors is badly sub-optimal in the
presence of simple factorizations or little prospect for cancellation,
but GMP has no way to know when this will occur.  As per *Note
Efficiency::, that's left to applications.  The `mpq_t' framework might
still suit, with `mpq_numref' and `mpq_denref' for direct access to the
numerator and denominator, or of course `mpz_t' variables can be used
directly.


File: gmp.info,  Node: Float Internals,  Next: Raw Output Internals,  Prev: Rational Internals,  Up: Internals

17.3 Float Internals
====================

Efficient calculation is the primary aim of GMP floats and the use of
whole limbs and simple rounding facilitates this.

   `mpf_t' floats have a variable precision mantissa and a single
machine word signed exponent.  The mantissa is represented using sign
and magnitude.

        most                   least
     significant            significant
        limb                   limb

                                 _mp_d
      |---- _mp_exp --->           |
       _____ _____ _____ _____ _____
      |_____|_____|_____|_____|_____|
                        . <------------ radix point

       <-------- _mp_size --------->

The fields are as follows.

`_mp_size'
     The number of limbs currently in use, or the negative of that when
     representing a negative value.  Zero is represented by `_mp_size'
     and `_mp_exp' both set to zero, and in that case the `_mp_d' data
     is unused.  (In the future `_mp_exp' might be undefined when
     representing zero.)

`_mp_prec'
     The precision of the mantissa, in limbs.  In any calculation the
     aim is to produce `_mp_prec' limbs of result (the most significant
     being non-zero).

`_mp_d'
     A pointer to the array of limbs which is the absolute value of the
     mantissa.  These are stored "little endian" as per the `mpn'
     functions, so `_mp_d[0]' is the least significant limb and
     `_mp_d[ABS(_mp_size)-1]' the most significant.

     The most significant limb is always non-zero, but there are no
     other restrictions on its value, in particular the highest 1 bit
     can be anywhere within the limb.

     `_mp_prec+1' limbs are allocated to `_mp_d', the extra limb being
     for convenience (see below).  There are no reallocations during a
     calculation, only in a change of precision with `mpf_set_prec'.

`_mp_exp'
     The exponent, in limbs, determining the location of the implied
     radix point.  Zero means the radix point is just above the most
     significant limb.  Positive values mean a radix point offset
     towards the lower limbs and hence a value >= 1, as for example in
     the diagram above.  Negative exponents mean a radix point further
     above the highest limb.

     Naturally the exponent can be any value, it doesn't have to fall
     within the limbs as the diagram shows, it can be a long way above
     or a long way below.  Limbs other than those included in the
     `{_mp_d,_mp_size}' data are treated as zero.

   The `_mp_size' and `_mp_prec' fields are `int', although the
`mp_size_t' type is usually a `long'.  The `_mp_exp' field is usually
`long'.  This is done to make some fields just 32 bits on some 64 bits
systems, thereby saving a few bytes of data space but still providing
plenty of precision and a very large range.


The following various points should be noted.

Low Zeros
     The least significant limbs `_mp_d[0]' etc can be zero, though
     such low zeros can always be ignored.  Routines likely to produce
     low zeros check and avoid them to save time in subsequent
     calculations, but for most routines they're quite unlikely and
     aren't checked.

Mantissa Size Range
     The `_mp_size' count of limbs in use can be less than `_mp_prec' if
     the value can be represented in less.  This means low precision
     values or small integers stored in a high precision `mpf_t' can
     still be operated on efficiently.

     `_mp_size' can also be greater than `_mp_prec'.  Firstly a value is
     allowed to use all of the `_mp_prec+1' limbs a