ange Logs. (line 6) * character set: Character Set. (line 6) * command-line arguments, decoding: Semantics. (line 46) * command-line interface: Command-Line Interfaces. (line 6) * commenting: Comments. (line 6) * compatibility with C and POSIX standards: Compatibility. (line 6) * compiler warnings: Syntactic Conventions. (line 10) * conditional changes, and change logs: Conditional Changes. (line 6) * conditionals, comments for: Comments. (line 60) * configure: Configuration. (line 6) * control-L: Formatting. (line 118) * conventions for makefiles: Makefile Conventions. (line 6) * CORBA: Graphical Interfaces. (line 16) * credits for manuals: Manual Credits. (line 6) * D-bus: Graphical Interfaces. (line 16) * data types, and portability: CPU Portability. (line 6) * declaration for system functions: System Functions. (line 21) * DESTDIR: DESTDIR. (line 6) * documentation: Documentation. (line 6) * doschk: Names. (line 38) * downloading this manual: Preface. (line 14) * encodings: Character Set. (line 6) * error messages: Semantics. (line 19) * error messages, formatting: Errors. (line 6) * exec_prefix: Directory Variables. (line 36) * expressions, splitting: Formatting. (line 81) * FDL, GNU Free Documentation License: GNU Free Documentation License. (line 6) * file usage: File Usage. (line 6) * file-name limitations: Names. (line 38) * formatting error messages: Errors. (line 6) * formatting source code: Formatting. (line 6) * formfeed: Formatting. (line 118) * function argument, declaring: Syntactic Conventions. (line 6) * function prototypes: Standard C. (line 17) * getopt: Command-Line Interfaces. (line 6) * gettext: Internationalization. (line 6) * GNOME: Graphical Interfaces. (line 16) * GNOME and Guile: Source Language. (line 38) * gnustandards project repository: Preface. (line 30) * gnustandards-commit@gnu.org mailing list: Preface. (line 24) * graphical user interface: Graphical Interfaces. (line 6) * grave accent: Quote Characters. (line 6) * GTK+: Graphical Interfaces. (line 6) * Guile: Source Language. (line 38) * implicit int: Syntactic Conventions. (line 6) * impossible conditions: Semantics. (line 70) * installations, staged: DESTDIR. (line 6) * interface styles: Graphical Interfaces. (line 6) * internationalization: Internationalization. (line 6) * keyboard interface: Graphical Interfaces. (line 16) * LDAP: OID Allocations. (line 6) * left quote: Quote Characters. (line 6) * legal aspects: Legal Issues. (line 6) * legal papers: Contributions. (line 6) * libexecdir: Directory Variables. (line 67) * libraries: Libraries. (line 6) * library functions, and portability: System Functions. (line 6) * library interface: Graphical Interfaces. (line 16) * license for manuals: License for Manuals. (line 6) * lint: Syntactic Conventions. (line 109) * locale-specific quote characters: Quote Characters. (line 6) * long option names: Option Table. (line 6) * long-named options: Command-Line Interfaces. (line 12) * makefile, conventions for: Makefile Conventions. (line 6) * malloc return value: Semantics. (line 25) * man pages: Man Pages. (line 6) * manual structure: Manual Structure Details. (line 6) * memory allocation failure: Semantics. (line 25) * memory usage: Memory Usage. (line 6) * message text, and internationalization: Internationalization. (line 29) * mmap: Mmap. (line 6) * multiple variables in a line: Syntactic Conventions. (line 35) * names of variables, functions, and files: Names. (line 6) * NEWS file: NEWS File. (line 6) * non-ASCII characters: Character Set. (line 6) * non-POSIX systems, and portability: System Portability. (line 32) * non-standard extensions: Using Extensions. (line 6) * NUL characters: Semantics. (line 11) * OID allocations for GNU: OID Allocations. (line 6) * open brace: Formatting. (line 6) * optional features, configure-time: Configuration. (line 100) * options for compatibility: Compatibility. (line 14) * options, standard command-line: Command-Line Interfaces. (line 31) * output device and program's behavior: User Interfaces. (line 13) * packaging: Releases. (line 6) * PATH_INFO, specifying standard options as: Command-Line Interfaces. (line 31) * portability, and data types: CPU Portability. (line 6) * portability, and library functions: System Functions. (line 6) * portability, between system types: System Portability. (line 6) * POSIX compatibility: Compatibility. (line 6) * POSIXLY_CORRECT, environment variable: Compatibility. (line 21) * post-installation commands: Install Command Categories. (line 6) * pre-installation commands: Install Command Categories. (line 6) * prefix: Directory Variables. (line 26) * program configuration: Configuration. (line 6) * program design: Design Advice. (line 6) * program name and its behavior: User Interfaces. (line 6) * program's canonical name: --version. (line 12) * programming languages: Source Language. (line 6) * proprietary programs: Reading Non-Free Code. (line 6) * quote characters: Quote Characters. (line 6) * README file: Releases. (line 21) * references to non-free material: References. (line 6) * releasing: Managing Releases. (line 6) * Savannah repository for gnustandards: Preface. (line 30) * sbindir: Directory Variables. (line 60) * signal handling: Semantics. (line 59) * SNMP: OID Allocations. (line 6) * spaces before open-paren: Formatting. (line 75) * staged installs: DESTDIR. (line 6) * standard command-line options: Command-Line Interfaces. (line 31) * standards for makefiles: Makefile Conventions. (line 6) * string library functions: System Functions. (line 55) * syntactic conventions: Syntactic Conventions. (line 6) * table of long options: Option Table. (line 6) * temporary files: Semantics. (line 84) * temporary variables: Syntactic Conventions. (line 23) * texinfo.tex, in a distribution: Releases. (line 70) * TMPDIR environment variable: Semantics. (line 84) * trademarks: Trademarks. (line 6) * user interface styles: Graphical Interfaces. (line 6) * where to obtain standards.texi: Preface. (line 14) * X.509: OID Allocations. (line 6)  Tag Table: Node: Top814 Node: Preface2089 Node: Legal Issues4802 Node: Reading Non-Free Code5272 Node: Contributions7002 Node: Trademarks9240 Node: Design Advice10875 Node: Source Language11467 Node: Compatibility13593 Node: Using Extensions15221 Node: Standard C16797 Node: Conditional Compilation19200 Node: Program Behavior20598 Node: Non-GNU Standards21714 Node: Semantics23995 Node: Libraries28715 Node: Errors29960 Node: User Interfaces32453 Node: Graphical Interfaces34058 Node: Command-Line Interfaces35242 Node: --version37274 Node: --help43011 Node: Option Table43884 Node: OID Allocations58839 Node: Memory Usage60636 Node: File Usage61672 Node: Writing C62422 Node: Formatting63394 Node: Comments67683 Node: Syntactic Conventions71235 Node: Names74697 Node: System Portability76909 Node: CPU Portability79800 Node: System Functions83701 Node: Internationalization88898 Node: Character Set92892 Node: Quote Characters93705 Node: Mmap95225 Node: Documentation95933 Node: GNU Manuals97039 Node: Doc Strings and Manuals102777 Node: Manual Structure Details104330 Node: License for Manuals105748 Node: Manual Credits106722 Node: Printed Manuals107115 Node: NEWS File107801 Node: Change Logs108479 Node: Change Log Concepts109233 Node: Style of Change Logs111336 Node: Simple Changes113836 Node: Conditional Changes115278 Node: Indicating the Part Changed116700 Node: Man Pages117227 Node: Reading other Manuals119433 Node: Managing Releases120224 Node: Configuration121005 Node: Makefile Conventions129670 Node: Makefile Basics130552 Node: Utilities in Makefiles133726 Node: Command Variables135871 Node: DESTDIR139093 Node: Directory Variables141242 Node: Standard Targets155735 Ref: Standard Targets-Footnote-1169250 Node: Install Command Categories169350 Node: Releases173883 Node: References177888 Node: GNU Free Documentation License183735 Node: Index208902  End Tag Table This is ../../gmp/doc/gmp.info, produced by makeinfo version 4.8 from ../../gmp/doc/gmp.texi. This manual describes how to install and use the GNU multiple precision arithmetic library, version 5.0.1. Copyright 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 Free Software Foundation, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, with the Front-Cover Texts being "A GNU Manual", and with the Back-Cover Texts being "You have freedom to copy and modify this GNU Manual, like GNU software". A copy of the license is included in *Note GNU Free Documentation License::. INFO-DIR-SECTION GNU libraries START-INFO-DIR-ENTRY * gmp: (gmp). GNU Multiple Precision Arithmetic Library. END-INFO-DIR-ENTRY  File: gmp.info, Node: Powering Algorithms, Next: Root Extraction Algorithms, Prev: Greatest Common Divisor Algorithms, Up: Algorithms 16.4 Powering Algorithms ======================== * Menu: * Normal Powering Algorithm:: * Modular Powering Algorithm::  File: gmp.info, Node: Normal Powering Algorithm, Next: Modular Powering Algorithm, Prev: Powering Algorithms, Up: Powering Algorithms 16.4.1 Normal Powering ---------------------- Normal `mpz' or `mpf' powering uses a simple binary algorithm, successively squaring and then multiplying by the base when a 1 bit is seen in the exponent, as per Knuth section 4.6.3. The "left to right" variant described there is used rather than algorithm A, since it's just as easy and can be done with somewhat less temporary memory.  File: gmp.info, Node: Modular Powering Algorithm, Prev: Normal Powering Algorithm, Up: Powering Algorithms 16.4.2 Modular Powering ----------------------- Modular powering is implemented using a 2^k-ary sliding window algorithm, as per "Handbook of Applied Cryptography" algorithm 14.85 (*note References::). k is chosen according to the size of the exponent. Larger exponents use larger values of k, the choice being made to minimize the average number of multiplications that must supplement the squaring. The modular multiplies and squares use either a simple division or the REDC method by Montgomery (*note References::). REDC is a little faster, essentially saving N single limb divisions in a fashion similar to an exact remainder (*note Exact Remainder::).  File: gmp.info, Node: Root Extraction Algorithms, Next: Radix Conversion Algorithms, Prev: Powering Algorithms, Up: Algorithms 16.5 Root Extraction Algorithms =============================== * Menu: * Square Root Algorithm:: * Nth Root Algorithm:: * Perfect Square Algorithm:: * Perfect Power Algorithm::  File: gmp.info, Node: Square Root Algorithm, Next: Nth Root Algorithm, Prev: Root Extraction Algorithms, Up: Root Extraction Algorithms 16.5.1 Square Root ------------------ Square roots are taken using the "Karatsuba Square Root" algorithm by Paul Zimmermann (*note References::). An input n is split into four parts of k bits each, so with b=2^k we have n = a3*b^3 + a2*b^2 + a1*b + a0. Part a3 must be "normalized" so that either the high or second highest bit is set. In GMP, k is kept on a limb boundary and the input is left shifted (by an even number of bits) to normalize. The square root of the high two parts is taken, by recursive application of the algorithm (bottoming out in a one-limb Newton's method), s1,r1 = sqrtrem (a3*b + a2) This is an approximation to the desired root and is extended by a division to give s,r, q,u = divrem (r1*b + a1, 2*s1) s = s1*b + q r = u*b + a0 - q^2 The normalization requirement on a3 means at this point s is either correct or 1 too big. r is negative in the latter case, so if r < 0 then r = r + 2*s - 1 s = s - 1 The algorithm is expressed in a divide and conquer form, but as noted in the paper it can also be viewed as a discrete variant of Newton's method, or as a variation on the schoolboy method (no longer taught) for square roots two digits at a time. If the remainder r is not required then usually only a few high limbs of r and u need to be calculated to determine whether an adjustment to s is required. This optimization is not currently implemented. In the Karatsuba multiplication range this algorithm is O(1.5*M(N/2)), where M(n) is the time to multiply two numbers of n limbs. In the FFT multiplication range this grows to a bound of O(6*M(N/2)). In practice a factor of about 1.5 to 1.8 is found in the Karatsuba and Toom-3 ranges, growing to 2 or 3 in the FFT range. The algorithm does all its calculations in integers and the resulting `mpn_sqrtrem' is used for both `mpz_sqrt' and `mpf_sqrt'. The extended precision given by `mpf_sqrt_ui' is obtained by padding with zero limbs.  File: gmp.info, Node: Nth Root Algorithm, Next: Perfect Square Algorithm, Prev: Square Root Algorithm, Up: Root Extraction Algorithms 16.5.2 Nth Root --------------- Integer Nth roots are taken using Newton's method with the following iteration, where A is the input and n is the root to be taken. 1 A a[i+1] = - * ( --------- + (n-1)*a[i] ) n a[i]^(n-1) The initial approximation a[1] is generated bitwise by successively powering a trial root with or without new 1 bits, aiming to be just above the true root. The iteration converges quadratically when started from a good approximation. When n is large more initial bits are needed to get good convergence. The current implementation is not particularly well optimized.  File: gmp.info, Node: Perfect Square Algorithm, Next: Perfect Power Algorithm, Prev: Nth Root Algorithm, Up: Root Extraction Algorithms 16.5.3 Perfect Square --------------------- A significant fraction of non-squares can be quickly identified by checking whether the input is a quadratic residue modulo small integers. `mpz_perfect_square_p' first tests the input mod 256, which means just examining the low byte. Only 44 different values occur for squares mod 256, so 82.8% of inputs can be immediately identified as non-squares. On a 32-bit system similar tests are done mod 9, 5, 7, 13 and 17, for a total 99.25% of inputs identified as non-squares. On a 64-bit system 97 is tested too, for a total 99.62%. These moduli are chosen because they're factors of 2^24-1 (or 2^48-1 for 64-bits), and such a remainder can be quickly taken just using additions (see `mpn_mod_34lsub1'). When nails are in use moduli are instead selected by the `gen-psqr.c' program and applied with an `mpn_mod_1'. The same 2^24-1 or 2^48-1 could be done with nails using some extra bit shifts, but this is not currently implemented. In any case each modulus is applied to the `mpn_mod_34lsub1' or `mpn_mod_1' remainder and a table lookup identifies non-squares. By using a "modexact" style calculation, and suitably permuted tables, just one multiply each is required, see the code for details. Moduli are also combined to save operations, so long as the lookup tables don't become too big. `gen-psqr.c' does all the pre-calculations. A square root must still be taken for any value that passes these tests, to verify it's really a square and not one of the small fraction of non-squares that get through (ie. a pseudo-square to all the tested bases). Clearly more residue tests could be done, `mpz_perfect_square_p' only uses a compact and efficient set. Big inputs would probably benefit from more residue testing, small inputs might be better off with less. The assumed distribution of squares versus non-squares in the input would affect such considerations.  File: gmp.info, Node: Perfect Power Algorithm, Prev: Perfect Square Algorithm, Up: Root Extraction Algorithms 16.5.4 Perfect Power -------------------- Detecting perfect powers is required by some factorization algorithms. Currently `mpz_perfect_power_p' is implemented using repeated Nth root extractions, though naturally only prime roots need to be considered. (*Note Nth Root Algorithm::.) If a prime divisor p with multiplicity e can be found, then only roots which are divisors of e need to be considered, much reducing the work necessary. To this end divisibility by a set of small primes is checked.  File: gmp.info, Node: Radix Conversion Algorithms, Next: Other Algorithms, Prev: Root Extraction Algorithms, Up: Algorithms 16.6 Radix Conversion ===================== Radix conversions are less important than other algorithms. A program dominated by conversions should probably use a different data representation. * Menu: * Binary to Radix:: * Radix to Binary::  File: gmp.info, Node: Binary to Radix, Next: Radix to Binary, Prev: Radix Conversion Algorithms, Up: Radix Conversion Algorithms 16.6.1 Binary to Radix ---------------------- Conversions from binary to a power-of-2 radix use a simple and fast O(N) bit extraction algorithm. Conversions from binary to other radices use one of two algorithms. Sizes below `GET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method. Repeated divisions by b^n are made, where b is the radix and n is the biggest power that fits in a limb. But instead of simply using the remainder r from such divisions, an extra divide step is done to give a fractional limb representing r/b^n. The digits of r can then be extracted using multiplications by b rather than divisions. Special case code is provided for decimal, allowing multiplications by 10 to optimize to shifts and adds. Above `GET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is used. For an input t, powers b^(n*2^i) of the radix are calculated, until a power between t and sqrt(t) is reached. t is then divided by that largest power, giving a quotient which is the digits above that power, and a remainder which is those below. These two parts are in turn divided by the second highest power, and so on recursively. When a piece has been divided down to less than `GET_STR_DC_THRESHOLD' limbs, the basecase algorithm described above is used. The advantage of this algorithm is that big divisions can make use of the sub-quadratic divide and conquer division (*note Divide and Conquer Division::), and big divisions tend to have less overheads than lots of separate single limb divisions anyway. But in any case the cost of calculating the powers b^(n*2^i) must first be overcome. `GET_STR_PRECOMPUTE_THRESHOLD' and `GET_STR_DC_THRESHOLD' represent the same basic thing, the point where it becomes worth doing a big division to cut the input in half. `GET_STR_PRECOMPUTE_THRESHOLD' includes the cost of calculating the radix power required, whereas `GET_STR_DC_THRESHOLD' assumes that's already available, which is the case when recursing. Since the base case produces digits from least to most significant but they want to be stored from most to least, it's necessary to calculate in advance how many digits there will be, or at least be sure not to underestimate that. For GMP the number of input bits is multiplied by `chars_per_bit_exactly' from `mp_bases', rounding up. The result is either correct or one too big. Examining some of the high bits of the input could increase the chance of getting the exact number of digits, but an exact result every time would not be practical, since in general the difference between numbers 100... and 99... is only in the last few bits and the work to identify 99... might well be almost as much as a full conversion. `mpf_get_str' doesn't currently use the algorithm described here, it multiplies or divides by a power of b to move the radix point to the just above the highest non-zero digit (or at worst one above that location), then multiplies by b^n to bring out digits. This is O(N^2) and is certainly not optimal. The r/b^n scheme described above for using multiplications to bring out digits might be useful for more than a single limb. Some brief experiments with it on the base case when recursing didn't give a noticeable improvement, but perhaps that was only due  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~to the implementation. Something similar would work for the sub-quadratic divisions too, though there would be the cost of calculating a bigger radix power. Another possible improvement for the sub-quadratic part would be to arrange for radix powers that balanced the sizes of quotient and remainder produced, ie. the highest power would be an b^(n*k) approximately equal to sqrt(t), not restricted to a 2^i factor. That ought to smooth out a graph of times against sizes, but may or may not be a net speedup.  File: gmp.info, Node: Radix to Binary, Prev: Binary to Radix, Up: Radix Conversion Algorithms 16.6.2 Radix to Binary ---------------------- *This section needs to be rewritten, it currently describes the algorithms used before GMP 4.3.* Conversions from a power-of-2 radix into binary use a simple and fast O(N) bitwise concatenation algorithm. Conversions from other radices use one of two algorithms. Sizes below `SET_STR_PRECOMPUTE_THRESHOLD' use a basic O(N^2) method. Groups of n digits are converted to limbs, where n is the biggest power of the base b which will fit in a limb, then those groups are accumulated into the result by multiplying by b^n and adding. This saves multi-precision operations, as per Knuth section 4.4 part E (*note References::). Some special case code is provided for decimal, giving the compiler a chance to optimize multiplications by 10. Above `SET_STR_PRECOMPUTE_THRESHOLD' a sub-quadratic algorithm is used. First groups of n digits are converted into limbs. Then adjacent limbs are combined into limb pairs with x*b^n+y, where x and y are the limbs. Adjacent limb pairs are combined into quads similarly with x*b^(2n)+y. This continues until a single block remains, that being the result. The advantage of this method is that the multiplications for each x are big blocks, allowing Karatsuba and higher algorithms to be used. But the cost of calculating the powers b^(n*2^i) must be overcome. `SET_STR_PRECOMPUTE_THRESHOLD' usually ends up quite big, around 5000 digits, and on some processors much bigger still. `SET_STR_PRECOMPUTE_THRESHOLD' is based on the input digits (and tuned for decimal), though it might be better based on a limb count, so as to be independent of the base. But that sort of count isn't used by the base case and so would need some sort of initial calculation or estimate. The main reason `SET_STR_PRECOMPUTE_THRESHOLD' is so much bigger than the corresponding `GET_STR_PRECOMPUTE_THRESHOLD' is that `mpn_mul_1' is much faster than `mpn_divrem_1' (often by a factor of 5, or more).  File: gmp.info, Node: Other Algorithms, Next: Assembly Coding, Prev: Radix Conversion Algorithms, Up: Algorithms 16.7 Other Algorithms ===================== * Menu: * Prime Testing Algorithm:: * Factorial Algorithm:: * Binomial Coefficients Algorithm:: * Fibonacci Numbers Algorithm:: * Lucas Numbers Algorithm:: * Random Number Algorithms::  File: gmp.info, Node: Prime Testing Algorithm, Next: Factorial Algorithm, Prev: Other Algorithms, Up: Other Algorithms 16.7.1 Prime Testing -------------------- The primality testing in `mpz_probab_prime_p' (*note Number Theoretic Functions::) first does some trial division by small factors and then uses the Miller-Rabin probabilistic primality testing algorithm, as described in Knuth section 4.5.4 algorithm P (*note References::). For an odd input n, and with n = q*2^k+1 where q is odd, this algorithm selects a random base x and tests whether x^q mod n is 1 or -1, or an x^(q*2^j) mod n is 1, for 1<=j<=k. If so then n is probably prime, if not then n is definitely composite. Any prime n will pass the test, but some composites do too. Such composites are known as strong pseudoprimes to base x. No n is a strong pseudoprime to more than 1/4 of all bases (see Knuth exercise 22), hence with x chosen at random there's no more than a 1/4 chance a "probable prime" will in fact be composite. In fact strong pseudoprimes are quite rare, making the test much more powerful than this analysis would suggest, but 1/4 is all that's proven for an arbitrary n.  File: gmp.info, Node: Factorial Algorithm, Next: Binomial Coefficients Algorithm, Prev: Prime Testing Algorithm, Up: Other Algorithms 16.7.2 Factorial ---------------- Factorials are calculated by a combination of removal of twos, powering, and binary splitting. The procedure can be best illustrated with an example, 23! = 1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23 has factors of two removed, 23! = 2^19.1.1.3.1.5.3.7.1.9.5.11.3.13.7.15.1.17.9.19.5.21.11.23 and the resulting terms collected up according to their multiplicity, 23! = 2^19.(3.5)^3.(7.9.11)^2.(13.15.17.19.21.23) Each sequence such as 13.15.17.19.21.23 is evaluated by splitting into every second term, as for instance (13.17.21).(15.19.23), and the same recursively on each half. This is implemented iteratively using some bit twiddling. Such splitting is more efficient than repeated Nx1 multiplies since it forms big multiplies, allowing Karatsuba and higher algorithms to be used. And even below the Karatsuba threshold a big block of work can be more efficient for the basecase algorithm. Splitting into subsequences of every second term keeps the resulting products more nearly equal in size than would the simpler approach of say taking the first half and second half of the sequence. Nearly equal products are more efficient for the current multiply implementation.  File: gmp.info, Node: Binomial Coefficients Algorithm, Next: Fibonacci Numbers Algorithm, Prev: Factorial Algorithm, Up: Other Algorithms 16.7.3 Binomial Coefficients ---------------------------- Binomial coefficients C(n,k) are calculated by first arranging k <= n/2 using C(n,k) = C(n,n-k) if necessary, and then evaluating the following product simply from i=2 to i=k. k (n-k+i) C(n,k) = (n-k+1) * prod ------- i=2 i It's easy to show that each denominator i will divide the product so far, so the exact division algorithm is used (*note Exact Division::). The numerators n-k+i and denominators i are first accumulated into as many fit a limb, to save multi-precision operations, though for `mpz_bin_ui' this applies only to the divisors, since n is an `mpz_t' and n-k+i in general won't fit in a limb at all.  File: gmp.info, Node: Fibonacci Numbers Algorithm, Next: Lucas Numbers Algorithm, Prev: Binomial Coefficients Algorithm, Up: Other Algorithms 16.7.4 Fibonacci Numbers ------------------------ The Fibonacci functions `mpz_fib_ui' and `mpz_fib2_ui' are designed for calculating isolated F[n] or F[n],F[n-1] values efficiently. For small n, a table of single limb values in `__gmp_fib_table' is used. On a 32-bit limb this goes up to F[47], or on a 64-bit limb up to F[93]. For convenience the table starts at F[-1]. Beyond the table, values are generated with a binary powering algorithm, calculating a pair F[n] and F[n-1] working from high to low across the bits of n. The formulas used are F[2k+1] = 4*F[k]^2 - F[k-1]^2 + 2*(-1)^k F[2k-1] = F[k]^2 + F[k-1]^2 F[2k] = F[2k+1] - F[2k-1] At each step, k is the high b bits of n. If the next bit of n is 0 then F[2k],F[2k-1] is used, or if it's a 1 then F[2k+1],F[2k] is used, and the process repeated until all bits of n are incorporated. Notice these formulas require just two squares per bit of n. It'd be possible to handle the first few n above the single limb table with simple additions, using the defining Fibonacci recurrence F[k+1]=F[k]+F[k-1], but this is not done since it usually turns out to be faster for only about 10 or 20 values of n, and including a block of code for just those doesn't seem worthwhile. If they really mattered it'd be better to extend the data table. Using a table avoids lots of calculations on small numbers, and makes small n go fast. A bigger table would make more small n go fast, it's just a question of balancing size against desired speed. For GMP the code is kept compact, with the emphasis primarily on a good powering algorithm. `mpz_fib2_ui' returns both F[n] and F[n-1], but `mpz_fib_ui' is only interested in F[n]. In this case the last step of the algorithm can become one multiply instead of two squares. One of the following two formulas is used, according as n is odd or even. F[2k] = F[k]*(F[k]+2F[k-1]) F[2k+1] = (2F[k]+F[k-1])*(2F[k]-F[k-1]) + 2*(-1)^k F[2k+1] here is the same as above, just rearranged to be a multiply. For interest, the 2*(-1)^k term both here and above can be applied just to the low limb of the calculation, without a carry or borrow into further limbs, which saves some code size. See comments with `mpz_fib_ui' and the internal `mpn_fib2_ui' for how this is done.  File: gmp.info, Node: Lucas Numbers Algorithm, Next: Random Number Algorithms, Prev: Fibonacci Numbers Algorithm, Up: Other Algorithms 16.7.5 Lucas Numbers -------------------- `mpz_lucnum2_ui' derives a pair of Lucas numbers from a pair of Fibonacci numbers with the following simple formulas. L[k] = F[k] + 2*F[k-1] L[k-1] = 2*F[k] - F[k-1] `mpz_lucnum_ui' is only interested in L[n], and some work can be saved. Trailing zero bits on n can be handled with a single square each. L[2k] = L[k]^2 - 2*(-1)^k And the lowest 1 bit can be handled with one multiply of a pair of Fibonacci numbers, similar to what `mpz_fib_ui' does. L[2k+1] = 5*F[k-1]*(2*F[k]+F[k-1]) - 4*(-1)^k  File: gmp.info, Node: Random Number Algorithms, Prev: Lucas Numbers Algorithm, Up: Other Algorithms 16.7.6 Random Numbers --------------------- For the `urandomb' functions, random numbers are generated simply by concatenating bits produced by the generator. As long as the generator has good randomness properties this will produce well-distributed N bit numbers. For the `urandomm' functions, random numbers in a range 0<=R48 bit pieces is convenient. With some care though six 21x32->53 bit products can be used, if one of the lower two 21-bit pieces also uses the sign bit. For the `mpn_mul_1' family of functions on a 64-bit machine, the invariant single limb is split at the start, into 3 or 4 pieces. Inside the loop, the bignum operand is split into 32-bit pieces. Fast conversion of these unsigned 32-bit pieces to floating point is highly machine-dependent. In some cases, reading the data into the integer unit, zero-extending to 64-bits, then transferring to the floating point unit back via memory is the only option. Converting partial products back to 64-bit limbs is usually best done as a signed conversion. Since all values are smaller than 2^53, signed and unsigned are the same, but most processors lack unsigned conversions. Here is a diagram showing 16x32 bit products for an `mpn_mul_1' or `mpn_addmul_1' with a 64-bit limb. The single limb operand V is split into four 16-bit parts. The multi-limb operand U is split in the loop into two 32-bit parts. +---+---+---+---+ |v48|v32|v16|v00| V operand +---+---+---+---+ +-------+---+---+ x | u32 | u00 | U operand (one limb) +---------------+ --------------------------------- +-----------+ | u00 x v00 | p00 48-bit products +-----------+ +-----------+ | u00 x v16 | p16 +-----------+ +-----------+ | u00 x v32 | p32 +-----------+ +-----------+ | u00 x v48 | p48 +-----------+ +-----------+ | u32 x v00 | r32 +-----------+ +-----------+ | u32 x v16 | r48 +-----------+ +-----------+ | u32 x v32 | r64 +-----------+ +-----------+ | u32 x v48 | r80 +-----------+ p32 and r32 can be summed using floating-point addition, and likewise p48 and r48. p00 and p16 can be summed with r64 and r80 from the previous iteration. For each loop then, four 49-bit quantities are transferred to the integer unit, aligned as follows, |-----64bits----|-----64bits----| +------------+ | p00 + r64' | i00 +------------+ +------------+ | p16 + r80' | i16 +------------+ +------------+ | p32 + r32 | i32 +------------+ +------------+ | p48 + r48 | i48 +------------+ The challenge then is to sum these efficiently and add in a carry limb, generating a low 64-bit result limb and a high 33-bit carry limb (i48 extends 33 bits into the high half).  File: gmp.info, Node: Assembly SIMD Instructions, Next: Assembly Software Pipelining, Prev: Assembly Floating Point, Up: Assembly Coding 16.8.7 SIMD Instructions ------------------------ The single-instruction multiple-data support in current microprocessors is aimed at signal processing algorithms where each data point can be treated more or less independently. There's generally not much support for propagating the sort of carries that arise in GMP. SIMD multiplications of say four 16x16 bit multiplies only do as much work as one 32x32 from GMP's point of view, and need some shifts and adds besides. But of course if say the SIMD form is fully pipelined and uses less instruction decoding then it may still be worthwhile. On the x86 chips, MMX has so far found a use in `mpn_rshift' and `mpn_lshift', and is used in a special case for 16-bit multipliers in the P55 `mpn_mul_1'. SSE2 is used for Pentium 4 `mpn_mul_1', `mpn_addmul_1', and `mpn_submul_1'.  File: gmp.info, Node: Assembly Software Pipelining, Next: Assembly Loop Unrolling, Prev: Assembly SIMD Instructions, Up: Assembly Coding 16.8.8 Software Pipelining -------------------------- Software pipelining consists of scheduling instructions around the branch point in a loop. For example a loop might issue a load not for use in the present iteration but the next, thereby allowing extra cycles for the data to arrive from memory. Naturally this is wanted only when doing things like loads or multiplies that take several cycles to complete, and only where a CPU has multiple functional units so that other work can be done in the meantime. A pipeline with several stages will have a data value in progress at each stage and each loop iteration moves them along one stage. This is like juggling. If the latency of some instruction is greater than the loop time then it will be necessary to unroll, so one register has a result ready to use while another (or multiple others) are still in progress. (*note Assembly Loop Unrolling::).  File: gmp.info, Node: Assembly Loop Unrolling, Next: Assembly Writing Guide, Prev: Assembly Software Pipelining, Up: Assembly Coding 16.8.9 Loop Unrolling --------------------- Loop unrolling consists of replicating code so that several limbs are processed in each loop. At a minimum this reduces loop overheads by a corresponding factor, but it can also allow better register usage, for example alternately using one register combination and then another. Judicious use of `m4' macros can help avoid lots of duplication in the source code. Any amount of unrolling can be handled with a loop counter that's decremented by N each time, stopping when the remaining count is less than the further N the loop will process. Or by subtracting N at the start, the termination condition becomes when the counter C is less than 0 (and the count of remaining limbs is C+N). Alternately for a power of 2 unroll the loop count and remainder can be established with a shift and mask. This is convenient if also making a computed jump into the middle of a large loop. The limbs not a multiple of the unrolling can be handled in various ways, for example * A simple loop at the end (or the start) to process the excess. Care will be wanted that it isn't too much slower than the unrolled part. * A set of binary tests, for example after an 8-limb unrolling, test for 4 more limbs to process, then a further 2 more or not, and finally 1 more or not. This will probably take more code space than a simple loop. * A `switch' statement, providing separate code for each possible excess, for example an 8-limb unrolling would have separate code for 0 remaining, 1 remaining, etc, up to 7 remaining. This might take a lot of code, but may be the best way to optimize all cases in combination with a deep pipelined loop. * A computed jump into the middle of the loop, thus making the first iteration handle the excess. This should make times smoothly increase with size, which is attractive, but setups for the jump and adjustments for pointers can be tricky and could become quite difficult in combination with deep pipelining.  File: gmp.info, Node: Assembly Writing Guide, Prev: Assembly Loop Unrolling, Up: Assembly Coding 16.8.10 Writing Guide --------------------- This is a guide to writing software pipelined loops for processing limb vectors in assembly. First determine the algorithm and which instructions are needed. Code it without unrolling or scheduling, to make sure it works. On a 3-operand CPU try to write each new value to a new register, this will greatly simplify later steps. Then note for each instruction the functional unit and/or issue port requirements. If an instruction can use either of two units, like U0 or U1 then make a category "U0/U1". Count the total using each unit (or combined unit), and count all instructions. Figure out from those counts the best possible loop time. The goal will be to find a perfect schedule where instruction latencies are completely hidden. The total instruction count might be the limiting factor, or perhaps a particular functional unit. It might be possible to tweak the instructions to help the limiting factor. Suppose the loop time is N, then make N issue buckets, with the final loop branch at the end of the last. Now fill the buckets with dummy instructions using the functional units desired. Run this to make sure the intended speed is reached. Now replace the dummy instructions with the real instructions from the slow but correct loop you started with. The first will typically be a load instruction. Then the instruction using that value is placed in a bucket an appropriate distance down. Run the loop again, to check it still runs at target speed. Keep placing instructions, frequently measuring the loop. After a few you will need to wrap around from the last bucket back to the top of the loop. If you used the new-register for new-value strategy above then there will be no register conflicts. If not then take care not to clobber something already in use. Changing registers at this time is very error prone. The loop will overlap two or more of the original loop iterations, and the computation of one vector element result will be started in one iteration of the new loop, and completed one or several iterations later. The final step is to create feed-in and wind-down code for the loop. A good way to do this is to make a copy (or copies) of the loop at the start and delete those instructions which don't have valid antecedents, and at the end replicate and delete those whose results are unwanted (including any further loads). The loop will have a minimum number of limbs loaded and processed, so the feed-in code must test if the request size is smaller and skip either to a suitable part of the wind-down or to special code for small sizes.  File: gmp.info, Node: Internals, Next: Contributors, Prev: Algorithms, Up: Top 17 Internals ************ *This chapter is provided only for informational purposes and the various internals described here may change in future GMP releases. Applications expecting to be compatible with future releases should use only the documented interfaces described in previous chapters.* * Menu: * Integer Internals:: * Rational Internals:: * Float Internals:: * Raw Output Internals:: * C++ Interface Internals::  File: gmp.info, Node: Integer Internals, Next: Rational Internals, Prev: Internals, Up: Internals 17.1 Integer Internals ====================== `mpz_t' variables represent integers using sign and magnitude, in space dynamically allocated and reallocated. The fields are as follows. `_mp_size' The number of limbs, or the negative of that when representing a negative integer. Zero is represented by `_mp_size' set to zero, in which case the `_mp_d' data is unused. `_mp_d' A pointer to an array of limbs which is the magnitude. These are stored "little endian" as per the `mpn' functions, so `_mp_d[0]' is the least significant limb and `_mp_d[ABS(_mp_size)-1]' is the most significant. Whenever `_mp_size' is non-zero, the most significant limb is non-zero. Currently there's always at least one limb allocated, so for instance `mpz_set_ui' never needs to reallocate, and `mpz_get_ui' can fetch `_mp_d[0]' unconditionally (though its value is then only wanted if `_mp_size' is non-zero). `_mp_alloc' `_mp_alloc' is the number of limbs currently allocated at `_mp_d', and naturally `_mp_alloc >= ABS(_mp_size)'. When an `mpz' routine is about to (or might be about to) increase `_mp_size', it checks `_mp_alloc' to see whether there's enough space, and reallocates if not. `MPZ_REALLOC' is generally used for this. The various bitwise logical functions like `mpz_and' behave as if negative values were twos complement. But sign and magnitude is always used internally, and necessary adjustments are made during the calculations. Sometimes this isn't pretty, but sign and magnitude are best for other routines. Some internal temporary variables are setup with `MPZ_TMP_INIT' and these have `_mp_d' space obtained from `TMP_ALLOC' rather than the memory allocation functions. Care is taken to ensure that these are big enough that no reallocation is necessary (since it would have unpredictable consequences). `_mp_size' and `_mp_alloc' are `int', although `mp_size_t' is usually a `long'. This is done to make the fields just 32 bits on some 64 bits systems, thereby saving a few bytes of data space but still providing plenty of range.  File: gmp.info, Node: Rational Internals, Next: Float Internals, Prev: Integer Internals, Up: Internals 17.2 Rational Internals ======================= `mpq_t' variables represent rationals using an `mpz_t' numerator and denominator (*note Integer Internals::). The canonical form adopted is denominator positive (and non-zero), no common factors between numerator and denominator, and zero uniquely represented as 0/1. It's believed that casting out common factors at each stage of a calculation is best in general. A GCD is an O(N^2) operation so it's better to do a few small ones immediately than to delay and have to do a big one later. Knowing the numerator and denominator have no common factors can be used for example in `mpq_mul' to make only two cross GCDs necessary, not four. This general approach to common factors is badly sub-optimal in the presence of simple factorizations or little prospect for cancellation, but GMP has no way to know when this will occur. As per *Note Efficiency::, that's left to applications. The `mpq_t' framework might still suit, with `mpq_numref' and `mpq_denref' for direct access to the numerator and denominator, or of course `mpz_t' variables can be used directly.  File: gmp.info, Node: Float Internals, Next: Raw Output Internals, Prev: Rational Internals, Up: Internals 17.3 Float Internals ==================== Efficient calculation is the primary aim of GMP floats and the use of whole limbs and simple rounding facilitates this. `mpf_t' floats have a variable precision mantissa and a single machine word signed exponent. The mantissa is represented using sign and magnitude. most least significant significant limb limb _mp_d |---- _mp_exp ---> | _____ _____ _____ _____ _____ |_____|_____|_____|_____|_____| . <------------ radix point <-------- _mp_size ---------> The fields are as follows. `_mp_size' The number of limbs currently in use, or the negative of that when representing a negative value. Zero is represented by `_mp_size' and `_mp_exp' both set to zero, and in that case the `_mp_d' data is unused. (In the future `_mp_exp' might be undefined when representing zero.) `_mp_prec' The precision of the mantissa, in limbs. In any calculation the aim is to produce `_mp_prec' limbs of result (the most significant being non-zero). `_mp_d' A pointer to the array of limbs which is the absolute value of the mantissa. These are stored "little endian" as per the `mpn' functions, so `_mp_d[0]' is the least significant limb and `_mp_d[ABS(_mp_size)-1]' the most significant. The most significant limb is always non-zero, but there are no other restrictions on its value, in particular the highest 1 bit can be anywhere within the limb. `_mp_prec+1' limbs are allocated to `_mp_d', the extra limb being for convenience (see below). There are no reallocations during a calculation, only in a change of precision with `mpf_set_prec'. `_mp_exp' The exponent, in limbs, determining the location of the implied radix point. Zero means the radix point is just above the most significant limb. Positive values mean a radix point offset towards the lower limbs and hence a value >= 1, as for example in the diagram above. Negative exponents mean a radix point further above the highest limb. Naturally the exponent can be any value, it doesn't have to fall within the limbs as the diagram shows, it can be a long way above or a long way below. Limbs other than those included in the `{_mp_d,_mp_size}' data are treated as zero. The `_mp_size' and `_mp_prec' fields are `int', although the `mp_size_t' type is usually a `long'. The `_mp_exp' field is usually `long'. This is done to make some fields just 32 bits on some 64 bits systems, thereby saving a few bytes of data space but still providing plenty of precision and a very large range. The following various points should be noted. Low Zeros The least significant limbs `_mp_d[0]' etc can be zero, though such low zeros can always be ignored. Routines likely to produce low zeros check and avoid them to save time in subsequent calculations, but for most routines they're quite unlikely and aren't checked. Mantissa Size Range The `_mp_size' count of limbs in use can be less than `_mp_prec' if the value can be represented in less. This means low precision values or small integers stored in a high precision `mpf_t' can still be operated on efficiently. `_mp_size' can also be greater than `_mp_prec'. Firstly a value is allowed to use all of the `_mp_prec+1' limbs a