Even faster integer multiplication

Let

denote the cost of multiplying two

-bit integers in the deterministic multitape Turing model [38] (commonly called “bit complexity”). Previously, the best known asymptotic bound for

was due to Fürer [18, 19]. He proved that there is a constant

such that

The main contribution of this paper is a new algorithm that yields the following improvement.

Fürer suggested several methods to minimise the value of

in his algorithm, but did not give an explicit bound for

. In section 7 of this paper, we outline an optimised variant of Fürer's algorithm that achieves

. We do not know how to obtain

using Fürer's approach. This suggests that the new algorithm is faster than Fürer's by a factor of

The idea of the new algorithm is remarkably simple. Given two

-bit integers, we split them into chunks of exponentially smaller size, say around

bits, and thus reduce to the problem of multiplying integer polynomials of degree

with coefficients of bit size

. We multiply the polynomials using discrete Fourier transforms (DFTs) over

, with a working precision of

bits. To compute the DFTs, we decompose them into “short transforms” of exponentially smaller length, say length around

, using the Cooley–Tukey method. We then use Bluestein's chirp transform to convert each short transform into a polynomial multiplication problem over

, and finally convert back to integer multiplication via Kronecker substitution. These much smaller integer multiplications are handled recursively.

The algorithm just sketched leads immediately to a bound of the form (1.1). A detailed proof is given in section 4. We emphasise that the new method works directly over

, and does not need special coefficient rings with “fast” roots of unity, of the type constructed by Fürer. Optimising parameters and keeping careful track of constants leads to Theorem 1.1, which is proved in section 6. We also prove the following conditional result in section 9.

Conjecture 9.1 is a slight weakening of the Lenstra–Pomerance–Wagstaff conjecture on the distribution of Mersenne primes, i.e., primes of the form

. The idea of the algorithm is to replace the coefficient ring

by the finite field

; we are then able to exploit fast algorithms for multiplication modulo numbers of the form

An important feature of the new algorithms is that the same techniques are applicable in other contexts, such as polynomial multiplication over finite fields. Previously, no Fürer-type complexity bounds were known for the latter problem. The details are presented in the companion paper [24].

In the remainder of this section, we present a brief history of complexity bounds for integer multiplication, and we give an overview of the paper and of our contribution. More historical details can be found in books such as [21, Chapter 8].

1.1.Brief history and related work

Multiplication algorithms of complexity

in the number of digits

were already known in ancient civilisations. The Egyptians used an algorithm based on repeated doublings and additions. The Babylonians invented the positional numbering system, while performing their computations in base

instead of

. Precise descriptions of multiplication methods close to the ones that we learn at school appeared in Europe during the late Middle Ages. For historical references, we refer to [49, Section II.5] and [37, 5].

The first subquadratic algorithm for integer multiplication, with complexity

, was discovered by Karatsuba [30, 31]. From a modern viewpoint, Karatsuba's algorithm utilises an evaluation-interpolation scheme. The input integers are cut into smaller chunks, which are taken to be the coefficients of two integer polynomials; the polynomials are evaluated at several well-chosen points; their values at those points are (recursively) multiplied; interpolating the results at those points yields the product polynomial; finally, the integer product is recovered by pasting together the coefficients of the product polynomial. This cutting-and-pasting procedure is sometimes known as Kronecker segmentation (see section 2.6).

Shortly after the discovery of Karatsuba's algorithm, which uses three evaluation points, Toom generalised it so as to use

evaluation points instead [51, 50], for any

. This leads to the bound

for fixed

. Letting

grow slowly with

, he also showed that

. The algorithm was adapted to the Turing model by Cook [10] and is now known as Toom–Cook multiplication. Schönhage obtained a slightly better bound [45] by working modulo several numbers of the form

instead of using several polynomial evaluation points. Knuth proved that an even better complexity bound could be achieved by suitably adapting Toom's method [32].

The next step towards even faster integer multiplication was the rediscovery of the fast Fourier transform (FFT) by Cooley and Tukey [11] (essentially the same algorithm was already known to Gauss [27]). The FFT yields particularly efficient algorithms for evaluating and interpolating polynomials on certain special sets of evaluation points. For example, if

is a ring in which

is invertible, and if

is a principal

-th root of unity (see section 2.2 for detailed definitions), then the FFT permits evaluation and interpolation at the points

using only

ring operations in

. Consequently, if

and

are polynomials in

whose product has degree less than

, then the product

can be computed using

ring operations as well.

In [47], Schönhage and Strassen presented two FFT-based algorithms for integer multiplication. In both algorithms, they first use Kronecker segmentation to convert the problem to multiplication of integer polynomials. They then embed these polynomials into

for a suitable ring

and multiply the polynomials by using FFTs over

. The first algorithm takes

and

, and works with finite-precision approximations to elements of

. Multiplications in

itself are handled recursively, by treating them as integer multiplications (after appropriate scaling). The second algorithm, popularly known as the Schönhage–Strassen algorithm, takes

where

is a Fermat number. This algorithm is the faster of the two, achieving the bound

. It benefits from the fact that

is a principal

-th root of unity in

, and that multiplications by powers of

can be carried out efficiently, as they correspond to simple shifts and negations. At around the same time, Pollard pointed out that one can also work with

where

is a prime of the form

, since then

contains primitive

-th roots of unity [39] (although he did not give a bound for

Schönhage and Strassen's algorithm remained the champion for more than thirty years, but was recently superseded by Fürer's algorithm [18]. In short, Fürer managed to combine the advantages of the two algorithms from [47], to achieve the bound

. Fürer's algorithm is based on the ingenious observation that the ring

contains a small number of “fast” principal

-th roots of unity, namely the powers of

, but also a large supply of much higher-order roots of unity inherited from

. To evaluate an FFT over

, he decomposes it into many “short” transforms of length at most

, using the Cooley–Tukey method. He evaluates the short transforms with the fast roots of unity, pausing occasionally to perform “slow” multiplications by higher-order roots of unity (“twiddle factors”). A slightly subtle point of the construction is that we really need, for large

, a principal

-th root of unity

such that

In [15] it was shown that the technique from [39] to compute modulo suitable prime numbers of the form

can be adapted to Fürer's algorithm. Although the complexity of this algorithm is essentially the same as that of Fürer's algorithm, this method has the advantage that it does not require any error analysis for approximate numerical operations in

1.2.Our contributions and outline of the paper

Throughout the paper, integers are assumed to be handled in the standard binary representation. For our computational complexity results, we assume that we work on a Turing machine with a finite but sufficiently large number of tapes [38]. The Turing machine model is very conservative with respect to the cost of memory access, which is pertinent from a practical point of view for implementations of FFT algorithms. Nevertheless, other models for sequential computations could be considered [46, 20]. For practical purposes, parallel models might be more appropriate, but we will not consider these in this paper. Occasionally, for polynomial arithmetic over abstract rings, we will also consider algebraic complexity measures [8, Chapter 4].

In section 2, we start by recalling several classical techniques for completeness and later use: sorting and array transposition algorithms, discrete Fourier transforms (DFTs), the Cooley–Tukey algorithm, FFT multiplication and convolution, Bluestein's chirp transform, and Kronecker substitution and segmentation. In section 3, we also provide the necessary tools for the error analysis of complex Fourier transforms. Most of these tools are standard, although our presentation is somewhat ad hoc, being based on fixed point arithmetic.

In section 4, we describe a simplified version of the new integer multiplication algorithm, without any attempt to minimise the aforementioned constant

. As mentioned in the sketch above, the key idea is to reduce a given DFT over

to a collection of “short” transforms, and then to convert these short transforms back to integer multiplication by a combination of Bluestein's chirp transform and Kronecker substitution.

The complexity analysis of Fürer's algorithm and the algorithm from section 4 involves functional inequalities which contain post-compositions with logarithms and other slowly growing functions. In section 5, we present a few systematic tools for analysing these types of inequalities. For more information on this quite particular kind of asymptotic analysis, we refer the reader to [44, 16].

In section 6, we present an optimised version of the algorithm from section 4, proving in particular the bound

(Theorem 1.1), which constitutes the main result of this paper. In section 7, we outline a similar complexity analysis for Fürer's algorithm. Even after several optimisations of the original algorithm, we were unable to attain a bound better than

. This suggests that the new algorithm outperforms Fürer's algorithm by a factor of

This speedup is surprising, given that the short transforms in Fürer's algorithm involve only shifts, additions and subtractions. The solution to the paradox is that Fürer has made the short transforms too fast. Indeed, they are so fast that they make a negligible contribution to the overall complexity, and his computation is dominated by the “slow” twiddle factor multiplications. In the new algorithm, we push more work into the short transforms, allowing them to get slightly slower; the quid pro quo is that we avoid the factor of two in zero-padding caused by Fürer's introduction of artificial “fast” roots of unity. The optimal strategy is actually to let the short transforms dominate the computation, by increasing the short transform length relative to the coefficient size. Fürer is unable to do this, because in his algorithm these two parameters are too closely linked. To underscore just how far the situation has been inverted relative to Fürer's algorithm, we point out that in our presentation we can get away with using Schönhage–Strassen for the twiddle factor multiplications, without any detrimental effect on the overall complexity.

We have chosen to base most of our algorithms on approximate complex arithmetic. Instead, following [39] and [15], we might have chosen to use modular arithmetic. In section 8, we will briefly indicate how our main algorithm can be adapted to this setting. This variant of our algorithm presents several analogies with its adaptation to polynomial multiplication over finite fields [24].

The question remains whether there exists an even faster algorithm than the algorithm of section 6. In an earlier paper [17], Fürer gave another algorithm of complexity

under the assumption that there exist sufficiently many Fermat primes, i.e., primes of the form

. It can be shown that a careful optimisation of this algorithm yields the bound

. Unfortunately, odds are high that

is the largest Fermat prime. In section 9, we present an algorithm that achieves the bound

under the more plausible conjecture that there exist sufficiently many Mersenne primes (Theorem 1.2). The main technical ingredient is a variant of an algorithm of Crandall and Fagin [12] that permits efficient multiplication modulo

, despite

not being divisible by a large power of two.

It would be interesting to know whether the new algorithms could be useful in practice. We have implemented an unoptimised version of the algorithm from section 8 in the Mathemagix system [29] and found our implementation to be an order of magnitude slower than the Gmp library [23]. There is certainly room for improvement, but we doubt that even a highly optimised implementation of the new algorithm will be competitive in the near future. Nevertheless, the variant for polynomial multiplication over finite fields presented in [24] seems to be a promising avenue for achieving speedups in practical computations. This will be investigated in a forthcoming paper.

Notations. We use Hardy's notations

for

, and

for

and

. The symbol

denotes the set of non-negative real numbers, and

denotes

. We will write

2.Survey of classical tools

This section recalls basic facts on Fourier transforms and related techniques used in subsequent sections. For more details and historical references we refer the reader to standard books on the subject such as [2, 8, 21, 42].

2.1.Arrays and sorting

In the Turing model, we have available a fixed number of linear tapes. An

array

-bit elements is stored as a linear array of

bits. We generally assume that the elements are ordered lexicographically by

, though this is just an implementation detail.

What is significant from a complexity point of view is that occasionally we must switch representations, to access an array (say 2-dimensional) by “rows” or by “columns”. In the Turing model, we may transpose an

matrix of

-bit elements in time

, using the algorithm of [4, Appendix]. Briefly, the idea is to split the matrix into two halves along the “short” dimension, and transpose each half recursively.

We will also require more complex rearrangements of data, for which we resort to sorting. Suppose that

is a totally ordered set, whose elements are represented by bit strings of length

, and suppose that we can compare elements of

in time

. Then an array of

elements of

may be sorted in time

using merge sort [33], which can be implemented efficiently on a Turing machine.

2.2.Discrete Fourier transforms

Let

be a commutative ring with identity and let

. An element

is said to be a principal -th root of unity if

and

for all

. In this case, we define the discrete Fourier transform (or DFT) of an

-tuple

with respect to

to be

where

Remark 2.1. In all of the new algorithms introduced in this paper, we actually work over a field, whose characteristic does not divide

. In this setting, the concept of principal root of unity coincides with the more familiar primitive root of unity. The more general “principal root” concept is only needed for discussions of other algorithms, such as the Schönhage–Strassen algorithm or Fürer's algorithm.

2.3.The Cooley–Tukey FFT

Let

be a principal

-th root of unity and let

where

. Then

is a principal

-th root of unity and

is a principal

-th root of unity. Moreover, for any

and

, we have

and

are algorithms for computing DFTs of length

and

, we may use (2.2) to construct an algorithm

for computing DFTs of length

as follows.

For each

, the sum inside the brackets corresponds to the

-th coefficient of a DFT of the

-tuple

with respect to

. Evaluating these inner DFTs requires

calls to

. Next, we multiply by the twiddle factors

, at a cost of

operations in

. (Actually, fewer than

multiplications are required, as some of the twiddle factors are equal to

. This optimisation, while important in practice, has no asymptotic effect on the algorithms discussed in this paper.) Finally, for each

, the outer sum corresponds to the

-th coefficient of a DFT of an

-tuple in

with respect to

. These outer DFTs require

calls to

Denoting by

the number of ring operations needed to compute a DFT of length

, and assuming that we have available a precomputed table of twiddle factors, we obtain

The corresponding algorithm is denoted

. The

operation is neither commutative nor associative; the above expression will always be taken to mean

Let

be the butterfly algorithm that computes a DFT of length 2 by the formula

. Then

computes a DFT of length

in time

. Algorithms of this type are called fast Fourier transforms (or FFTs).

The above discussion requires several modifications in the Turing model. Assume that elements of

are represented by

bits.

First, for

, we must add a rearrangement cost of

to efficiently access the rows and columns for the recursive subtransforms (see section 2.1). For the general case

, the total rearrangement cost is bounded by

Second, we will sometimes use non-algebraic algorithms to compute the subtransforms, so it may not make sense to express their cost in terms of

. The relation (2.3) therefore becomes

where

is the (Turing) cost of a transform of length

over

, and where

is the cost of a single multiplication in

Finally, we point out that

requires access to a table of twiddle factors

, ordered lexicographically by

, for

. Assuming that we are given as input a precomputed table of the form

, we must show how to extract the required twiddle factor table in the correct order. We first construct a list of triples

, ordered by

, in time

; then sort by

in time

(see section 2.1); then merge with the given root table to obtain a table

, ordered by

, in time

; and finally sort by

in time

. The total cost of the extraction is thus

The corresponding cost for

is determined as follows. Assuming that the table

is given as input, we first extract the subtables of (

)-th roots of unity for

in time

. Extracting the twiddle factor table for the decomposition

then costs

; the total over all

is again

Remark 2.2. An alternative approach is to compute the twiddle factors directly in the correct order. When working over

, as in section 3, this requires a slight increase in the working precision. Similar comments apply to the root tables used in Bluestein's algorithm in section 2.5.

2.4.Fast Fourier multiplication

Let

be a principal

-th root of unity in

and assume that

is invertible in

. Consider two polynomials

and

. Let

be the polynomial defined by

where the product of the DFTs is taken pointwise. By construction, we have

, which means that

for all

. The product

and

modulo

also satisfies

for all

. Consequently,

, whence

For polynomials

with

and

, we thus obtain an algorithm for the computation of

modulo

using at most

operations in

. Modular products of this type are also called cyclic convolutions. If

, then we may recover the product

from its reduction modulo

. This multiplication method is called FFT multiplication.

If one of the arguments (say

) is fixed and we want to compute many products

(or cyclic convolutions) for different

, then we may precompute

, after which each new product

can be computed using only

operations in

2.5.Bluestein's chirp transform

We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs — of arbitrary length, not necessarily a power of two — to polynomial multiplication [3], as follows.

Let

be a principal

-th root of unity. For simplicity we assume that

is even, and that there exists some

with

. Consider the sequences

Now let

and

modulo

. Then (2.5) implies that

for all

. In other words, the computation of a DFT of even length

reduces to a cyclic convolution product of the same length, together with

additional operations in

. Notice that the polynomial

is fixed and independent of

in this product.

The only complication in the Turing model is the cost of extracting the

in the correct order, i.e., in the order

, given as input a precomputed table

. We may do this in time

by applying the strategy from section 2.3 to the pairs

for

. Similar remarks apply to the

Remark 2.3. It is also possible to give variants of the new multiplication algorithms in which Bluestein's transform is replaced by a different method for converting DFTs to convolutions, such as Rader's algorithm [41].

2.6.Kronecker substitution and segmentation

Multiplication in

may be reduced to multiplication in

using the classical technique of Kronecker substitution [21, Corollary 8.27]. More precisely, let

and

, and suppose that we are given two polynomials

of degree less than

, with coefficients

and

satisfying

and

. Then for the product

we have

. Consequently, the coefficients of

may be read off the integer product

where

. Notice that the integers

and

have bit length at most

, and the encoding and decoding processes have complexity

The inverse procedure is Kronecker segmentation. Given

and

, and non-negative integers

and

, we may reduce the computation of

to the computation of a product

of two polynomials

of degree less than

, and with

and

where

. Indeed, we may cut the integers into chunks of

bits each, so that

and

. Notice that we may recover

from

using an overlap-add procedure in time

. In our applications, we will always have

, so that

Kronecker substitution and segmentation can also be used to handle Gaussian integers (and Gaussian integer polynomials), and to compute cyclic convolutions. For example, given polynomials

with

, then for

we have

, so we may recover

from the cyclic Gaussian integer product

, where

. In the other direction, suppose that we wish to compute

for some

. We may assume that the “real” and “imaginary” parts of

and

are non-negative, and so reduce to the problem of multiplying

, where

and

, and where the real and imaginary parts of

are non-negative and have at most

bits.

3.Fixed point computations and error bounds

In this section, we consider the computation of DFTs over

in the Turing model. Elements of

can only be represented approximately on a Turing machine. We describe algorithms that compute DFTs approximately, using a fixed-point representation for

, and we give complexity bounds and a detailed error analysis for these algorithms. We refer the reader to [7] for more details about multiple precision arithmetic.

For our complexity estimates we will freely use the standard observation that

, since the multiplication of two integers of bit length

reduces to

multiplications of integers of bit length

, for any fixed

3.1.Fixed point numbers

We will represent fixed point numbers by a signed mantissa and a fixed exponent. More precisely, given a precision parameter

, we denote by

the set of complex numbers of the form

, where

for integers

and

satisfying

, i.e.,

. We write

for the set of complex numbers of the form

, where

and

; in particular, for

we always have

. At every stage of our algorithms, the exponent

will be determined implicitly by context, and in particular, the exponents do not have to be explicitly stored or manipulated.

In our error analysis of numerical algorithms, each

is really the approximation of some genuine complex number

. Each such

comes with an implicit error bound

; this is a real number for which we can guarantee that

. We also define the relative error bound for

. We finally denote by

the “machine accuracy”.

Remark 3.1. Interval arithmetic [36] (or ball arithmetic [28, Chapter 3]) provides a systematic method for tracking error bounds by storing the bounds along with

. We will use similar formulas for the computation of

and

, but we will not actually store the bounds during computations.

3.2.Basic arithmetic

In this section we give error bounds and complexity estimates for fixed point addition, subtraction and multiplication, under certain simplifying assumptions. In particular, in our DFTs, we only ever need to add and subtract numbers with the same exponent. We also give error bounds for fixed point convolution of vectors; the complexity of this important operation is considered later.

For

, we define the “round towards zero” function

and

. For

, we define

. Notice that

and

for any

Proposition 3.3 may be generalised to numerical cyclic convolution of vectors as follows.

Proof. Let

denote the exact convolution, and write

and

. As in the proof of Proposition 3.3, we obtain

and

3.3.Precomputing roots of unity

Let

and

. Let

be the branch of the square root function such that

for

. Using Newton's method [7, Section 3.5] and Schönhage–Strassen multiplication [47], we may construct a fixed point square root function

, which may be evaluated in time

, such that

for all

. For example, we may first compute some

such that

and

, and then take

; the desired bound follows since

Proof. The mean value theorem implies that

where

. For

we have

; hence

. By construction

. We conclude that

Proof. It suffices to compute

. Starting from

and

, for each

, we compute

for

using

and

otherwise. Performing all computations with temporarily increased precision

and corresponding

, Lemma 3.5 yields

. This also shows that the hypothesis

is always satisfied, since

. After rounding to

bits, the relative error is at most

3.4.Error analysis for fast Fourier transforms

A tight algorithm for computing DFTs of length

is a numerical algorithm that takes as input an

-tuple

and computes an approximation

to the DFT of

with respect to

(or

in the case of an inverse transform), such that

We assume for the moment that any such algorithm has at its disposal all necessary root tables with relative error not exceeding

. Propositions 3.2 and 3.3 directly imply the following:

Proof. The inner and outer DFTs contribute factors of

and

, and by Proposition 3.3 the twiddle factor multiplications contribute a factor of

. Thus

4.A simple and fast multiplication algorithm

In this section we give the simplest version of the new integer multiplication algorithm. The key innovation is an alternative method for computing DFTs of small length. This new method uses a combination of Bluestein's chirp transform and Kronecker substitution (see sections 2.5 and 2.6) to convert the DFT to a cyclic integer product in

for suitable

Proof. Let

, and suppose that we wish to compute the DFT of

. Using Bluestein's chirp transform (notation as in section 2.5), this reduces to computing a cyclic convolution of suitable

and

. We assume that the

and

have been precomputed with

We may regard

and

as cyclic polynomials with complex integer coefficients, i.e., as elements of

. Write

and

, where

with

and

. Now we compute the exact product

using Kronecker substitution. More precisely, we have

, so it suffices to compute the cyclic integer product

, where

. Then

is the exact convolution of

and

, and rounding

to precision

yields

in the sense of Proposition 3.4. A final multiplication by

yields the Fourier coefficients

To establish tightness, observe that

and

, so Proposition 3.4 yields

where

; we conclude that

. For

, this means that the algorithm is tight; for

, we may take

For the complexity, observe that the product in

reduces to three integer products of size

. These have cost

, and the algorithm also performs

multiplications in

, contributing the

term.

Remark 4.2. A crucial observation is that, for suitable parameters, the DFT algorithm in Proposition 4.1 is actually faster than the conventional Cooley–Tukey algorithm of Corollary 3.9. For example, if we assume that

, then to compute a transform of length

over

with

, the Cooley–Tukey approach has complexity

, whereas Proposition 4.1 yields

, an improvement by a factor of roughly

Proof. We first reduce our integer product to a polynomial product using Kronecker segmentation (section 2.6). Splitting the two

-bit inputs into chunks of

bits, we need to compute a product of polynomials

with non-negative

-bit coefficients and degrees less than

. The coefficients of

have

bits, and we may deduce the desired integer product

in time

Let

. To compute

, we will use DFTs of length

over

, where

. Zero-padding

to obtain a sequence

, and similarly for

, we compute the transforms

with respect to

as follows.

Let

and

. Write

with

for

and

. We use the algorithm

(see section 2.3), where for

we take

to be the tight algorithm

for DFTs of length

given by Proposition 4.1, and where

as in Corollary 3.9. In other words, we split the

usual radix-2 layers of the FFT into groups of

layers, handling the transforms in each group with the Bluestein–Kronecker reduction, and then using ordinary Cooley–Tukey for the remaining

layers.

We next compute the pointwise products

, and then apply an inverse transform

defined analogously to

. A final division by

(which is really just an implicit adjustment of exponents) yields approximations

Since

and

are tight by Propositions 3.8, 4.1 and Corollary 3.9, we have

, and similarly for

. Thus

, so

after the inverse transform (since

for

). In particular,

, so we obtain the exact value of

by rounding to the nearest integer.

Now we analyse the complexity. Using Proposition 3.6, we first compute a table of roots

in time

, and then extract the required twiddle factor tables in time

(see section 2.3). For the Bluestein reductions, we may extract a table of

-th roots in time

, and then rearrange them as required in time

(see section 2.5). These precomputations are then all repeated for the inverse transforms.

for all

. Let

for

. Increasing

if necessary, we may assume that

for

, so that the function

is well-defined. Increasing

if necessary, we may also assume that

for all

We prove by induction on

that

for all

. If

, then

, so the bound holds. Now suppose that

. Since

, we have

, so by induction

5.Logarithmically slow recurrence inequalities

This section is devoted to developing a framework for handling recurrence inequalities, similar to (4.1), that appear in subsequent sections.

Let

be a smooth increasing function, for some

. We say that

is an iterator of

is increasing and if

For instance, the standard iterated logarithm

defined in (1.2) is an iterator of

. An analogous iterator may be defined for any smooth increasing function

for which there exists some

such that

for all

. Indeed, in that case,

is well-defined and satisfies (5.1) for all

. It will sometimes be convenient to increase

so that

is satisfied on the whole domain of

for

. For example, the functions

and

are logarithmically slow, with

respectively.

Proof. The case

is clear. For

, let

. By induction

for large

, so

for large

In this paper, the main role played by logarithmically slow functions is to measure size reduction in multiplication algorithms. In other words, multiplication of objects of size

will be reduced to multiplication of objects of size

, where

for some logarithmically slow function

. The following result asserts that, from the point of view of iterators, such functions are more or less interchangeable with

Proof. First consider the case where

in (5.2), i.e., assume that

for some constant

and all

. Increasing

and

if necessary, we may assume that

for all

, and that

Now, given any

, let

, so

. For any

we have

, so

-fold iteration of (5.3), starting with

, yields

Now consider the general case

. Let

, so that

is an iterator of

. By the above argument

, and so

The next result, which generalises and refines the argument of Theorem 4.4, is our main tool for converting recurrence inequalities into actual asymptotic bounds for solutions. We state it in a slightly more general form than is necessary for the present paper, anticipating the more complicated situation that arises in [24].

Proof. Let

and

be as above. Define

for

. We claim that there exists

, depending only on

and

, such that

and hence

. This last inequality also clearly holds if

(since

). By Lemma 5.2 we obtain

for all

. Indeed, let

. If

then (5.5) holds as

. If

then

by (5.4), so

and hence

for all

. The base case

, i.e.,

, holds by assumption. Now assume that

, so

. By hypothesis there exist

, and

with

, such that

6.Even faster multiplication

In this section, we present an optimised version of the new integer multiplication algorithm. The basic outline is the same as in section 4, but our goal is now to minimise the “expansion factor” at each recursion level. The necessary modifications may be summarised as follows.

We begin with a technical preliminary. To perform multiplication in

efficiently using FFT multiplication, we need

to be divisible by a high power of two. We say that an integer

is admissible if

, where

(note that

for all

). We will need a function that rounds a given

up to an admissible integer. For this purpose we define

for

. Note that

may be computed in time

Proof. We have

, which implies (6.1). Since

and

, we have

and thus

, i.e.,

. In particular

, so

is admissible. (In fact, one easily checks that

is the smallest admissible integer

Remark 6.2. It is actually possible to drop the requirement that

be divisible by a high power of two, by using the Crandall–Fagin method (see section 9). We prefer to avoid this approach in this section, as it adds an unnecessary layer of complexity to the presentation.

Now let

be admissible, and consider the problem of computing

products

with

, i.e.,

products with one fixed operand. Denote the cost of this operation by

. Our algorithm for this problem will perform

forward DFTs and

inverse DFTs, so it is convenient to introduce the normalisation

This is well-defined since clearly

. Roughly speaking,

may be thought of as the notional cost of a single DFT.

The problem of multiplying

-bit integers may be reduced to the above problem by using zero-padding, i.e., by taking

and

. Since

and

, we obtain

. Thus it suffices to obtain a good bound for

The recursive step in the main multiplication algorithm involves computing “short” DFTs via the Bluestein–Kronecker device. As pointed out in section 2.5, this leads to a cyclic convolution with one fixed operand. To take advantage of the fixed operand, let

denote the cost of computing

independent DFTs of length

over

, and let

. Then we have the following refinement of Proposition 4.1. As usual we assume that the necessary Bluestein root table has been precomputed.

Proof. We use the same notation and algorithm as in the proof of Proposition 4.1, except that in the Kronecker substitution we take

, so that the resulting integer multiplication takes place in

. The proof of tightness is identical to that of Proposition 4.1 (this is where we use the assumption

). For the complexity bound, note that

is admissible by construction, so for any

we have

Here we have used the fact that

is fixed over all these multiplications. Dividing by

and taking suprema over

yields the result.

The next result gives the main recurrence satisfied by

(compare with Theorem 4.3).

Proof. Let

be admissible and sufficiently large, and consider the problem of computing

products

, for

. Let

, so that

, and let

We cut the inputs into

chunks of size

, i.e., if

is one of the

inputs, we write

, where

, and where the real and imaginary parts of

have absolute value at most

. Thus

, and for any

we may encode

as a polynomial

We will multiply the desired (cyclic) polynomials by using DFTs of length

over

where

. We construct the DFTs in a similar way to section 4. Let

and

. Write

with

for

and

. We use the tight algorithm

, where for

we take

to be the tight algorithm

for DFTs of length

given by Proposition 6.3, and where

as in Corollary 3.9. Thus, for the first

groups of

layers, we use Bluestein–Kronecker to reduce to complex integer convolution of size

, and the remaining layers are handled using ordinary Cooley–Tukey. We write

for the analogous inverse transform.

To check the hypothesis of Proposition 6.3, we observe that

for sufficiently large

, as

is divisible by

where

, and

Denote by

the cost of a single invocation of

(or

). By Corollary 3.9 and (2.4), we have

The last term is the rearrangement cost, and simplifies to

. The second term covers the invocations of

, and simplifies to

, so is absorbed by the

term. The first term covers the invocations of

. By definition

, and since

, Proposition 6.3 yields

We will use Schönhage–Strassen's algorithm for fixed point multiplications in

. Since

, we may take

. Thus the

term becomes

(We could of course use our algorithm recursively for these multiplications; however, it turns out that Schönhage–Strassen is fast enough, and leads to simpler recurrences. In fact, the algorithm asymptotically spends more time rearranging data than multiplying in

To compute the desired

products, we must execute

forward transforms and

inverse transforms. For each product, we must also perform

pointwise multiplications in

, at cost

. As in the proof of Theorem 4.3, the cost of all necessary root table precomputations is also bounded by

. Thus we obtain

The error analysis is almost identical to the proof of Theorem 4.3, the only difference being that

is replaced by

. Denoting one of the

products by

, we have

exactly as in Theorem 4.3. Thus

, and again we obtain

by rounding to the nearest integer.

Finally we show how to define

. We already observed that

. Thus there exists a constant

such that

for large

, so we may take

Proof of Theorem 1.1. Let

and

be as in Theorem 6.4. Increasing

if necessary, by Lemma 5.1 we may assume that

for

, and that

Let

for admissible

. By the theorem, there exist constants

such that for all admissible

, there exists an admissible

with

Increasing

if necessary, we may also assume that

for all admissible

. Taking

to be the set of admissible integers, we apply Proposition 5.3 with

, and for each admissible

setting

and

as above. We conclude that

, and hence

runs over admissible integers. We already pointed out that

7.An optimised variant of Fürer's algorithm

As pointed out in the introduction, Fürer proved that

for some

, but did not give an explicit bound for

. In this section we sketch an argument showing that one may achieve

in Fürer's algorithm, by reusing tools from previous sections, especially section 6.

At the core of Fürer's algorithm is the ring

, which contains the principal

-th root of unity

. Note that

is a direct sum of

copies of

, and hence not a field (for

). A crucial observation is that

is a “fast” root of unity, in the sense that multiplication by

and its powers can be achieved in linear time, as in Schönhage–Strassen's algorithm. For any

, we need to construct a

-th root

, which is itself a

-th principal root of unity. We recall Fürer's construction of

as follows.

As our basic recursive problem, we will consider multiplication in

, where

is divisible by a high power of two. We will refer to the last property as “admissibility”, but we will not define it precisely. We write

for the cost of

such products with one fixed argument, and

for the normalised cost, exactly as in section 6.

Fürer worked with

rather than

, but, since we are interested in constant factors, and since the recursive multiplication step involves multiplication of complex quantities, it simplifies the exposition to work systematically with complexified objects everywhere.

For suitable parameters

and

, we will encode elements of

as (nega)cyclic polynomials in

, where

as above. We choose the parameters later; for now we require only that

divides

and that

(so that the coefficients are not too small).

The encoding proceeds as follows. Given

, we split

into

parts

bits. Each

is cut into

even smaller pieces

bits. Then

is encoded as

and an element

is encoded as

. (Notice that the coefficients of

are zero for

; this zero-padding is the price Fürer pays for introducing artificial roots of unity.)

We represent complex coefficients by elements of

for a suitable precision parameter

. The exponent

varies during the algorithm, as explained in [19]; nevertheless, additions and subtractions only occur for numbers with the same exponent, as in the algorithms from sections 4 and 6.

Given

, to successfully recover the product

from the polynomial product

, we must choose

, where

is an allowance for numerical error. Certainly

, and, as shown by Fürer, we may also take

(an analogous conclusion is reached in sections 4 and 6). Thus we may assume that

We must now show how to compute a product

, for

. Fürer handles these types of multiplications using “half-DFTs”, i.e., DFTs that evaluate at odd powers of

, where

is a principal

-th root of unity such that

(Lemma 7.1). To keep terminology and notation consistent with previous sections, we prefer to make the substitution

, i.e., writing

, we put

, and similarly for

and

. This reduces the problem to computing the product

. The change of variable imposes a cost of

, where

is the cost of a multiplication in

So now consider a product

, where

. Let

, so that

. Let

, and write

with

for

and

. For each

, let

be the algorithm for DFTs of length

that applies the usual Cooley–Tukey method, taking advantage of the fast

-th root of unity

The complexity of

, since it performs

linear-time operations on objects of bit size

. Let

be the complexity of the algorithm

for DFTs of length

over

. Then (2.4) yields

Let us now consider the second term

, which describes the cost of the twiddle factor multiplications. This term turns out to be the dominant one. Both Kronecker substitution and FFT multiplication may be considered for multiplication in

, but it turns out that Kronecker substitution is faster (a similar phenomenon was noted in Remark 4.2). So we reduce multiplication in

to multiplication in

where

is admissible and divisible by

. For any reasonable definition of admissibility we then have

, provided that

is somewhat smaller than

. (In the interests of brevity, we will not specify the

terms for the remainder of the argument. They can all be controlled along the lines of section 6.) Most of the twiddle factors are reused many times, so we will assume that

, where the factor

counts the two (rather than three) DFTs needed for each multiplication of size

. The term of interest then becomes

To minimise the leading constant, we must choose

to grow faster than

, and

to grow faster than

. For example, taking

and

leads to

and

. The function mapping

is then bounded by a logarithmically slow function, and a similar argument to section 6 shows that

8.Fast multiplication using modular arithmetic

Shortly after Fürer's algorithm appeared, De et al [15] presented a variant based on modular arithmetic that also achieves the complexity bound

for some

. Roughly speaking, they replace the coefficient ring

with the field

-adic numbers, for a suitable prime

. In this context, working to “finite precision” means performing computations in

, where

is a precision parameter.

The main advantage of this approach is that the error analysis becomes trivial; indeed

is a ring (unlike our

), and arithmetic operations never lead to precision loss (unless one divides by

, which never happens in these algorithms). The main disadvantage is that there are certain technical difficulties associated with finding an appropriate

; this is discussed in section 8.2 below.

The aim of this section is to sketch an analogue of the algorithm of section 6 that achieves

using modular arithmetic instead of

. We assume familiarity with

-adic numbers, referring the reader to [22] for an elementary introduction.

8.1.Sketch of the algorithm

For the basic problem, we take multiplication in

, where

is admissible (in the sense of section 6) and where one of the arguments is fixed over

multiplications. As before, we take

, and cut the inputs into chunks of

bits. Thus we reduce to multiplying polynomials in

with coefficients of at most

bits. The coefficients of the product have at most

bits.

Let

be a prime such that

, so that

contains a primitive

-th root of unity

. The problem of finding such

and

is discussed in the next section; for now we assume only that

. We may then embed the multiplication problem into

, and use DFTs with respect to

to compute the product. On a Turing machine, we cannot represent elements of

exactly, so we perform all computations in

where

This choice ensures that

, so knowledge of the product in

determines it unambiguously in

To compute each DFT, we first use the Cooley–Tukey algorithm to decompose it into “short transforms” of length

, where

. (As in section 6, there are also residual transforms of length

for some

, whose contribution to the complexity is negligible.) Multiplications in

, such as the multiplications by twiddle factors, are handled using Schönhage–Strassen's algorithm, with the divisions by

being reduced to multiplication via Newton's method. We then use Bluestein's algorithm to convert each short transform to a cyclic convolution of length

over

, and apply Kronecker substitution to convert this to multiplication in

, where

is the smallest admissible integer exceeding

. This multiplication is then handled recursively.

Now, since

and

, we have

, and hence

, just as in section 6. The rest of the complexity analysis follows exactly as in the proof of Theorem 6.4, except for the computation of

and

, which is considered below.

Remark 8.1. The role of the precision parameter

is to give some extra flexibility regarding the choice of

. If there was an efficient way to find a prime

larger than

(but not too much larger), and an efficient way to find a suitable

-th root of unity modulo

, then we could always take

and obtain an algorithm working directly over the finite field

8.2.Computing suitable

and

Given a transform length

for

, our aim is to find a prime

such that

, i.e., such that

divides

. Denote by

the smallest such prime.

Heath-Brown has conjectured that

[25], but given the current state of knowledge in number theory, we are only able to prove a result of the following type.

Proof. This is a special case of Linnik's theorem [34, 35], which states that there exist constants

and

such that for any

with

, there exists a prime number

with

. The best currently known estimate

for

is due to Xylouris [53]. Applying this result for

and

, we get the bound

for large enough

. The complexity bound follows by testing

for primality until we find

, using a polynomial time primality test [1].

The difficulty with this result — already noted in [15] — is that the time required to find

greatly exceeds the time bound we are trying to prove for

To avoid this problem, De et al suggested using a multivariate splitting, i.e., by encoding each integer as a polynomial in

for suitable

, say

. One then uses

-dimensional DFTs to multiply the polynomials. Since the transform length is shorter, one can get away with a smaller

. Unfortunately, this introduces further zero-padding and leads to a larger value of

, ruining our attempt to achieve the bound

On the other hand, we note that the problem only really occurs at the top recursion level. Indeed, at deeper recursion levels, there is exponentially more time available at the previous level to compute

. So one possible workaround is to use a different, sufficiently fast algorithm at the top level, such as Fürer's algorithm, and then switch to the algorithm sketched in section 8.1 for the remaining levels. In this way one still obtains the bound

, and asymptotically almost all of the computation is done using the algorithm of section 8.1.

If one insists on avoiding

entirely, there are still many choices: one could use the algorithm of De et al at the top level, or use a multivariate version of the algorithm of section 8.1. One could even use the Schönhage–Strassen algorithm, whose main recursive step yields the bound

; applying this three times gives

, and then to multiply integers with

bits, one can find a suitable prime using Lemma 8.2 in time

Another way to work around the problem is to assume the generalised Riemann hypothesis (GRH). De et al pointed out that under GRH, it is possible to find a suitable prime efficiently using a randomised algorithm. Here we show that, under GRH, we can even use deterministic algorithms.

Proof. The first bound is given in [26], and the complexity bound follows similarly to the proof of Lemma 8.2.

To use this result, we must modify the algorithm of section 8.1 slightly. Choose a constant

so that we can compute

in time

, as in Lemma 8.3. Increase the coefficient size from

, and change the definition of admissibility accordingly. The transform length then decreases to

, and the cost of computing

decreases to only

. The rest of the complexity analysis is essentially unchanged; the result is an algorithm with complexity

, working entirely with modular arithmetic, in which the top recursion level does not need any special treatment.

Finally, we consider the computation of a suitable approximation to a

-th root of unity in

Proof. We may find a generator

deterministically in time

[48]. Then

is a primitive

-th root of unity in

, and there is a unique primitive

-th root of unity

congruent to

modulo

. Given

, we may compute

using fast Newton lifting in time

[9, Section 12.3].

In the context of section 8.1, we may assume that

and

, so the cost of finding

. This is certainly less than the cost of finding

itself, using either Lemma 8.2 or Lemma 8.3.

9.Conjecturally faster multiplication

It is natural to ask whether the approaches from sections 6, 7 or 8 can be further optimised, to obtain a complexity bound

with

In Fürer's algorithm, the complexity is dominated by the cost of multiplications in

. If we could use a similar algorithm for a much simpler

, then we might achieve a better bound. Such an algorithm was actually given by Fürer [17], under the assumption that there exist sufficiently many Fermat primes, i.e., primes of the form

. More precisely, his algorithm requires that there exists a positive integer

such that for every

, the sequence

contains a prime number. The DFTs are then computed directly over

for suitable

, taking advantage of the fact that

contains a fast

-th primitive root of unity (namely the element 2) as well as a

-th primitive root of unity. It can be shown that a suitably optimised version of this hypothetical algorithm achieves

: we still pay a factor of two due to the fact that we compute both forward and inverse transforms, and we pay another factor of two for the zero-padding in the recursive reduction. Unfortunately, it is likely that

is the last Fermat prime [13].

In the

algorithm of section 6, a potential bottleneck arises during the short transforms, when we use Kronecker substitution to multiply polynomials in

. We really only need the high

bits of each coefficient of the product (i.e., of the real and imaginary parts), but we are forced to allocate roughly

bits per coefficient in the Kronecker substitution, and then we discard roughly half of the output. This problem is similar to the well-known obstruction that prevents us from using FFT methods to compute a “short product”, i.e., the high

bits or low

bits of the product of two

-bit integers, any faster than computing the full

bits.

In this section, we present a variant of the algorithm of section 6, in which the coefficient ring

is replaced by a finite field

, where

is a Mersenne prime. Thus “short products” are replaced by “cyclic products”, namely by multiplications modulo

. This saves a factor of two at each recursion level, and consequently reduces

from

This change of coefficient ring introduces several technical complications. First, it is of course unknown if there are infinitely many Mersenne primes. Thus we are forced to rely on unproved conjectures about the distribution of Mersenne primes.

Second,

is always prime (except possibly at the top recursion level). Thus we cannot cut up an element of

into equal-sized chunks with an integral number of bits, and still expect to take advantage of cyclic products. In other words,

is very far from being admissible in the sense of section 6. To work around this, we deploy a variant of an algorithm of Crandall and Fagin [12], which allows us to work with chunks of varying size. The Crandall–Fagin algorithm was originally presented over

, and depended crucially on the fact that

contains suitable roots of

. In our setting, we work over

, where

is a Mersenne prime exponentially smaller than

. Happily,

contains suitable roots of

, and this enables us to adapt their algorithm to our setting. Moreover, since

, the field

contains roots of unity of high power-of-two order, namely of order

, so we can perform FFTs over

very efficiently.

Finally, we can no longer use Kronecker substitution, as this would reintroduce the very zero-padding we are trying to avoid. Instead, we take our basic problem to be polynomial multiplication over

(where

is not necessarily prime). After the Crandall–Fagin splitting step, we have a bivariate multiplication problem over

, which is solved using 2-dimensional FFTs over

. These FFTs are in turn reduced to 1-dimensional FFTs using standard methods; this dimension reduction is, roughly speaking, the analogue of Kronecker substitution in this algorithm. (Indeed, it is also possible to give an algorithm along these lines that works over

but avoids Kronecker substitution entirely; this still yields

because of the “short product” problem mentioned above.) For the 1-dimensional transforms, we use the same technique as in previous sections: we use Cooley–Tukey's algorithm to decompose them into “short transforms” of exponentially shorter length, then use Bluestein's method to convert them to (univariate) polynomial products, and finally evaluate these products recursively.

9.1.Mersenne primes

Let

denote the number of Mersenne primes less than

. Based on probabilistic arguments and numerical evidence, Lenstra, Pomerance and Wagstaff have conjectured that

, where

is the Euler constant [52, 40]. Our fast multiplication algorithm relies on the following slightly weaker conjecture.

An integer of the form

may be tested for primality in time

using the Lucas–Lehmer primality test [13]. A simple way to compute

is to apply this test successively for all

; this takes time

. A primitive

-th root of unity

may be computed by the formula

in time

; see [43] or [14, Corollary 5].

9.2.Crandall and Fagin's algorithm revisited

Let

be a Mersenne number (not necessarily prime). The main integer multiplication algorithm depends on a variant of Crandall and Fagin's algorithm that reduces multiplication in

to multiplication in

, where

is a suitably smaller Mersenne prime (assuming that such a prime exists).

To explain the idea of this reduction, we first consider the simpler univariate case, in which we reduce multiplication in

to multiplication in

. Here we require that

, that

and that

. For any

, we will write

and

Assume that we wish to compute the product of

. Considering

and

as elements of

modulo

, we decompose them as

We regard

and

as complex “digits” of

and

, where the base

varies with the position

. Notice that

takes only two possible values:

so that

. For any

, define

as follows. Choose

so that

lies in the interval

, and put

Since the left hand side lies in the interval

, this shows that

. Now, since

and

, we have

Since

and similarly for

, we have

. Note that we may recover

from

in time

, by a standard overlap-add procedure (provided that

Let

be the inverse of

modulo

; this inverse exists since we assumed

. Let

, so that

since

has order

. The quantity

plays the same role as the real

-th root of

appearing in Crandall–Fagin's algorithm.

coincides with the reinterpretation of

as an element of

. Moreover, we may recover

unambiguously from

, as

and

. Altogether, this shows how to reduce multiplication in

to multiplication in

Remark 9.3. The pair

can be computed from

bit operations, so we may compute the sequences

and

in time

. Moreover, since

takes on only two possible values, we may compute the sequence

using

multiplications in

9.3.Bivariate Crandall–Fagin reduction

Generalising the discussion of the previous section, we now show how to reduce multiplication in

, for a given

, to multiplication in

. For this, we require that

, that

and that

Indeed, consider two cyclic polynomials

and

. We cut each of the coefficients

into

chunks

and

of bit size at most

, using the same varying base strategy as above. With

and

as before, we next form the bivariate cyclic polynomials

Using the assumption that

, we recover the coefficients

, and hence the product

, from the bivariate cyclic convolution product

9.4.Conjecturally faster multiplication

Let

and

(not necessarily prime). We will take our basic recursive problem to be multiplication in

for suitable

. We need

somewhat larger than

; this is analogous to the situation in section 6, where we chose a “short transform length” somewhat larger than the coefficient size. Thus we set

where

is defined as follows.

Proof. Let

. Using [6], we may construct a function

such that

for all

, and which may be computed in time

. One checks that

for all

, so

for

. Thus

is increasing, and

has the desired properties.

We say that an integer

is admissible if it is of the form

where

for some

. (This should not be confused with the notion of admissibility of section 6.) An element of

is then represented by

bits. Note that

is strictly increasing, so there is a one-to-one correspondence between integers

and admissible

. For

we define

to be the smallest admissible integer

Suppose that we wish to compute

for some

. We assume that

is large enough that the definition

makes sense and so that

. One checks that

, so

and hence

. To find the smallest suitable

, we may simply compute

for each

, and compare with

. This takes time

Now let

and

. Consider the problem of computing

products

with

. We denote by

the complexity of this problem, where

is the admissible integer corresponding to

. As in section 6, we define

Notice that multiplication of two integers of bit size

reduces to the above problem, for

, via a suitable Kronecker segmentation. Indeed, let

for some

, and encode the integers as integer polynomials of degree less than

with coefficients of bit size

. The desired product may be recovered from the product in

, as

for large

. Thus, as in section 6, we have

, and it suffices to obtain a good bound for

Now suppose additionally that

is prime. In this case

is a field, and as noted above, it contains

-th roots of unity, so we may define DFTs of length

over

for any

. In particular, for

we may use Bluestein's algorithm to compute DFTs of length

. Denote by

the cost of evaluating

independent DFTs of length

over

, and put

. Here we assume as usual that a

-th root of unity is known, and that the corresponding Bluestein root table has been precomputed.

Let us apply these definitions in the case

; this is permissible, as

for sufficiently large

. Since convolution of length

over

is exactly the basic recursive problem, and since one of the operands is fixed, we have

, where

, and hence

Proof. Let

with

. Assume that we wish to compute

products with one fixed operand. Our goal is to reduce to a problem of the same form, but for exponentially smaller

Choose parameters. Let

be the smallest Mersenne prime larger than

. By Proposition 9.2, we have

, whence

, for some absolute constant

. Moreover, we may compute

, together with a primitive

-th root of unity

, in time

. We define

and

The algorithm must perform various multiplications in

, at cost

. For simplicity we will use Schönhage–Strassen's algorithm for these multiplications, i.e., we will take

. Since

, we have

Crandall–Fagin reduction. We use the framework of section 9.3 to reduce the basic multiplication problem in

to multiplication in

for suitable

. We take

where

We also write

. The definition of

makes sense for large

since

. Let us check that the hypotheses of section 9.3 are satisfied for large

. We have

and hence

; in particular,

, so

, and also

. Since

, we also have

, and thus

since

To assess the cost of the Crandall–Fagin reduction, we note that computing the

and

costs

(see Remark 9.3), the splitting itself and final overlap-add phase require time

, and the various multiplications by

and

have cost

Reduction to power-of-two lengths. Next we reduce multiplication in

to multiplication in

, where

. In fact, since

, these rings are isomorphic, via the map that sends

and

. Evaluating this isomorphism corresponds to rearranging the coefficients according to the rule

, where

is the exponent of

and where

and

are the exponents of

and

. This may be achieved in time

using the same sorting strategy as in section 2.3. The inverse rearrangement is handled similarly.

Reduction to univariate transforms. For multiplication in

, we will use bivariate DFTs over

. This is possible because

contains both

-th and

-th primitive roots of unity, namely

and

, since

and

. More precisely, we must perform

forward bivariate DFTs and

inverse bivariate DFTs of length

over

, and

multiplications in

. Each bivariate DFT reduces further to

univariate DFTs of length

over

(with respect to

) and

univariate DFTs of length

over

(with respect to

). Interspersed between these steps are various matrix transpose operations of total cost

, to enable efficient access to the “rows” and “columns” (see section 2.1).

Multiplications in

are handled by zero-padding, i.e., we first use Cooley–Tukey to multiply in

, and then reduce modulo

. The total cost of these multiplications is

Reduction to short transforms. Consider one of the “long” univariate DFTs of length

over

. We decompose the DFT into “short” DFTs of length

as follows. Let

and

, and write

where

for

and

. We use the algorithm

, where for

we take

to be the algorithm based on Bluestein's method (discussed immediately before (9.4)), and where

is the usual Cooley–Tukey algorithm over

. Let

be the cost of a single invocation of

(or of the corresponding inverse transform

). By (2.4) we have

The cost of precomputing the necessary root tables is only

. By definition

. From (9.4) and the estimate

, the first term becomes

Denoting by

the cost of a bivariate DFT of length

over

, we thus have (ignoring the transposition costs, which were included earlier)

We must perform

bivariate DFTs; the bound (9.5) then follows exactly as in the proof of Theorem 6.4.

For large

, we have

, so

. Thus there exists a constant

such that

for large

, and we may take

Proof of Theorem 1.2. Follows from Theorem 9.6 and Proposition 5.3, analogously to the proof of Theorem 1.1.

Bibliography

[1]: M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Annals of Math., 160(2):781–793, 2004.
[2]: A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The design and analysis of computer algorithms. Addison-Wesley, 1974.
[3]: L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
[4]: A. Bostan, P. Gaudry, and É. Schost. Linear recurrences with polynomial coefficients and application to integer factorization and Cartier-Manin operator. SIAM J. Comput., 36:1777–1806, 2007.
[5]: C. B. Boyer. A History of Mathematics. Princeton Univ. Press, first paperback edition, 1985.
[6]: R. P. Brent. Fast multiple-precision evaluation of elementary functions. J. Assoc. Comput. Mach., 23(2):242–251, 1976.
[7]: R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010.
[8]: P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. Springer-Verlag, 1997.
[9]: H. Cohen, G. Frey, R. Avanzi, Ch. Doche, T. Lange, K. Nguyen, and F. Vercauteren, editors. Handbook of elliptic and hyperelliptic curve cryptography. Discrete Mathematics and its Applications. Chapman & Hall/CRC, Boca Raton, FL, 2006.
[10]: S. A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.
[11]: J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
[12]: R. Crandall and B. Fagin. Discrete weighted transforms and large-integer arithmetic. Math. Comp., 62(205):305–324, 1994.
[13]: R. Crandall and C. Pomerance. Prime numbers. A computational perspective. Springer, New York, 2nd edition, 2005.
[14]: R. Creutzburg and M. Tasche. Parameter determination for complex number-theoretic transforms using cyclotomic polynomials. Math. Comp., 52(185):189–200, 1989.
[15]: A. De, P. P. Kurur, C. Saha, and R. Saptharishi. Fast integer multiplication using modular arithmetic. SIAM J. Comput., 42(2):685–699, 2013.
[16]: J. Écalle. Introduction aux fonctions analysables et preuve constructive de la conjecture de Dulac. Hermann, collection: Actualités mathématiques, 1992.
[17]: M. Fürer. On the complexity of integer multiplication (extended abstract). Technical Report CS-89-17, Pennsylvania State University, 1989.
[18]: M. Fürer. Faster integer multiplication. In Proceedings of the Thirty-Ninth ACM Symposium on Theory of Computing, STOC 2007, pages 57–66, New York, NY, USA, 2007. ACM Press.
[19]: M. Fürer. Faster integer multiplication. SIAM J. Comp., 39(3):979–1005, 2009.
[20]: M. Fürer. How fast can we multiply large integers on an actual computer? Technical report, http://arxiv.org/abs/1402.1811, 2014.
[21]: J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, 2nd edition, 2002.
[22]: F. Q. Gouvêa. -adic numbers. An introduction. Universitext. Springer-Verlag, Berlin, 1993.
[23]: T. Granlund et al. GMP, the GNU multiple precision arithmetic library. http://gmplib.org, 1991. Latest version 6.0.0 released in 2014.
[24]: D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. Technical report, HAL, 2014. http://hal.archives-ouvertes.fr.
[25]: D. R. Heath-Brown. Almost-primes in arithmetic progressions and short intervals. Math. Proc. Cambridge Philos. Soc., 83(3):357–375, 1978.
[26]: D. R. Heath-Brown. Zero-free regions for Dirichlet -functions, and the least prime in an arithmetic progression. Proc. London Math. Soc. (3), 64(2):265–338, 1992.
[27]: M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. Arch. Hist. Exact Sci., 34(3):265–277, 1985.
[28]: J. van der Hoeven. Journées Nationales de Calcul Formel (2011), volume 2 of Les cours du CIRM, chapter Calcul analytique. CEDRAM, 2011. Exp. No. 4, 85 pages, http://ccirm.cedram.org/ccirm-bin/fitem?id=CCIRM_2011__2_1_A4_0.
[29]: J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002. http://www.mathemagix.org.
[30]: A. Karatsuba and J. Ofman. Умножение многозначных чисел на автоматах. Doklady Akad. Nauk SSSR, 7:293–294, 1962. English translation in [31].
[31]: A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
[32]: D. E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, 1969.
[33]: D. E. Knuth. The art of computer programming, volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1998.
[34]: Yu. V. Linnik. On the least prime in an arithmetic progression I. The basic theorem. Rec. Math. (Mat. Sbornik) N.S., 15(57):139–178, 1944.
[35]: Yu. V. Linnik. On the least prime in an arithmetic progression II. The Deuring-Heilbronn phenomenon. Rec. Math. (Mat. Sbornik) N.S., 15(57):347–168, 1944.
[36]: R. E. Moore. Interval Analysis. Prentice Hall, Englewood Cliffs, N.J., 1966.
[37]: O. Neugebauer. The Exact Sciences in Antiquity. Brown Univ. Press, Providence, R.I., 1957.
[38]: C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[39]: J. M. Pollard. The fast Fourier transform in a finite field. Math. Comp., 25(114):365–374, 1971.
[40]: C. Pomerance. Recent developments in primality testing. Math. Intelligencer, 3(3):97–105, 1980/81.
[41]: C. M. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56(6):1107–1108, June 1968.
[42]: K. R. Rao, D. N. Kim, and J. J. Hwang. Fast Fourier Transform - Algorithms and Applications. Signals and Communication Technology. Springer-Verlag, 2010.
[43]: I. S. Reed and T. K. Truong. The use of finite fields to compute convolutions. IEEE Trans. Inform. Theory, IT-21:208–213, 1975.
[44]: M. C. Schmeling. Corps de transséries. PhD thesis, Université Paris-VII, France, 2001.
[45]: A. Schönhage. Multiplikation großer Zahlen. Computing, 1(3):182–196, 1966.
[46]: A. Schönhage. Storage modification machines. SIAM J. on Comp., 9:490–508, 1980.
[47]: A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[48]: I. Shparlinski. On finding primitive roots in finite fields. Theoret. Comput. Sci., 157(2):273–275, 1996.
[49]: D. E. Smith. History of Mathematics, volume 2. Dover, 1958.
[50]: A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.
[51]: A. L. Toom. О сложности схемы из функциональных элементов, реализующей умножение целых чисел. Doklady Akad. Nauk SSSR, 150:496–498, 1963. English translation in [50].
[52]: S. Wagstaff. Divisors of Mersenne numbers. Math. Comp., 40(161):385–397, 1983.
[53]: T. Xylouris. On the least prime in an arithmetic progression and estimates for the zeros of Dirichlet L-functions. Acta Arith., 1:65–91, 2011.

1.Introduction

1.1.Brief history and related work

1.2.Our contributions and outline of the paper

2.Survey of classical tools

2.1.Arrays and sorting

2.2.Discrete Fourier transforms

2.3.The Cooley–Tukey FFT

2.4.Fast Fourier multiplication

2.5.Bluestein's chirp transform

2.6.Kronecker substitution and segmentation

3.Fixed point computations and error bounds

3.1.Fixed point numbers

3.2.Basic arithmetic

3.3.Precomputing roots of unity

3.4.Error analysis for fast Fourier transforms

4.A simple and fast multiplication algorithm

5.Logarithmically slow recurrence inequalities

6.Even faster multiplication

7.An optimised variant of Fürer's algorithm

8.Fast multiplication using modular arithmetic

8.1.Sketch of the algorithm

8.2.Computing suitable and

9.Conjecturally faster multiplication

9.1.Mersenne primes

9.2.Crandall and Fagin's algorithm revisited

9.3.Bivariate Crandall–Fagin reduction

9.4.Conjecturally faster multiplication

Bibliography