
Let be a prime, and let denote the bit complexity of multiplying two polynomials in of degree less than . For large compared to , we establish the bound , where is the iterated logarithm. This is the first known Fürertype complexity bound for , and improves on the previously best known bound .

Given a ring , a basic problem in complexity theory is to find an upper bound for the cost of multiplying two polynomials in of degree less than . Several complexity models may be considered. In algebraic complexity models, such as the straightline program model [9, Chapter 4], we count the number of ring operations in , denoted . In this model, the best currently known bound was obtained by Cantor and Kaltofen [10]. More precisely, their algorithm performs additions and subtractions and multiplications in . Their bound may be viewed as an algebraic analogue of the Schönhage–Strassen result for integer multiplication [32] and generalises previous work by Schönhage [31]. These algorithms all rely on suitable incarnations of the fast Fourier transform (FFT) [12]. For details we refer the reader to [17, Chapter 8].
In this paper we are mainly interested in the case that is the finite field for some prime . The standard bit complexity model based on deterministic multitape Turing machines [26] is more realistic in this setting, as it takes into account the dependence on . We write for the bit complexity of multiplying two polynomials in of degree less than . Two basic approaches are known for obtaining good asymptotic bounds for .
If is not too large compared to , then one may use Kronecker substitution, which converts the problem to integer multiplication by packing the coefficients into large integers. Let denote the bit complexity of bit integer multiplication. Throughout the paper, we make the customary assumption that is increasing, and for convenience we define for , . According to our recent sharpening [18] of Fürer's bound [15], we may take
where denotes the iterated logarithm for , i.e.,
If , then Kronecker substitution yields (see section 2.6), so we have
However, Kronecker substitution leads to inferior bounds when is large compared to , due to coefficient growth in the lifted product in . In this situation the best existing method is the algebraic Schönhage–Strassen algorithm (i.e., the Cantor–Kaltofen algorithm). In the Turing model this yields the bound
The first term dominates for large enough , say for , and we get simply
Since Fürer's breakthrough result, there has been a gap between what is known for integer multiplication and for multiplication in . Namely, it has remained an open question whether the factor appearing in (1.3) can be improved to for some . It seems that Fürer's construction does not work in .
The main result of this paper is a new algorithm that completely closes the gap mentioned above.
The bound follows immediately via Kronecker substitution if . The bulk of the paper concerns the reverse case where is large compared to . For example, if , the bound becomes simply
As promised, this replaces the factor in (1.3) by .
The basic idea of the new algorithm is as follows. We first construct a special extension of , where is exponentially smaller than , and for which has many small divisors. These divisors are themselves exponentially smaller than , yet they are so numerous that their product is comparable to . We now convert the given multiplication problem in to multiplication in , by cutting up the polynomials into small pieces, and we then multiply in by using FFTs over . Applying the Cooley–Tukey method to the small divisors of , we decompose the FFTs into transforms of exponentially shorter lengths (not necessarily powers of two). We use Bluestein's method to convert each short transform to a convolution over , and then use Kronecker substitution (from bivariate to univariate polynomials) to convert this to multiplication in . These latter multiplications are then handled recursively. We continue the recursion until is comparable to , at which point we switch to using Kronecker substitution (from polynomials to integers).
In many respects this approach is very similar to the algorithm for integer multiplication introduced in [18], but there are various additional technical issues to address. We recommend [18] as a gentle introduction to the main ideas.
We also prove the following conditional result.
holds uniformly for all and all primes .
Of these conjectures, Conjecture 8.3 seems to be in reach for specialists in analytic number theory; Conjecture 8.5 is at least as hard as Artin's conjecture on primitive roots, but may be tractable under the generalised Riemann hypothesis (GRH); and Conjecture 8.1 is almost as strong as the Lenstra–Pomerance–Wagstaff conjecture on the distribution of Mersenne primes, and seems to be inaccessible at present.
The paper is structured as follows. In section 2, we start with a survey of relevant basic techniques: discrete Fourier transforms (DFTs) and FFTs, Bluestein's chirp transform, and Kronecker substitution. Most of this material is repeated from [18, Section 2] for the convenience of the reader; however, sections 2.5 and 2.6 differ substantially.
In section 3, we recall basic complexity results for arithmetic in finite fields. In particular, we consider the construction of irreducible polynomials in , algorithms for finding roots of unity in , and the cost of arithmetic in and .
In section 4, we show how to construct special extensions of whose multiplicative group has a large subgroup of highly smooth order, i.e., is divisible by many small integers. The main tool is [2, Theorem 3].
Section 5 gives complexity bounds for functions that satisfy recurrence inequalities involving postcompositions with “logarithmically slow” functions. The prototype of such an inequality is , where and are constants. The definitions and theorems are duplicated from [18, Section 5]; for the proofs, see [18].
To minimise the constant in the bound (for large relative to ), we need one more tool: in section 6, we present a polynomial analogue of the Crandall–Fagin convolution algorithm [13]. This allows us to convert a cyclic convolution over of length , where is arbitrary, into a cyclic convolution over of somewhat smaller length , where is prescribed and where . (We can still obtain without using this Crandall–Fagin analogue, but we do not know how to reach without it.)
Section 7 is devoted to the proof of Theorem 1.1. Section 8 gives the proof of Theorem 1.2, and discusses the three numbertheoretic hypotheses of that theorem.
The last section offers some final notes and suggested directions for generalisation. We first quickly dispense with the bit complexity of multiplication in for prime powers , and in for arbitrary integers . We then sketch some algebraic complexity bounds for polynomial multiplication over algebras and algebras, especially in the straightline program model. Our techniques also give rise to new strategies for polynomial evaluationinterpolation over . This may for instance be applied to the efficient multiplication of polynomial matrices over . Although we have not implemented any of our algorithms yet, we conclude by a few remarks on their expected efficiency in practice.
Notations. We use Hardy's notations for , and for and . The “softOh” notation means that (see [17, Chapter 25, Section 7] for details). The symbol denotes the set of nonnegative real numbers, and denotes . For a ring and , we write .
We will write . In expressions like or , we tacitly assume that the function is adjusted so as to take positive values for small primes such as .
Acknowledgments. We would like to thank Karim Belabas for directing us to the paper [2], and Igor Shparlinski for answering some analytic number theory questions.
This section recalls basic facts on Fourier transforms and related techniques used in subsequent sections. For more details and historical references we refer the reader to standard books on the subject such as [3, 9, 17, 30].
In the Turing model, we have available a fixed number of linear tapes. An array of bit elements is stored as a linear array of bits. We generally assume that the elements are ordered lexicographically by , though this is just an implementation detail.
What is significant from a complexity point of view is that occasionally we must switch representations, to access an array (say 2dimensional) by “rows” or by “columns”. In the Turing model, we may transpose an matrix of bit elements in time , using the algorithm of [7, Appendix]. Briefly, the idea is to split the matrix into two halves along the “short” dimension, and transpose each half recursively.
We will also require more complex rearrangements of data, for which we resort to sorting. Suppose that is a totally ordered set, whose elements are represented by bit strings of length , and suppose that we can compare elements of in time . Then an array of elements of may be sorted in time using merge sort [22], which can be implemented efficiently on a Turing machine.
Let be a commutative ring with identity and let . An element is said to be a principal th root of unity if and
(2.1) 
for all . In this case, we define the discrete Fourier transform (or DFT) of an tuple with respect to to be where
That is, is the evaluation of the polynomial at .
If is a principal th root of unity, then so is its inverse , and we have
Indeed, writing , the relation (2.1) implies that
where if and otherwise.
Remark
Let be a principal th root of unity and let where . Then is a principal th root of unity and is a principal th root of unity. Moreover, for any and , we have
If and are algorithms for computing DFTs of length and , we may use (2.2) to construct an algorithm for computing DFTs of length as follows.
For each , the sum inside the brackets corresponds to the th coefficient of a DFT of the tuple with respect to . Evaluating these inner DFTs requires calls to . Next, we multiply by the twiddle factors , at a cost of operations in . (Actually, fewer than multiplications are required, as some of the twiddle factors are equal to . This optimisation, while important in practice, has no asymptotic effect on the algorithms discussed in this paper.) Finally, for each , the outer sum corresponds to the th coefficient of a DFT of an tuple in with respect to . These outer DFTs require calls to .
Denoting by the number of ring operations needed to compute a DFT of length , and assuming that we have available a precomputed table of twiddle factors, we obtain
For a factorisation , this yields recursively
The corresponding algorithm is denoted . The operation is neither commutative nor associative; the above expression will always be taken to mean .
The above discussion requires several modifications in the Turing model. Assume that elements of are represented by bits.
First, for , we must add a rearrangement cost of to efficiently access the rows and columns for the recursive subtransforms (see section 2.1). For the general case , the total rearrangement cost is bounded by .
Second, we will sometimes use nonalgebraic algorithms to compute the subtransforms, so it may not make sense to express their cost in terms of . The relation (2.3) therefore becomes
where is the (Turing) cost of a transform of length over , and where is the cost of a single multiplication in .
Finally, we point out that requires access to a table of twiddle factors , ordered lexicographically by , for , . Assuming that we are given as input a precomputed table of the form , we must show how to extract the required twiddle factor table in the correct order. We first construct a list of triples , ordered by , in time ; then sort by in time (see section 2.1); then merge with the given root table to obtain a table , ordered by , in time ; and finally sort by in time . The total cost of the extraction is thus .
The corresponding cost for is determined as follows. Assuming that the table is given as input, we first extract the subtables of ()th roots of unity for in time . Extracting the twiddle factor table for the decomposition then costs ; the total over all is again .
Remark
Let be a principal th root of unity in and assume that is invertible in . Consider two polynomials and in . Let be the polynomial defined by
where the product of the DFTs is taken pointwise. By construction, we have , which means that for all . The product of and modulo also satisfies for all . Consequently, , , whence .
For polynomials with and , we thus obtain an algorithm for the computation of modulo using at most operations in . Modular products of this type are also called cyclic convolutions. If , then we may recover the product from its reduction modulo . This multiplication method is called FFT multiplication.
If one of the arguments (say ) is fixed and we want to compute many products (or cyclic convolutions) for different , then we may precompute , after which each new product can be computed using only operations in .
We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs — of arbitrary length, not necessarily a power of two — to polynomial multiplication [5], as follows.
Let be a principal th root of unity and consider an tuple . First consider the case that is odd. Let
Note that and are integers, and that
Let , and modulo . Then
In other words, the computation of a DFT of odd length reduces to a cyclic convolution product of the same length , together with additional operations in . Notice that the polynomial is fixed and independent of in this product.
Now suppose that is even. In this case we require that be invertible in , and that . Let , and put
Then
and
As above, let , and modulo . Then for we have
and similarly
Again, the DFT reduces to a convolution of length , together with additional operations in .
The only complication in the Turing model is the cost of extracting the , and in the correct order. For example, consider the in the case that is odd. Given as input a precomputed table , we may extract the in time by applying the strategy from section 2.3 to the pairs for . The other sequences are handled similarly. For the in the case that is even, we also need to perform additions in ; assuming that additions in have cost , this is already covered by the above bound.
Remark
Remark
Multiplication in may be reduced to multiplication in using the classical technique of Kronecker substitution [17, Corollary 8.27]. More precisely, let and suppose that we are given two polynomials of degree less than . Let be lifts of these polynomials, with coefficients and satisfying . Then for the product we have , and the coefficients of may be read off the integer product where . We deduce the coefficients of by dividing each by . This shows that . Using the assumption that is increasing, we obtain
If we also assume that , i.e., that the degree is not too large, then this becomes simply
A second type of Kronecker substitution reduces bivariate polynomial multiplication to the univariate case. Indeed, let and , and suppose that , where and . We may then recover from the univariate product in . Note that and have degree less than , so the cost of the bivariate product is .
The same method works for computing cyclic convolutions: to multiply , the same substitution leads to a product in . The cost is thus , where denotes the cost of a multiplication in .
Let be a prime and let . In this section we review basic results concerning arithmetic in and .
We assume throughout that is represented as for some irreducible monic polynomial of degree . Thus an element of is represented uniquely by its residue modulo , i.e., by a polynomial of degree less than .
Proof. See [33, Theorem 3.2].
The above complexity bound is very pessimistic in practice. Better complexity bounds are known if we allow randomised algorithms, or unproved hypotheses. For instance, assuming GRH, the bound reduces to [1].
We now consider the cost of arithmetic in , assuming that is given. Addition and subtraction in have cost . For multiplication we will always use the Schönhage–Strassen algorithm. Denote by the cost of multiplying polynomials in by this method, i.e., using the polynomial variant [31] for the polynomials themselves, followed by the integer version [32] to handle the coefficient multiplications. Then we have
Of course, we could use the new multiplication algorithm recursively for these products, but it turns out that Schönhage–Strassen is fast enough, and leads to a simpler complexity analysis in section 7. Let denote the cost of dividing a polynomial in by , returning the quotient and remainder. Using Newton's method [17, Chapter 9] we have . Thus elements of may be multiplied in time .
Let be a divisor of . To compute a DFT over of length , we must first have access to a primitive th root of unity in . In general it is very difficult to find a primitive root of , as it requires knowledge of the factorisation of . However, to find a primitive th root, Lemma 3.3 below shows that it is enough to know the factorisation of . The construction relies on the following existence result.
Proof. This is [34, Theorem 1.1].
Proof. Testing whether a given is a primitive th root of unity reduces to checking that and for every prime divisor of . According to Lemma 3.2, it suffices to apply this test to for running over all monic polynomials of degree , where is as in the lemma. The number of candidates is at most , and each candidate can be tested in time using binary exponentiation.
Of course, we can do much better if randomised algorithms are allowed, since is reasonably likely to be a primitive th root for randomly selected .
Finally, we consider polynomial multiplication over . Let , and let . Let be their natural lifts, i.e., of degree less than with respect to . The bivariate product may be computed using Kronecker substitution (section 2.6) in time . Writing , to recover we must divide each by . Denoting by the complexity of the original multiplication problem, we obtain
As in section 2.6, the same method may be used for cyclic convolutions. If denotes the cost of multiplication in , then we get
Remark
The aim of this section is to prove the following theorem, which allows us to construct “small” extensions of containing “many” roots of unity of “low” order. Recall that a positive integer is said to be smooth if all of its prime divisors are less than or equal to .
and a smooth integer , such that . Moreover, given and , we may compute and the prime factorisation of M in time .
The proof depends on a series of preparatory lemmas. For , define
For example, and . Note that is always squarefree and smooth. We are interested in finding for which is unusually large; that is, for which has many divisors such that happens to be prime. Most of the heavy lifting is done by the following remarkable result.
Proof. This is part of [2, Theorem 3]. (The threshold could of course be replaced by any fixed power of . It is stated this way in [2] because that paper is concerned with primality testing.)
The link between and is as follows.
Proof. We take if divides , and otherwise . In the former case we must have , so in both cases . To see that , consider any prime divisor of . Then , so by Fermat's little theorem.
Remark
Next we give a simple sieving algorithm that tabulates approximations of for all up to a prescribed bound.
Proof. Initialise a table of with for . Since , it suffices to set aside bits for each .
For each integer , perform the following steps. First test whether is prime, discarding it if not. Using trial division, the cost is , so overall. Now assume that is found to be prime. Using a fast algorithm for computing logarithms [8], compute an integer such that , where , in time . Scan through the table, replacing by for those divisible by , i.e., every th entry, in time . The total cost for each prime is , and there are primes, so the overall cost is . At the end, we divide each by , yielding the required approximations satisfying (4.1).
Proof of Theorem 4.1. We are given and as input. Applying Lemma 4.2 with , we find that for large enough there exists some with
and such that . Choose any positive and any . Since
we have
for large . Using Lemma 4.5, we may find one such in time , where .
Let be as in Lemma 4.3. Then divides , and
for large . We may compute the prime factorisation of by simply enumerating the primes , , and checking whether divides for each . This can be done in time .
The main multiplication algorithm depends on a reduction of a “long” DFT to many “short” DFTs. It is essential to have some control over the long and short transform lengths. The following result packages together the prime divisors of the abovementioned , to obtain a long transform length near a given target , and short transform lengths near a given target .
Given , , , and the prime factorisation of , we may compute such (and their factorisations) in time .
Proof. Let be the prime decomposition of . Taking minimal with , we ensure that (a) and (b) are satisfied. At this stage we have for all . As long as the tuple contains an entry with , we pick the two smallest entries and and replace them by a single entry . Clearly, this does not alter the product of all entries, so (a) and (b) continue to hold. Furthermore, as long as there exist two entries and with and , new entries will always be smaller than . Only at the very last step of the loop, the second smallest entry might be larger than . In that case, the product of the entry with with is still bounded by . This shows that condition (c) is satisfied at the end of the loop.
Determining requires at most multiplications of integers less than . There are at most iterations of the replacement loop. Each iteration must scan through at most integers of bit size and perform one multiplication on such integers.
Let be a smooth increasing function, for some . We say that is an iterator of if is increasing and if
for all sufficiently large .
For instance, the standard iterated logarithm defined in (1.2) is an iterator of . An analogous iterator may be defined for any smooth increasing function for which there exists some such that for all . Indeed, in that case,
is welldefined and satisfies (5.1) for all . It will sometimes be convenient to increase so that is satisfied on the whole domain of .
We say that is logarithmically slow if there exists an such that
for . For example, the functions , , and are logarithmically slow, with respectively.
In this paper, the main role played by logarithmically slow functions is to measure size reduction in multiplication algorithms. In other words, multiplication of objects of size will be reduced to multiplication of objects of size , where for some logarithmically slow function . The following result asserts that, from the point of view of iterators, such functions are more or less interchangeable with .
The next result is our main tool for converting recurrence inequalities into actual asymptotic bounds for solutions.
Let and . Let , and let be any function satisfying the following recurrence. First, for all , . Second, for all , , there exist with , and weights with , such that
Then we have for all , .
Consider the problem of computing the product of two bit integers modulo . The naive approach is to compute their ordinary bit product and then reduce modulo . The reduction cost is negligible because of the special form of . If is divisible by a high power of two, one can save a factor of two by using the fact that FFTs naturally compute cyclic convolutions.
An ingenious algorithm of Crandall and Fagin [13] allows for the gain of this precious factor of two for arbitrary . Their algorithm is routinely used in the extreme case where is prime, in the largescale GIMPS search for Mersenne primes (see http://www.mersenne.org/).
A variant of the Crandall–Fagin algorithm was a key ingredient of our algorithm for integer multiplication that conjecturally achieves the bound [18, Section 9]. In this section we present yet another variant, for computing products in .
Let , and be positive integers with and . Our aim is to reduce multiplication in to multiplication in . In the applications in subsequent sections, will be a divisor of with many small factors, so that we can multiply efficiently in using FFTs over .
Suppose that is represented as , for some irreducible of degree . The reduction relies on the existence of an element such that . (This is the analogue of the real th root of appearing in the usual Crandall–Fagin algorithm, which was originally formulated over .) It is easy to see that such may not always exist; for example, there is no cube root of in . Nevertheless, the next result shows that if is large enough, then we can always find some modulus for which a suitable exists.
Proof. We first observe that if is a primitive root, then cannot lie in a proper subfield of . (This property is independent of .) Indeed, if for some proper divisor of , then every th power in lies in . This contradicts the fact that the number of th powers in is at least .
Now we give an algorithm for computing and . We start by using Lemma 3.1 to compute an irreducible of degree in time , and we temporarily represent as .
Using Lemma 3.2, we may construct a list of candidates for primitive roots of (see the proof of Lemma 3.3). For each candidate , we compute and its powers . If these are linearly dependent over then belongs to a proper subfield (and so, as shown above, is not actually a primitive root). Thus we must eventually encounter some for which they are linearly independent. We take to be the minimal polynomial of this , which may be computed by finding a linear relation among . Let be the isomorphism that sends to . The matrix of with respect to the standard bases and is given by the coefficients of . The inverse of this matrix yields the matrix of . We then set , so that .
For each candidate , the cost of computing the necessary powers of is , and the various linear algebra steps require time using classical matrix arithmetic.
In the remainder of this section, we fix some and as in Proposition 6.1, and assume that is represented as . Suppose that we wish to compute the product of . The presentation here closely follows that of [18, Section 9.2]. We decompose and as
(6.1) 
where
Notice that takes only two possible values: or .
For , let
so that . For any , define as follows. Choose so that lies in the interval , and put
From (6.2), we have
Since the left hand side lies in the interval , this shows that . Now, since , we have
where
Since and similarly for , we have . Note that we may recover from in time , by a standard overlapadd procedure (provided that ).
Now, regarding and as elements of by sending to , define polynomials by and for , and let be their (cyclic) product. Then
coincides with the reinterpretation of as an element of . Moreover, we may recover unambiguously from , as and . Altogether, this shows how to reduce multiplication in to multiplication in .
Remark
Consider the problem of computing products with , i.e., products with one fixed operand. Denote the cost of this operation by . Our algorithm for this problem will perform forward DFTs and inverse DFTs, so it is convenient to introduce the normalisation
This is welldefined since clearly . Roughly speaking, may be thought of as the notional cost of a single DFT.
The problem of multiplying two polynomials in of degree less than may be reduced to the above problem by using zeropadding, i.e., taking and . Since , we obtain . Thus it suffices to obtain a good bound for .
The next theorem gives the core of the new algorithm for the case that is large relative to .
uniformly for .
Proof. We wish to compute products with , for some sufficiently large . We assume throughout that .
Choose parameters. Using Theorem 4.1, we obtain integers and with
and so that , , and is smooth. We choose long and short target transform lengths and . For large enough we then have , so we may apply Theorem 4.6. This yields smooth integers , with known factorisations, such that divides and lies in the range , and such that for all . Finally we set , so that also divides , and we put . All of these parameters may be computed in time .
For large , we observe that the following estimates hold. First, we have
Since , it follows that
and also that . We have , so
To estimate , note that and , so
Since we have
and as , we similarly obtain
Crandall–Fagin reduction. Let us check that the hypotheses of section 6.1 are satisfied, to enable the reduction to multiplication in . We certainly have and . For Proposition 6.1, observe that , so for large . Thus we obtain an irreducible of degree , and an element such that , in time . Each multiplication in has cost (see section 3).
Computing the sequences and costs , and computing the sequence costs . The initial splitting and final overlapadd phases require time , and the multiplications by and cost .
Long transforms. The factorisation of is known, and divides , so by Lemma 3.3 we may compute a primitive th root of unity in time .
We will multiply in by using DFTs with respect to . The table of roots may be computed in time . In a moment, we will describe an algorithm for computing a “short” DFT of length with respect to ; we then use the algorithm for the main transform of length (see section 2.3). The corresponding twiddle factor tables may be extracted in time .
Let denote the complexity of , and for let denote the cost of performing independent DFTs of length using . Then by (2.4) we have
The last term is simply .
Bluestein conversion. We now begin constructing , assuming that independent transforms are sought. We first use Bluestein's algorithm (section 2.5) to convert each DFT of length to a multiplication in . We must check that is invertible in if is even; indeed, if is even, then so is , so . The Bluestein conversion contributes to the cost of .
We must also compute a suitable table of roots, once at the top level. We first extract the table from the top level table in time , and then sort them into the correct order (and perform any necessary additions) in time . Over all the cost is .
Kronecker substitution. We finally convert each multiplication in to a multiplication in using Kronecker substitution (see section 2.6). The latter multiplications have cost , since one argument is fixed. After the multiplications, we must also perform divisions by to recover the results in , at cost . Consolidating the estimates for the Bluestein conversion and Kronecker substitution, we have
For , this becomes
Conclusion. Summing over yields
Since
and
this becomes
Let , so that . Since
and , we get
To compute the desired products, we must execute forward transforms, and inverse transforms. Each product also requires pointwise multiplications in and multiplications by . These have cost , which is absorbed by the term. Thus we obtain
Dividing by , taking suprema over , and using (7.2) and (7.3), yields the bound (7.1).
Finally, since
we have and hence for some constant and all large . We may then take .
Now we may prove the main theorem announced in the introduction.
Proof of Theorem 1.1. For define
It suffices to prove that , uniformly in and .
Let and be as in Theorem 7.1. Increasing if necessary, we may assume that for , and that .
Let for each . We will consider the regions and separately. First consider the case . There are only finitely many primes , so we may assume that and that . In this region we use Kronecker substitution: by (2.5) and (1.1) we have
and since
we get uniformly for . In fact, this even shows that uniformly for .
Now consider the case . Here we invoke Theorem 7.1 to obtain absolute constants such that for every , there exist and such that
Set . Applying Proposition 5.3 with , , , we find that uniformly for . Since and , we conclude that uniformly for .
Recall that in [18] we established the bound with for the complexity of integer multiplication. We also proved that this can be improved to , assuming the following slight weakening of the Lenstra–Pomerance–Wagstaff conjecture on the distribution of Mersenne primes [28, 35] (see [18, Section 9] for further discussion).
The source of the (conjectural) speedup is as follows. The algorithm of [18, Section 6] computes DFTs over , and so we encounter the “short product obstruction”. Namely, to compute the product of two real numbers with significant bits using FFT algorithms, we are forced to compute a full product of two bit integers, and then truncate the result to bits. To achieve , we replaced the coefficient ring by the finite field , where is a Mersenne prime. Multiplication modulo is a “cyclic product” rather than a short product, and this saves a factor of two at each recursion level.
The aim of this section is to outline a credible strategy for achieving the same improvement, from to , in the context of multiplication in , at least under certain plausible numbertheoretic hypotheses.
Consider the problem of multiplying polynomials in of degree less than . In the Kronecker substitution region, i.e., for , we can of course achieve if we assume Conjecture 8.1.
Now consider the region where is much larger than . In the algorithm of section 7, we reduced the multiplication problem to computing products in , where for some monic irreducible of degree , and where is a “short” transform length. These multiplications were in turn converted to multiplications in via Kronecker substitution. It is exactly this factor of two in zeropadding that we wish to avoid.
Taking our cue from the integer case, we observe that if the modulus is of a particularly special form, then this factor of two can be eliminated. Indeed, suppose that
Then is isomorphic to , which is a quotient of . Assuming further that , the latter ring is isomorphic to . In other words, we can reduce a cyclic convolution over of length to a cyclic convolution over of length , rather than length . This saves a factor of two at this recursion level; if we can manage something similar at every recursion level, we reduce from to .
The essential question is therefore how to choose so that is irreducible over . The following lemma gives a simple numbertheoretic characterisation of such .
Proof. If is a nontrivial factor of , then is divisible by , so has the nontrivial factor . On the other hand, suppose that is prime. If , then is not irreducible; otherwise, the number of irreducible factors of over is exactly , where is the order of in [36, Theorem 2.13 and Proposition 2.14] (this last statement is sometimes called the “cyclotomic reciprocity law”).
A first attempt to reach might run as follows. First choose as in the proof of Theorem 7.1, so that has many roots of unity of smooth order. Then use the same multiplication algorithm as before, but now working over , where is the smallest positive multiple of such that is prime and such that is a primitive root modulo .
Unfortunately this plan does not quite work, for reasons that we now explain.
First, if , there may be a trivial obstruction to the existence of suitable . For example, if and , then for any we have by quadratic reciprocity, so cannot be a primitive root modulo . The only way to avoid this seems to be to insist that not be divisible by in the first place. Therefore we propose the following strengthening of Lemma 4.2.
There exist computable absolute constants such that for any integer and any prime we have
It seems likely that this conjecture should be accessible to specialists in analytic number theory. Experimentally, the “missing” factor does not seem to have much effect on the propensity of to have many divisors such that is prime. We could not find a quick way to prove the conjecture directly from Lemma 4.2.
The second problem is more serious. A special case of Artin's conjecture on primitive roots (see [25] for a survey) asserts that for any prime , there are infinitely many primes for which is a primitive root modulo . Unfortunately, there is not even a single prime for which Artin's conjecture is known to hold! Therefore, in general we cannot prove existence of a suitable .
However, we do have the following result of Lenstra, which states that the only obstruction to the existence of is the trivial one noted above, provided we are willing to accept GRH.
Proof. This is a special case of [23, Theorem 8.3]. (For a description of exactly which number fields are supposed to satisfy GRH for this result to hold, see [23, Theorem 3.1].)
The third problem is that existence of is not enough; we also need a bound on its size, as a function of and . The literature provides little guidance on this question. Under the hypotheses of Lemma 8.4 (including GRH), it is known that the number of primes such that and for which is a primitive root modulo grows as
(8.1) 
for some positive (see for example [24, Theorem 2]). Both and the implied big constant depend on and . While the value of is reasonably well understood, we could not find in the literature a similarly precise description of the implied constant, so we have been unable to derive a bound for .
In the interests of getting our conjectural algorithm off the ground, we make the following guess. Define to be the smallest prime such that and such that is a primitive root modulo (assuming that such exists).
We have deliberately given an absurdly weak formulation of this conjecture, so as to provide the largest possible target for analytic number theorists. For example, it would be enough to prove that for all . We suspect that the conjecture is accessible under GRH; presumably a proof would require analysing the implied constant in (8.1).
Remark
Proof of Theorem 1.2. The proof is similar to that of Theorem 1.1. We will only give a sketch, highlighting the main differences.
In the region , we use Kronecker substitution together with [18, Theorem 2], which states that under Conjecture 8.1.
Now assume that . We will pursue a strategy similar to Theorem 7.1, recursing from down to an exponentially smaller . Assume that we wish to compute products in with one fixed operand.
Using an appropriate modification of Theorem 4.1, in which the use of Lemma 4.2 is replaced by Conjecture 8.3, we find integers and such that , (or if ), , , and is smooth.
Now we wish to apply Conjecture 8.5 to construct suitable . However, that conjecture requires that . If we have instead , we simply use the algorithm of Theorem 7.1. This yields the expansion factor , but only for recursion levels (depending on ), since . So it is permissible to assume that , losing only a constant factor in the main complexity bound for the last few recursion levels.
Applying Conjecture 8.5 and Lemma 8.2, we obtain a positive multiple of , such that is prime, , and is irreducible in . We will take as our model for ; note that .
Notice that for some logarithmically slow function Indeed, there exists and such that and for large . Increasing if necessary, we get , the latter being a logarithmically slow function of .
In the proof of Theorem 7.1, we selected the transform length first, and then chose to finetune the total bit size. Here we have less control over , so we must use a different strategy. Choose target long and short transform lengths and . Applying Theorem 4.6, we obtain smooth integers such that divides and for each , and such that for large . Let , and let be the smallest prime greater than and different to . By [4] we have , so .
We now apply the Crandall–Fagin algorithm to reduce multiplication in to multiplication in . The prerequisite is satisfied (since is even). Instead of using Proposition 6.1 to construct satisfying , we may simply take . The required modular inverse exists, as and are distinct primes, and since is smooth.
We now take advantage of the isomorphism , where . We multiply in by using DFTs over : one DFT of length over reduces to DFTs of length over . The latter are handled by decomposing into short transforms of length , which are in turn converted to multiplications in via Bluestein's algorithm. Finally – and this is where all the hard work pays off – each such multiplication reduces to a multiplication in , since . The multiplications in itself are handled using the algebraic Schönhage–Strassen algorithm.
We omit the rest of the complexity argument, which is essentially the same as that of Theorem 7.1 and Theorem 1.1. We mention only that was chosen sufficiently large that the multiplications in and in make a negligible contribution overall.
In this section we outline some directions along which the results in this paper can be extended. We also provide some hints concerning the practical usefulness of the new ideas. Our treatment is more sketchy and we plan to provide more details in a forthcoming paper.
Recall that denotes the cost of multiplying polynomials in of degree less than , where we assume that some model for has been fixed in advance.
uniformly for all and all primes . Assuming Conjectures 8.1, 8.3 and 8.5, we may replace by .
Indeed, we saw in section 3 that
where denotes the cost of dividing a polynomial of degree less than by . Having established the bound , it is now permissible to assume that is increasing, so the usual argument for Newton iteration shows that . Using again that is increasing, we obtain .
For any prime and any integer , denote by the bit complexity of multiplying polynomials in .
uniformly for all and all primes . Assuming Conjectures 8.1, 8.3 and 8.5, we may replace by .
For we may simply use Kronecker substitution. For we must modify the algorithm of section 7. Recall that in that algorithm we reduced multiplication in to multiplication , where is a transform length dividing . Our task is to define a ring , analogous to , so that multiplication in can be reduced to multiplication in .
Let be the natural projection , and write also for the corresponding map . Let be any monic irreducible polynomial of degree , and let be an arbitrary lift via , also monic of degree . Then is irreducible, and we will take .
Any primitive th root of unity has a unique lift to a principal th root of unity . (This can be seen, for example, by observing that must be the image in of a Teichmüller lift of in the adic field whose residue field is .) Given , we may compute efficiently using fast Newton lifting [11, Section 12.3].
Moreover, if we choose and as in Proposition 6.1, so that , then fast Newton lifting can also be used to obtain such that in .
For any , denote by the bit complexity of multiplying polynomials in .
We use the isomorphism , where is the prime decomposition of . The cost of converting between the and representations is [17, Section 10.3]. By Theorem 9.2 we get
The first term dominates if . If we may simply use Kronecker substitution.
Until now we have considered only complexity bounds in the Turing model. The new techniques also lead to improved bounds in algebraic complexity models. In what follows, is always a commutative ring with identity.
For the simplest case, first assume that is an algebra for some prime . In the straightline program model [9, Chapter 4], we count the number of additions and subtractions in , the number of scalar multiplications (multiplications by elements of ), and the number of nonscalar multiplications (multiplications in ). As pointed out in the introduction, the best known bound for the total complexity was previously , which was achieved by the Cantor–Kaltofen algorithm.
The idea of the proof is to use the same algorithm as in section 7, but instead of switching to Kronecker substitution when we reach , we simply recurse all the way down to . The role of the extension is played by , where is monic and irreducible of degree (here we have used the fact that may be viewed as a subring of , since contains an identity element and hence a copy of ). Thus we reduce multiplication in to multiplication in . The latter multiplication may be handled by DFTs over , since any primitive th root of unity in corresponds naturally to a principal th root of unity in .
The bound covers the cost of the DFTs, which are accomplished entirely using additions, subtractions and scalar multiplications. Nonscalar multiplications are needed only for the pointwise multiplication step. To explain the bound, we observe that at each recursion level the total “data size” grows by a factor of four: one factor of two arises from the Crandall–Fagin splitting, and another factor of two from the Kronecker substitution.
Note that in the straightline program model, we give a different algorithm for each . Thus all precomputed objects, such as the defining polynomial and the required roots of unity, are obtained free of cost (they are built directly into the structure of the algorithm). The uniformity in follows from the fact that the bounds for and in Theorem 4.1 are independent of , although of course the algorithm will be different for each . For each fixed , we may use the same straightline program for all algebras .
Assuming Conjectures 8.3 and 8.5, the bounds may be improved to respectively and . However, in this case we lose uniformity in , due to the issue raised in Remark 8.6.
Theorem 9.4 can be generalised to algebras, along the same lines as section 9.2. It is also possible to handle algebras for any integer , i.e., rings of finite characteristic, but we cannot proceed exactly as in section 9.3, because the straightline model has no provision for a “reduction modulo ” operation. Instead, we use a device introduced in [10]. Suppose that and that is now a algebra. For each we may construct a straightline program that takes as input polynomials of degree less than , where is any algebra, and computes . By replacing every constant in by a compatible constant in , we obtain a straightline program that takes as input of degree less than , and computes such that . By the Chinese remainder theorem we may choose such that and for each . The linear combination is then equal to in . We conclude that we may multiply polynomials in using additions, subtractions and scalar multiplications (by elements of ), and nonscalar multiplications (i.e., multiplications in ). These bounds are not uniform in , but for each they are uniform over all algebras.
If we wish to take into account the cost of precomputations, and give a single algorithm that works uniformly for all and , we may use a more refined complexity model such as the BSS model [6]. In this model, a machine over is, roughly speaking, a Turing machine in which the tape cells hold elements of . Actually, we need a multitape version of the model described in [6]. The machine can perform arithmetic operations on elements of in unit time, but can also manipulate data such as index variables in the same way as a Turing machine, and must deal with data layout in the same way as a Turing machine. In this model we obtain similar bounds to Theorem 9.4, but without uniformity in . Alternatively, we could obtain bounds uniform in if we added extra terms to account for the precomputations.
Another point of view in Theorem 9.4 is that we have described a new evaluationinterpolation strategy for polynomials over an algebra . We refer the reader to [21, Sections 2.1–2.4] for classical examples of evaluationinterpolation schemes, and [16] for algorithms specific to finite fields. Such schemes are characterised by two quantities: the evaluation/interpolation complexity and the number of evaluation points. The new algorithms yield the bounds and , and these can be used to prove complexity bounds for problems more general than polynomial multiplication. For example, we can multiply matrices with entries in using ring operations, where is an exponent of matrix multiplication. One of the main advantages of our algorithms is that is almost linear, contrary to synthetic FFT methods [10, 31] derived from Schönhage–Strassen multiplication [32], which achieve only .
In the setting of bilinear complexity [9, Chapter 14], the new algorithms do not improve asymptotically on the best known bounds. For example, it is known [27] that for any there exist linear maps and polynomials for , with , such that for all . The new method yields the asymptotically inferior bound . Nevertheless, these bounds are asymptotic and do not take into account that our new algorithm would rather work over (say) instead of . In practice, the new method might therefore outperform the bilinear algorithms from [27].
The new complexity bounds have a quite theoretic flavour: in practice, slowly growing functions such as and really behave as constants. Furthermore, for concrete applications, it would be more relevant to redefine as
With this definition, we observe that for all feasible values of in the foreseeable future. Nevertheless, we think that variants of the ideas in this paper could still be useful in practice.
Let us briefly sketch how an efficient multiplication algorithm over could be implemented. Various modern processors offer a special instruction for the multiplication of two polynomials in . The practical translation of the results from section 4 is to work over the field where . This has two advantages. First, if we represent elements of by their residues modulo , i.e., using a redundant representation, then each multiplication in requires a single hardware multiply instruction, followed by simple shift and XOR instructions. Second, there exist primitive roots of unity of high smooth order .
Transforms of small prime lengths over can be implemented using specialised codelets, similar to those implemented in FFTW3 [14]. Roughly speaking, the constant factor of a radix DFT is . In our case, , so specialised codelets should remain reasonably efficient for . Consequently, through the mere use of such small radices, we obtain efficient algorithms for computing DFTs of lengths dividing . To handle larger primes, we may use any algorithm for converting DFTs to convolutions. Rather than use Bluestein's algorithm as in the theoretical portion of the paper, it is more economical in this case to use Rader's algorithm, which reduces a DFT of length to a cyclic convolution of length , together with additions in (see also Remark 2.4).
More precisely, consider the remaining transform lengths , , , , and . We have , , , , and , so in each case, the DFT reduces to a twodimensional convolution of size , where is a divisor of and . Such a convolution in turn reduces to DFTs of length together with convolutions of length . The latter cannot be handled using DFTs over , since there are no roots of unity of order , or , but in any case they are all very short convolutions and could be performed directly, or by using Karatsuba's algorithm.
Transforms of other lengths dividing may be reduced to these cases via the Cooley–Tukey algorithm. Finally, to handle arbitrary input lengths efficiently, we may use a mixedradix generalisation of the truncated Fourier transform [19, 20].
We also reemphasise the fact that DFTs of this kind can be used as an evaluationinterpolation technique. This approach is for instance attractive for the multiplication of polynomial matrices over .
We have not yet implemented any of these algorithms. We expect that much finetuning will be necessary to make them most effective in practice. We intend to report on this issue in a future work.
L. M. Adleman and H. W. Lenstra Jr. Finding irreducible polynomials over finite fields. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, STOC '86, pages 350–355. ACM Press, 1986.
L. M. Adleman, C. Pomerance, and R. S. Rumely. On distinguishing prime numbers from composite numbers. Annals of Math., 117:173–206, 1983.
A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The design and analysis of computer algorithms. AddisonWesley, 1974.
R. C. Baker, G. Harman, and J. Pintz. The difference between consecutive primes. II. Proc. London Math. Soc. (3), 83(3):532–562, 2001.
L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and real computation. SpringerVerlag, 1998.
A. Bostan, P. Gaudry, and É. Schost. Linear recurrences with polynomial coefficients and application to integer factorization and CartierManin operator. SIAM J. Comput., 36:1777–1806, 2007.
R. P. Brent. Fast multipleprecision evaluation of elementary functions. J. Assoc. Comput. Mach., 23(2):242–251, 1976.
P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. SpringerVerlag, 1997.
D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Infor., 28:693–701, 1991.
H. Cohen, G. Frey, R. Avanzi, Ch. Doche, T. Lange, K. Nguyen, and F. Vercauteren, editors. Handbook of elliptic and hyperelliptic curve cryptography. Discrete Mathematics and its Applications. Chapman & Hall/CRC, Boca Raton, FL, 2006.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
R. Crandall and B. Fagin. Discrete weighted transforms and largeinteger arithmetic. Math. Comp., 62(205):305–324, 1994.
M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93(2):216–231, 2005.
M. Fürer. Faster integer multiplication. In Proceedings of the ThirtyNinth ACM Symposium on Theory of Computing, STOC 2007, pages 57–66, New York, NY, USA, 2007. ACM Press.
S. Gao and T. Mateer. Additive fast Fourier transforms over finite fields. IEEE Trans. Inform. Theory, 56(12):6265–6272, 2010.
J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, 2nd edition, 2002.
D. Harvey, J. van der Hoeven, and G. Lecerf. Even faster integer multiplication. Technical report, HAL, 2014. http://hal.archivesouvertes.fr.
J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290–296, Univ. of Cantabria, Santander, Spain, July 4–7 2004.
J. van der Hoeven. Notes on the Truncated Fourier Transform. Technical Report 20055, Université ParisSud, Orsay, France, 2005.
J. van der Hoeven. Newton's method and FFT trading. J. Symbolic Comput., 45(8):857–878, 2010.
D. E. Knuth. The art of computer programming, volume 3: Sorting and Searching. AddisonWesley, Reading, MA, 1998.
H. W. Lenstra, Jr. On Artin's conjecture and Euclid's algorithm in global fields. Invent. Math., 42:201–224, 1977.
P. Moree. On primes in arithmetic progression having a prescribed primitive root. J. Number Theory, 78(1):85–98, 1999.
P. Moree. Artin's primitive root conjecture—a survey. Integers, 12(6):1305–1416, 2012.
C. H. Papadimitriou. Computational Complexity. AddisonWesley, 1994.
J. Pieltant and H. Randriam. New uniform and asymptotic upper bounds on the tensor rank of multiplication in extensions of finite fields. Math. Comp., 2014. To appear.
C. Pomerance. Recent developments in primality testing. Math. Intelligencer, 3(3):97–105, 1980/81.
C. M. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56(6):1107–1108, June 1968.
K. R. Rao, D. N. Kim, and J. J. Hwang. Fast Fourier Transform  Algorithms and Applications. Signals and Communication Technology. SpringerVerlag, 2010.
A. Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. Acta Infor., 7:395–398, 1977.
A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
V. Shoup. New algorithms for finding irreducible polynomials over finite fields. Math. Comp., 189:435–447, 1990.
V. Shoup. Searching for primitive roots in finite fields. Math. Comp., 58:369–380, 1992.
S. Wagstaff. Divisors of Mersenne numbers. Math. Comp., 40(161):385–397, 1983.
L. C. Washington. Introduction to cyclotomic fields, volume 83 of Graduate Texts in Mathematics. SpringerVerlag, New York, second edition, 1997.