Faster polynomial multiplicationover finite fields

Given a ring

, a basic problem in complexity theory is to find an upper bound for the cost of multiplying two polynomials in

of degree less than

. Several complexity models may be considered. In algebraic complexity models, such as the straight-line program model [9, Chapter 4], we count the number of ring operations in

, denoted

. In this model, the best currently known bound

was obtained by Cantor and Kaltofen [10]. More precisely, their algorithm performs

additions and subtractions and

multiplications in

. Their bound may be viewed as an algebraic analogue of the Schönhage–Strassen result for integer multiplication [32] and generalises previous work by Schönhage [31]. These algorithms all rely on suitable incarnations of the fast Fourier transform (FFT) [12]. For details we refer the reader to [17, Chapter 8].

In this paper we are mainly interested in the case that

is the finite field

for some prime

. The standard bit complexity model based on deterministic multitape Turing machines [26] is more realistic in this setting, as it takes into account the dependence on

. We write

for the bit complexity of multiplying two polynomials in

of degree less than

. Two basic approaches are known for obtaining good asymptotic bounds for

is not too large compared to

, then one may use Kronecker substitution, which converts the problem to integer multiplication by packing the coefficients into large integers. Let

denote the bit complexity of

-bit integer multiplication. Throughout the paper, we make the customary assumption that

is increasing, and for convenience we define

for

. According to our recent sharpening [18] of Fürer's bound [15], we may take

However, Kronecker substitution leads to inferior bounds when

is large compared to

, due to coefficient growth in the lifted product in

. In this situation the best existing method is the algebraic Schönhage–Strassen algorithm (i.e., the Cantor–Kaltofen algorithm). In the Turing model this yields the bound

Since Fürer's breakthrough result, there has been a gap between what is known for integer multiplication and for multiplication in

. Namely, it has remained an open question whether the factor

appearing in (1.3) can be improved to

for some

. It seems that Fürer's construction does not work in

1.1.Our contributions

The main result of this paper is a new algorithm that completely closes the gap mentioned above.

The bound follows immediately via Kronecker substitution if

. The bulk of the paper concerns the reverse case where

is large compared to

. For example, if

, the bound becomes simply

The basic idea of the new algorithm is as follows. We first construct a special extension

, where

is exponentially smaller than

, and for which

has many small divisors. These divisors are themselves exponentially smaller than

, yet they are so numerous that their product is comparable to

. We now convert the given multiplication problem in

to multiplication in

, by cutting up the polynomials into small pieces, and we then multiply in

by using FFTs over

. Applying the Cooley–Tukey method to the small divisors of

, we decompose the FFTs into transforms of exponentially shorter lengths (not necessarily powers of two). We use Bluestein's method to convert each short transform to a convolution over

, and then use Kronecker substitution (from bivariate to univariate polynomials) to convert this to multiplication in

. These latter multiplications are then handled recursively. We continue the recursion until

is comparable to

, at which point we switch to using Kronecker substitution (from polynomials to integers).

In many respects this approach is very similar to the algorithm for integer multiplication introduced in [18], but there are various additional technical issues to address. We recommend [18] as a gentle introduction to the main ideas.

Of these conjectures, Conjecture 8.3 seems to be in reach for specialists in analytic number theory; Conjecture 8.5 is at least as hard as Artin's conjecture on primitive roots, but may be tractable under the generalised Riemann hypothesis (GRH); and Conjecture 8.1 is almost as strong as the Lenstra–Pomerance–Wagstaff conjecture on the distribution of Mersenne primes, and seems to be inaccessible at present.

1.2.Outline of the paper

The paper is structured as follows. In section 2, we start with a survey of relevant basic techniques: discrete Fourier transforms (DFTs) and FFTs, Bluestein's chirp transform, and Kronecker substitution. Most of this material is repeated from [18, Section 2] for the convenience of the reader; however, sections 2.5 and 2.6 differ substantially.

In section 3, we recall basic complexity results for arithmetic in finite fields. In particular, we consider the construction of irreducible polynomials in

, algorithms for finding roots of unity in

, and the cost of arithmetic in

and

In section 4, we show how to construct special extensions of

whose multiplicative group has a large subgroup of highly smooth order, i.e., is divisible by many small integers. The main tool is [2, Theorem 3].

Section 5 gives complexity bounds for functions that satisfy recurrence inequalities involving postcompositions with “logarithmically slow” functions. The prototype of such an inequality is

, where

and

are constants. The definitions and theorems are duplicated from [18, Section 5]; for the proofs, see [18].

To minimise the constant

in the bound

(for

large relative to

), we need one more tool: in section 6, we present a polynomial analogue of the Crandall–Fagin convolution algorithm [13]. This allows us to convert a cyclic convolution over

of length

, where

is arbitrary, into a cyclic convolution over

of somewhat smaller length

, where

is prescribed and where

. (We can still obtain

without using this Crandall–Fagin analogue, but we do not know how to reach

without it.)

Section 7 is devoted to the proof of Theorem 1.1. Section 8 gives the proof of Theorem 1.2, and discusses the three number-theoretic hypotheses of that theorem.

The last section offers some final notes and suggested directions for generalisation. We first quickly dispense with the bit complexity of multiplication in

for prime powers

, and in

for arbitrary integers

. We then sketch some algebraic complexity bounds for polynomial multiplication over

-algebras and

-algebras, especially in the straight-line program model. Our techniques also give rise to new strategies for polynomial evaluation-interpolation over

. This may for instance be applied to the efficient multiplication of polynomial matrices over

. Although we have not implemented any of our algorithms yet, we conclude by a few remarks on their expected efficiency in practice.

Notations. We use Hardy's notations

for

, and

for

and

. The “soft-Oh” notation

means that

(see [17, Chapter 25, Section 7] for details). The symbol

denotes the set of non-negative real numbers, and

denotes

. For a ring

and

, we write

We will write

. In expressions like

, we tacitly assume that the function is adjusted so as to take positive values for small primes such as

Acknowledgments. We would like to thank Karim Belabas for directing us to the paper [2], and Igor Shparlinski for answering some analytic number theory questions.

2.Survey of classical tools

This section recalls basic facts on Fourier transforms and related techniques used in subsequent sections. For more details and historical references we refer the reader to standard books on the subject such as [3, 9, 17, 30].

2.1.Arrays and sorting

In the Turing model, we have available a fixed number of linear tapes. An

array

-bit elements is stored as a linear array of

bits. We generally assume that the elements are ordered lexicographically by

, though this is just an implementation detail.

What is significant from a complexity point of view is that occasionally we must switch representations, to access an array (say 2-dimensional) by “rows” or by “columns”. In the Turing model, we may transpose an

matrix of

-bit elements in time

, using the algorithm of [7, Appendix]. Briefly, the idea is to split the matrix into two halves along the “short” dimension, and transpose each half recursively.

We will also require more complex rearrangements of data, for which we resort to sorting. Suppose that

is a totally ordered set, whose elements are represented by bit strings of length

, and suppose that we can compare elements of

in time

. Then an array of

elements of

may be sorted in time

using merge sort [22], which can be implemented efficiently on a Turing machine.

2.2.Discrete Fourier transforms

Let

be a commutative ring with identity and let

. An element

is said to be a principal -th root of unity if

and

for all

. In this case, we define the discrete Fourier transform (or DFT) of an

-tuple

with respect to

to be

where

Remark 2.1. In all of the new algorithms introduced in this paper, we actually work over a field, whose characteristic does not divide

. In this setting, the concept of principal root of unity coincides with the more familiar primitive root of unity.

2.3.The Cooley–Tukey FFT

Let

be a principal

-th root of unity and let

where

. Then

is a principal

-th root of unity and

is a principal

-th root of unity. Moreover, for any

and

, we have

and

are algorithms for computing DFTs of length

and

, we may use (2.2) to construct an algorithm

for computing DFTs of length

as follows.

For each

, the sum inside the brackets corresponds to the

-th coefficient of a DFT of the

-tuple

with respect to

. Evaluating these inner DFTs requires

calls to

. Next, we multiply by the twiddle factors

, at a cost of

operations in

. (Actually, fewer than

multiplications are required, as some of the twiddle factors are equal to

. This optimisation, while important in practice, has no asymptotic effect on the algorithms discussed in this paper.) Finally, for each

, the outer sum corresponds to the

-th coefficient of a DFT of an

-tuple in

with respect to

. These outer DFTs require

calls to

Denoting by

the number of ring operations needed to compute a DFT of length

, and assuming that we have available a precomputed table of twiddle factors, we obtain

The corresponding algorithm is denoted

. The

operation is neither commutative nor associative; the above expression will always be taken to mean

The above discussion requires several modifications in the Turing model. Assume that elements of

are represented by

bits.

First, for

, we must add a rearrangement cost of

to efficiently access the rows and columns for the recursive subtransforms (see section 2.1). For the general case

, the total rearrangement cost is bounded by

Second, we will sometimes use non-algebraic algorithms to compute the subtransforms, so it may not make sense to express their cost in terms of

. The relation (2.3) therefore becomes

where

is the (Turing) cost of a transform of length

over

, and where

is the cost of a single multiplication in

Finally, we point out that

requires access to a table of twiddle factors

, ordered lexicographically by

, for

. Assuming that we are given as input a precomputed table of the form

, we must show how to extract the required twiddle factor table in the correct order. We first construct a list of triples

, ordered by

, in time

; then sort by

in time

(see section 2.1); then merge with the given root table to obtain a table

, ordered by

, in time

; and finally sort by

in time

. The total cost of the extraction is thus

The corresponding cost for

is determined as follows. Assuming that the table

is given as input, we first extract the subtables of (

)-th roots of unity for

in time

. Extracting the twiddle factor table for the decomposition

then costs

; the total over all

is again

Remark 2.2. An alternative approach is to compute the twiddle factors directly in the correct order. When working over

, as in [18, Section 3], this requires a slight increase in the working precision. Similar comments apply to the root tables used in Bluestein's algorithm in section 2.5.

2.4.Fast Fourier multiplication

Let

be a principal

-th root of unity in

and assume that

is invertible in

. Consider two polynomials

and

. Let

be the polynomial defined by

where the product of the DFTs is taken pointwise. By construction, we have

, which means that

for all

. The product

and

modulo

also satisfies

for all

. Consequently,

, whence

For polynomials

with

and

, we thus obtain an algorithm for the computation of

modulo

using at most

operations in

. Modular products of this type are also called cyclic convolutions. If

, then we may recover the product

from its reduction modulo

. This multiplication method is called FFT multiplication.

If one of the arguments (say

) is fixed and we want to compute many products

(or cyclic convolutions) for different

, then we may precompute

, after which each new product

can be computed using only

operations in

2.5.Bluestein's chirp transform

We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs — of arbitrary length, not necessarily a power of two — to polynomial multiplication [5], as follows.

Let

be a principal

-th root of unity and consider an

-tuple

. First consider the case that

is odd. Let

In other words, the computation of a DFT of odd length

reduces to a cyclic convolution product of the same length

, together with

additional operations in

. Notice that the polynomial

is fixed and independent of

in this product.

Now suppose that

is even. In this case we require that

be invertible in

, and that

. Let

, and put

Again, the DFT reduces to a convolution of length

, together with

additional operations in

The only complication in the Turing model is the cost of extracting the

and

in the correct order. For example, consider the

in the case that

is odd. Given as input a precomputed table

, we may extract the

in time

by applying the strategy from section 2.3 to the pairs

for

. The other sequences are handled similarly. For the

in the case that

is even, we also need to perform

additions in

; assuming that additions in

have cost

, this is already covered by the above bound.

Remark 2.3. Bluestein's original formulation used the weights

. This has two drawbacks in our setting. First, it requires the presence of

-th roots of unity, which may not exist in

. Second, if

is odd, it leads to negacyclic convolutions, rather than cyclic convolutions. The variants used here avoid both of these problems.

Remark 2.4. It is also possible to give variants of the new multiplication algorithms in which Bluestein's transform is replaced by a different method for converting DFTs to convolutions, such as Rader's algorithm [29].

2.6.Kronecker substitution

Multiplication in

may be reduced to multiplication in

using the classical technique of Kronecker substitution [17, Corollary 8.27]. More precisely, let

and suppose that we are given two polynomials

of degree less than

. Let

be lifts of these polynomials, with coefficients

and

satisfying

. Then for the product

we have

, and the coefficients of

may be read off the integer product

where

. We deduce the coefficients of

by dividing each

. This shows that

. Using the assumption that

is increasing, we obtain

If we also assume that

, i.e., that the degree is not too large, then this becomes simply

A second type of Kronecker substitution reduces bivariate polynomial multiplication to the univariate case. Indeed, let

and

, and suppose that

, where

and

. We may then recover

from the univariate product

. Note that

and

have degree less than

, so the cost of the bivariate product is

The same method works for computing cyclic convolutions: to multiply

, the same substitution leads to a product in

. The cost is thus

, where

denotes the cost of a multiplication in

3.Arithmetic in finite fields

Let

be a prime and let

. In this section we review basic results concerning arithmetic in

and

We assume throughout that

is represented as

for some irreducible monic polynomial

of degree

. Thus an element of

is represented uniquely by its residue modulo

, i.e., by a polynomial

of degree less than

The above complexity bound is very pessimistic in practice. Better complexity bounds are known if we allow randomised algorithms, or unproved hypotheses. For instance, assuming GRH, the bound reduces to

[1].

We now consider the cost of arithmetic in

, assuming that

is given. Addition and subtraction in

have cost

. For multiplication we will always use the Schönhage–Strassen algorithm. Denote by

the cost of multiplying polynomials in

by this method, i.e., using the polynomial variant [31] for the polynomials themselves, followed by the integer version [32] to handle the coefficient multiplications. Then we have

Of course, we could use the new multiplication algorithm recursively for these products, but it turns out that Schönhage–Strassen is fast enough, and leads to a simpler complexity analysis in section 7. Let

denote the cost of dividing a polynomial in

, returning the quotient and remainder. Using Newton's method [17, Chapter 9] we have

. Thus elements of

may be multiplied in time

Let

be a divisor of

. To compute a DFT over

of length

, we must first have access to a primitive

-th root of unity in

. In general it is very difficult to find a primitive root of

, as it requires knowledge of the factorisation of

. However, to find a primitive

-th root, Lemma 3.3 below shows that it is enough to know the factorisation of

. The construction relies on the following existence result.

Proof. Testing whether a given

is a primitive

-th root of unity reduces to checking that

and

for every prime divisor

. According to Lemma 3.2, it suffices to apply this test to

for

running over all monic polynomials of degree

, where

is as in the lemma. The number of candidates is at most

, and each candidate can be tested in time

using binary exponentiation.

Of course, we can do much better if randomised algorithms are allowed, since

is reasonably likely to be a primitive

-th root for randomly selected

Finally, we consider polynomial multiplication over

. Let

, and let

. Let

be their natural lifts, i.e., of degree less than

with respect to

. The bivariate product

may be computed using Kronecker substitution (section 2.6) in time

. Writing

, to recover

we must divide each

. Denoting by

the complexity of the original multiplication problem, we obtain

As in section 2.6, the same method may be used for cyclic convolutions. If

denotes the cost of multiplication in

, then we get

Remark 3.4. In the other direction, we can also reduce multiplication in

to multiplication in

by splitting the inputs into chunks of size

. This leads to the bound

for

. A variant of this procedure is developed in detail in section 6.

4.Preparing for DFTs of large smooth orders

The aim of this section is to prove the following theorem, which allows us to construct “small” extensions

containing “many” roots of unity of “low” order. Recall that a positive integer is said to be -smooth if all of its prime divisors are less than or equal to

For example,

and

. Note that

is always squarefree and

-smooth. We are interested in finding

for which

is unusually large; that is, for which

has many divisors

such that

happens to be prime. Most of the heavy lifting is done by the following remarkable result.

Proof. This is part of [2, Theorem 3]. (The threshold

could of course be replaced by any fixed power of

. It is stated this way in [2] because that paper is concerned with primality testing.)

Proof. We take

divides

, and otherwise

. In the former case we must have

, so in both cases

. To see that

, consider any prime divisor

. Then

, so

by Fermat's little theorem.

Remark 4.4. The integer

constructed in Lemma 4.3 only takes into account the structure of

, and ignores

itself. In practice,

will often have small factors other than those in

, possibly including repeated factors (which are never detected by

). For example,

, but

. In an implementation, one should always incorporate these highly valuable “accidental” factors into

. We will ignore them in our theoretical discussion.

Next we give a simple sieving algorithm that tabulates approximations of

for all

up to a prescribed bound.

Proof. Initialise a table of

with

for

. Since

, it suffices to set aside

bits for each

For each integer

, perform the following steps. First test whether

is prime, discarding it if not. Using trial division, the cost is

, so

overall. Now assume that

is found to be prime. Using a fast algorithm for computing logarithms [8], compute an integer

such that

, where

, in time

. Scan through the table, replacing

for those

divisible by

, i.e., every

-th entry, in time

. The total cost for each prime

, and there are

primes, so the overall cost is

. At the end, we divide each

, yielding the required approximations satisfying (4.1).

Proof of Theorem 4.1. We are given

and

as input. Applying Lemma 4.2 with

, we find that for large enough

there exists some

with

for large

. We may compute the prime factorisation of

by simply enumerating the primes

, and checking whether

divides

for each

. This can be done in time

The main multiplication algorithm depends on a reduction of a “long” DFT to many “short” DFTs. It is essential to have some control over the long and short transform lengths. The following result packages together the prime divisors of the abovementioned

, to obtain a long transform length

near a given target

, and short transform lengths

near a given target

Proof. Let

be the prime decomposition of

. Taking

minimal with

, we ensure that (a) and (b) are satisfied. At this stage we have

for all

. As long as the tuple

contains an entry

with

, we pick the two smallest entries

and

and replace them by a single entry

. Clearly, this does not alter the product

of all entries, so (a) and (b) continue to hold. Furthermore, as long as there exist two entries

and

with

and

, new entries

will always be smaller than

. Only at the very last step of the loop, the second smallest entry

might be larger than

. In that case, the product of the entry

with

is still bounded by

. This shows that condition (c) is satisfied at the end of the loop.

Determining

requires at most

multiplications of integers less than

. There are at most

iterations of the replacement loop. Each iteration must scan through at most

integers of bit size

and perform one multiplication on such integers.

5.Logarithmically slow recurrence inequalities

Let

be a smooth increasing function, for some

. We say that

is an iterator of

is increasing and if

For instance, the standard iterated logarithm

defined in (1.2) is an iterator of

. An analogous iterator may be defined for any smooth increasing function

for which there exists some

such that

for all

. Indeed, in that case,

is well-defined and satisfies (5.1) for all

. It will sometimes be convenient to increase

so that

is satisfied on the whole domain of

for

. For example, the functions

and

are logarithmically slow, with

respectively.

In this paper, the main role played by logarithmically slow functions is to measure size reduction in multiplication algorithms. In other words, multiplication of objects of size

will be reduced to multiplication of objects of size

, where

for some logarithmically slow function

. The following result asserts that, from the point of view of iterators, such functions are more or less interchangeable with

The next result is our main tool for converting recurrence inequalities into actual asymptotic bounds for solutions.

6.The Crandall–Fagin algorithm

Consider the problem of computing the product of two

-bit integers modulo

. The naive approach is to compute their ordinary

-bit product and then reduce modulo

. The reduction cost is negligible because of the special form of

. If

is divisible by a high power of two, one can save a factor of two by using the fact that FFTs naturally compute cyclic convolutions.

An ingenious algorithm of Crandall and Fagin [13] allows for the gain of this precious factor of two for arbitrary . Their algorithm is routinely used in the extreme case where

is prime, in the large-scale GIMPS search for Mersenne primes (see http://www.mersenne.org/).

A variant of the Crandall–Fagin algorithm was a key ingredient of our algorithm for integer multiplication that conjecturally achieves the bound

[18, Section 9]. In this section we present yet another variant, for computing products in

6.1.The Crandall–Fagin reduction for polynomials

Let

and

be positive integers with

and

. Our aim is to reduce multiplication in

to multiplication in

. In the applications in subsequent sections,

will be a divisor of

with many small factors, so that we can multiply efficiently in

using FFTs over

Suppose that

is represented as

, for some irreducible

of degree

. The reduction relies on the existence of an element

such that

. (This is the analogue of the real

-th root of

appearing in the usual Crandall–Fagin algorithm, which was originally formulated over

.) It is easy to see that such

may not always exist; for example, there is no cube root of

. Nevertheless, the next result shows that if

is large enough, then we can always find some modulus

for which a suitable

exists.

Proof. We first observe that if

is a primitive root, then

cannot lie in a proper subfield of

. (This property is independent of

.) Indeed, if

for some proper divisor

, then every

-th power in

lies in

. This contradicts the fact that the number of

-th powers in

is at least

Now we give an algorithm for computing

and

. We start by using Lemma 3.1 to compute an irreducible

of degree

in time

, and we temporarily represent

Using Lemma 3.2, we may construct a list of

candidates for primitive roots of

(see the proof of Lemma 3.3). For each candidate

, we compute

and its powers

. If these are linearly dependent over

then

belongs to a proper subfield (and so, as shown above,

is not actually a primitive root). Thus we must eventually encounter some

for which they are linearly independent. We take

to be the minimal polynomial of this

, which may be computed by finding a linear relation among

. Let

be the isomorphism that sends

. The matrix of

with respect to the standard bases

and

is given by the coefficients of

. The inverse of this matrix yields the matrix of

. We then set

, so that

For each candidate

, the cost of computing the necessary powers of

, and the various linear algebra steps require time

using classical matrix arithmetic.

In the remainder of this section, we fix some

and

as in Proposition 6.1, and assume that

is represented as

. Suppose that we wish to compute the product of

. The presentation here closely follows that of [18, Section 9.2]. We decompose

and

so that

. For any

, define

as follows. Choose

so that

lies in the interval

, and put

Since the left hand side lies in the interval

, this shows that

. Now, since

, we have

Since

and similarly for

, we have

. Note that we may recover

from

in time

, by a standard overlap-add procedure (provided that

Now, regarding

and

as elements of

by sending

, define polynomials

and

for

, and let

be their (cyclic) product. Then

coincides with the reinterpretation of

as an element of

. Moreover, we may recover

unambiguously from

, as

and

. Altogether, this shows how to reduce multiplication in

to multiplication in

Remark 6.2. The pair

can be computed from

bit operations, so we may compute the sequences

and

in time

. Moreover, since

takes on only two possible values, we may compute the sequence

using

multiplications in

7.The main algorithm

Consider the problem of computing

products

with

, i.e.,

products with one fixed operand. Denote the cost of this operation by

. Our algorithm for this problem will perform

forward DFTs and

inverse DFTs, so it is convenient to introduce the normalisation

This is well-defined since clearly

. Roughly speaking,

may be thought of as the notional cost of a single DFT.

The problem of multiplying two polynomials in

of degree less than

may be reduced to the above problem by using zero-padding, i.e., taking

and

. Since

, we obtain

. Thus it suffices to obtain a good bound for

The next theorem gives the core of the new algorithm for the case that

is large relative to

Proof. We wish to compute

products

with

, for some sufficiently large

. We assume throughout that

and so that

, and

-smooth. We choose long and short target transform lengths

and

. For large enough

we then have

, so we may apply Theorem 4.6. This yields

-smooth integers

, with known factorisations, such that

divides

and lies in the range

, and such that

for all

. Finally we set

, so that

also divides

, and we put

. All of these parameters may be computed in time

Crandall–Fagin reduction. Let us check that the hypotheses of section 6.1 are satisfied, to enable the reduction to multiplication in

. We certainly have

and

. For Proposition 6.1, observe that

, so

for large

. Thus we obtain an irreducible

of degree

, and an element

such that

, in time

. Each multiplication in

has cost

(see section 3).

Computing the sequences

and

costs

, and computing the sequence

costs

. The initial splitting and final overlap-add phases require time

, and the multiplications by

and

cost

Long transforms. The factorisation of

is known, and

divides

, so by Lemma 3.3 we may compute a primitive

-th root of unity

in time

We will multiply in

by using DFTs with respect to

. The table of roots

may be computed in time

. In a moment, we will describe an algorithm

for computing a “short” DFT of length

with respect to

; we then use the algorithm

for the main transform of length

(see section 2.3). The corresponding twiddle factor tables may be extracted in time

Let

denote the complexity of

, and for

let

denote the cost of performing

independent DFTs of length

using

. Then by (2.4) we have

Bluestein conversion. We now begin constructing

, assuming that

independent transforms are sought. We first use Bluestein's algorithm (section 2.5) to convert each DFT of length

to a multiplication in

. We must check that

is invertible in

is even; indeed, if

is even, then so is

, so

. The Bluestein conversion contributes

to the cost of

We must also compute a suitable table of roots, once at the top level. We first extract the table

from the top level table in time

, and then sort them into the correct order (and perform any necessary additions) in time

. Over all

the cost is

Kronecker substitution. We finally convert each multiplication in

to a multiplication in

using Kronecker substitution (see section 2.6). The latter multiplications have cost

, since one argument is fixed. After the multiplications, we must also perform

divisions by

to recover the results in

, at cost

. Consolidating the estimates for the Bluestein conversion and Kronecker substitution, we have

To compute the desired

products, we must execute

forward transforms, and

inverse transforms. Each product also requires

pointwise multiplications in

and

multiplications by

. These have cost

, which is absorbed by the

term. Thus we obtain

Dividing by

, taking suprema over

, and using (7.2) and (7.3), yields the bound (7.1).

Let

and

be as in Theorem 7.1. Increasing

if necessary, we may assume that

for

, and that

Let

for each

. We will consider the regions

and

separately. First consider the case

. There are only finitely many primes

, so we may assume that

and that

. In this region we use Kronecker substitution: by (2.5) and (1.1) we have

Now consider the case

. Here we invoke Theorem 7.1 to obtain absolute constants

such that for every

, there exist

and

such that

Set

. Applying Proposition 5.3 with

, we find that

uniformly for

. Since

and

, we conclude that

uniformly for

8.Conjecturally faster multiplication

Recall that in [18] we established the bound

with

for the complexity of integer multiplication. We also proved that this can be improved to

, assuming the following slight weakening of the Lenstra–Pomerance–Wagstaff conjecture on the distribution of Mersenne primes [28, 35] (see [18, Section 9] for further discussion).

The source of the (conjectural) speedup is as follows. The

algorithm of [18, Section 6] computes DFTs over

, and so we encounter the “short product obstruction”. Namely, to compute the product of two real numbers with

significant bits using FFT algorithms, we are forced to compute a full product of two

-bit integers, and then truncate the result to

bits. To achieve

, we replaced the coefficient ring

by the finite field

, where

is a Mersenne prime. Multiplication modulo

is a “cyclic product” rather than a short product, and this saves a factor of two at each recursion level.

The aim of this section is to outline a credible strategy for achieving the same improvement, from

, in the context of multiplication in

, at least under certain plausible number-theoretic hypotheses.

Consider the problem of multiplying polynomials in

of degree less than

. In the Kronecker substitution region, i.e., for

, we can of course achieve

if we assume Conjecture 8.1.

Now consider the region where

is much larger than

. In the algorithm of section 7, we reduced the multiplication problem to computing products in

, where

for some monic irreducible

of degree

, and where

is a “short” transform length. These multiplications were in turn converted to multiplications in

via Kronecker substitution. It is exactly this factor of two in zero-padding that we wish to avoid.

Taking our cue from the integer case, we observe that if the modulus

is of a particularly special form, then this factor of two can be eliminated. Indeed, suppose that

Then

is isomorphic to

, which is a quotient of

. Assuming further that

, the latter ring is isomorphic to

. In other words, we can reduce a cyclic convolution over

of length

to a cyclic convolution over

of length

, rather than length

. This saves a factor of two at this recursion level; if we can manage something similar at every recursion level, we reduce

from

The essential question is therefore how to choose

so that

is irreducible over

. The following lemma gives a simple number-theoretic characterisation of such

Proof. If

is a nontrivial factor of

, then

is divisible by

, so

has the nontrivial factor

. On the other hand, suppose that

is prime. If

, then

is not irreducible; otherwise, the number of irreducible factors of

over

is exactly

, where

is the order of

[36, Theorem 2.13 and Proposition 2.14] (this last statement is sometimes called the “cyclotomic reciprocity law”).

A first attempt to reach

might run as follows. First choose

as in the proof of Theorem 7.1, so that

has many roots of unity of smooth order. Then use the same multiplication algorithm as before, but now working over

, where

is the smallest positive multiple of

such that

is prime and such that

is a primitive root modulo

First, if

, there may be a trivial obstruction to the existence of suitable

. For example, if

and

, then for any

we have

by quadratic reciprocity, so

cannot be a primitive root modulo

. The only way to avoid this seems to be to insist that

not be divisible by

in the first place. Therefore we propose the following strengthening of Lemma 4.2.

It seems likely that this conjecture should be accessible to specialists in analytic number theory. Experimentally, the “missing” factor

does not seem to have much effect on the propensity of

to have many divisors

such that

is prime. We could not find a quick way to prove the conjecture directly from Lemma 4.2.

The second problem is more serious. A special case of Artin's conjecture on primitive roots (see [25] for a survey) asserts that for any prime

, there are infinitely many primes

for which

is a primitive root modulo

. Unfortunately, there is not even a single prime

for which Artin's conjecture is known to hold! Therefore, in general we cannot prove existence of a suitable

However, we do have the following result of Lenstra, which states that the only obstruction to the existence of

is the trivial one noted above, provided we are willing to accept GRH.

Proof. This is a special case of [23, Theorem 8.3]. (For a description of exactly which number fields are supposed to satisfy GRH for this result to hold, see [23, Theorem 3.1].)

The third problem is that existence of

is not enough; we also need a bound on its size, as a function of

and

. The literature provides little guidance on this question. Under the hypotheses of Lemma 8.4 (including GRH), it is known that the number of primes

such that

and for which

is a primitive root modulo

grows as

for some positive

(see for example [24, Theorem 2]). Both

and the implied big-

constant depend on

and

. While the value of

is reasonably well understood, we could not find in the literature a similarly precise description of the implied constant, so we have been unable to derive a bound for

In the interests of getting our conjectural algorithm off the ground, we make the following guess. Define

to be the smallest prime

such that

and such that

is a primitive root modulo

(assuming that such

exists).

We have deliberately given an absurdly weak formulation of this conjecture, so as to provide the largest possible target for analytic number theorists. For example, it would be enough to prove that

for all

. We suspect that the conjecture is accessible under GRH; presumably a proof would require analysing the implied constant in (8.1).

Remark 8.6. The reason we need

to be large compared to

is that it is easy to construct, using the Chinese remainder theorem, a prime

which fails to be a primitive root modulo

for all primes

up to a prescribed bound.

Proof of Theorem 1.2. The proof is similar to that of Theorem 1.1. We will only give a sketch, highlighting the main differences.

In the region

, we use Kronecker substitution together with [18, Theorem 2], which states that

under Conjecture 8.1.

Now assume that

. We will pursue a strategy similar to Theorem 7.1, recursing from

down to an exponentially smaller

. Assume that we wish to compute

products in

with one fixed operand.

Using an appropriate modification of Theorem 4.1, in which the use of Lemma 4.2 is replaced by Conjecture 8.3, we find integers

and

such that

(or

, and

-smooth.

Now we wish to apply Conjecture 8.5 to construct suitable

. However, that conjecture requires that

. If we have instead

, we simply use the algorithm of Theorem 7.1. This yields the expansion factor

, but only for

recursion levels (depending on

), since

. So it is permissible to assume that

, losing only a constant factor in the main complexity bound for the last few recursion levels.

Applying Conjecture 8.5 and Lemma 8.2, we obtain a positive multiple

, such that

is prime,

, and

is irreducible in

. We will take

as our model for

; note that

Notice that

for some logarithmically slow function

Indeed, there exists

and

such that

and

for large

. Increasing

if necessary, we get

, the latter being a logarithmically slow function of

In the proof of Theorem 7.1, we selected the transform length

first, and then chose

to fine-tune the total bit size. Here we have less control over

, so we must use a different strategy. Choose target long and short transform lengths

and

. Applying Theorem 4.6, we obtain

-smooth integers

such that

divides

and

for each

, and such that

for large

. Let

, and let

be the smallest prime greater than

and different to

. By [4] we have

, so

We now apply the Crandall–Fagin algorithm to reduce multiplication in

to multiplication in

. The prerequisite

is satisfied (since

is even). Instead of using Proposition 6.1 to construct

satisfying

, we may simply take

. The required modular inverse exists, as

and

are distinct primes, and

since

-smooth.

We now take advantage of the isomorphism

, where

. We multiply in

by using DFTs over

: one DFT of length

over

reduces to

DFTs of length

over

. The latter are handled by decomposing into short transforms of length

, which are in turn converted to multiplications in

via Bluestein's algorithm. Finally –- and this is where all the hard work pays off –- each such multiplication reduces to a multiplication in

, since

. The multiplications in

itself are handled using the algebraic Schönhage–Strassen algorithm.

We omit the rest of the complexity argument, which is essentially the same as that of Theorem 7.1 and Theorem 1.1. We mention only that

was chosen sufficiently large that the multiplications in

and in

make a negligible contribution overall.

9.Notes and generalisations

In this section we outline some directions along which the results in this paper can be extended. We also provide some hints concerning the practical usefulness of the new ideas. Our treatment is more sketchy and we plan to provide more details in a forthcoming paper.

9.1.Multiplication of polynomials over

Recall that

denotes the cost of multiplying polynomials in

of degree less than

, where we assume that some model

for

has been fixed in advance.

where

denotes the cost of dividing a polynomial of degree less than

. Having established the bound

, it is now permissible to assume that

is increasing, so the usual argument for Newton iteration shows that

. Using again that

is increasing, we obtain

9.2.Multiplication of polynomials over

For any prime

and any integer

, denote by

the bit complexity of multiplying polynomials in

For

we may simply use Kronecker substitution. For

we must modify the algorithm of section 7. Recall that in that algorithm we reduced multiplication in

to multiplication

, where

is a transform length dividing

. Our task is to define a ring

, analogous to

, so that multiplication in

can be reduced to multiplication in

Let

be the natural projection

, and write

also for the corresponding map

. Let

be any monic irreducible polynomial of degree

, and let

be an arbitrary lift via

, also monic of degree

. Then

is irreducible, and we will take

Any primitive

-th root of unity

has a unique lift to a principal

-th root of unity

. (This can be seen, for example, by observing that

must be the image in

of a Teichmüller lift of

in the

-adic field whose residue field is

.) Given

, we may compute

efficiently using fast Newton lifting [11, Section 12.3].

Moreover, if we choose

and

as in Proposition 6.1, so that

, then fast Newton lifting can also be used to obtain

such that

9.3.Multiplication of polynomials over

We use the isomorphism

, where

is the prime decomposition of

. The cost of converting between the

and

representations is

[17, Section 10.3]. By Theorem ? we get

9.4.Complexity bounds for straight-line programs

Until now we have considered only complexity bounds in the Turing model. The new techniques also lead to improved bounds in algebraic complexity models. In what follows,

is always a commutative ring with identity.

For the simplest case, first assume that

is an

-algebra for some prime

. In the straight-line program model [9, Chapter 4], we count the number of additions and subtractions in

, the number of scalar multiplications (multiplications by elements of

), and the number of nonscalar multiplications (multiplications in

). As pointed out in the introduction, the best known bound for the total complexity was previously

, which was achieved by the Cantor–Kaltofen algorithm.

The idea of the proof is to use the same algorithm as in section 7, but instead of switching to Kronecker substitution when we reach

, we simply recurse all the way down to

. The role of the extension

is played by

, where

is monic and irreducible of degree

(here we have used the fact that

may be viewed as a subring of

, since

contains an identity element and hence a copy of

). Thus we reduce multiplication in

to multiplication in

. The latter multiplication may be handled by DFTs over

, since any primitive

-th root of unity in

corresponds naturally to a principal

-th root of unity in

The

bound covers the cost of the DFTs, which are accomplished entirely using additions, subtractions and scalar multiplications. Nonscalar multiplications are needed only for the pointwise multiplication step. To explain the

bound, we observe that at each recursion level the total “data size” grows by a factor of four: one factor of two arises from the Crandall–Fagin splitting, and another factor of two from the Kronecker substitution.

Note that in the straight-line program model, we give a different algorithm for each

. Thus all precomputed objects, such as the defining polynomial

and the required roots of unity, are obtained free of cost (they are built directly into the structure of the algorithm). The uniformity in

follows from the fact that the bounds for

and

in Theorem 4.1 are independent of

, although of course the algorithm will be different for each

. For each fixed

, we may use the same straight-line program for all

-algebras

Assuming Conjectures 8.3 and 8.5, the bounds may be improved to respectively

and

. However, in this case we lose uniformity in

, due to the issue raised in Remark 8.6.

Theorem 9.4 can be generalised to

-algebras, along the same lines as section 9.2. It is also possible to handle

-algebras for any integer

, i.e., rings of finite characteristic, but we cannot proceed exactly as in section 9.3, because the straight-line model has no provision for a “reduction modulo

” operation. Instead, we use a device introduced in [10]. Suppose that

and that

is now a

-algebra. For each

we may construct a straight-line program

that takes as input polynomials

of degree less than

, where

is any

-algebra, and computes

. By replacing every constant in

by a compatible constant in

, we obtain a straight-line program

that takes as input

of degree less than

, and computes

such that

. By the Chinese remainder theorem we may choose

such that

and

for each

. The linear combination

is then equal to

. We conclude that we may multiply polynomials in

using

additions, subtractions and scalar multiplications (by elements of

), and

nonscalar multiplications (i.e., multiplications in

). These bounds are not uniform in

, but for each

they are uniform over all

-algebras.

9.5.Other algebraic models

If we wish to take into account the cost of precomputations, and give a single algorithm that works uniformly for all

and

, we may use a more refined complexity model such as the BSS model [6]. In this model, a machine over is, roughly speaking, a Turing machine in which the tape cells hold elements of

. Actually, we need a multi-tape version of the model described in [6]. The machine can perform arithmetic operations on elements of

in unit time, but can also manipulate data such as index variables in the same way as a Turing machine, and must deal with data layout in the same way as a Turing machine. In this model we obtain similar bounds to Theorem 9.4, but without uniformity in

. Alternatively, we could obtain bounds uniform in

if we added extra terms to account for the precomputations.

Another point of view in Theorem 9.4 is that we have described a new evaluation-interpolation strategy for polynomials over an

-algebra

. We refer the reader to [21, Sections 2.1–2.4] for classical examples of evaluation-interpolation schemes, and [16] for algorithms specific to finite fields. Such schemes are characterised by two quantities: the evaluation/interpolation complexity

and the number

of evaluation points. The new algorithms yield the bounds

and

, and these can be used to prove complexity bounds for problems more general than polynomial multiplication. For example, we can multiply

matrices with entries in

using

ring operations, where

is an exponent of matrix multiplication. One of the main advantages of our algorithms is that

is almost linear, contrary to synthetic FFT methods [10, 31] derived from Schönhage–Strassen multiplication [32], which achieve only

In the setting of bilinear complexity [9, Chapter 14], the new algorithms do not improve asymptotically on the best known bounds. For example, it is known [27] that for any

there exist

-linear maps

and polynomials

for

, with

, such that

for all

. The new method yields the asymptotically inferior bound

. Nevertheless, these bounds are asymptotic and do not take into account that our new algorithm would rather work over (say)

instead of

. In practice, the new method might therefore outperform the bilinear algorithms from [27].

9.6.Implementations

The new complexity bounds have a quite theoretic flavour: in practice, slowly growing functions such as

and

really behave as constants. Furthermore, for concrete applications, it would be more relevant to redefine

With this definition, we observe that

for all feasible values of

in the foreseeable future. Nevertheless, we think that variants of the ideas in this paper could still be useful in practice.

Let us briefly sketch how an efficient multiplication algorithm over

could be implemented. Various modern processors offer a special instruction for the multiplication of two polynomials in

. The practical translation of the results from section 4 is to work over the field

where

. This has two advantages. First, if we represent elements of

by their residues modulo

, i.e., using a redundant representation, then each multiplication in

requires a single hardware multiply instruction, followed by simple shift and XOR instructions. Second, there exist primitive roots of unity of high smooth order

Transforms of small prime lengths

over

can be implemented using specialised codelets, similar to those implemented in FFTW3 [14]. Roughly speaking, the constant factor of a radix-

DFT is

. In our case,

, so specialised codelets should remain reasonably efficient for

. Consequently, through the mere use of such small radices, we obtain efficient algorithms for computing DFTs of lengths dividing

. To handle larger primes, we may use any algorithm for converting DFTs to convolutions. Rather than use Bluestein's algorithm as in the theoretical portion of the paper, it is more economical in this case to use Rader's algorithm, which reduces a DFT of length

to a cyclic convolution of length

, together with

additions in

(see also Remark 2.4).

More precisely, consider the remaining transform lengths

and

. We have

and

, so in each case, the DFT reduces to a two-dimensional convolution of size

, where

is a divisor of

and

. Such a convolution in turn reduces to

DFTs of length

together with

convolutions of length

. The latter cannot be handled using DFTs over

, since there are no roots of unity of order

, but in any case they are all very short convolutions and could be performed directly, or by using Karatsuba's algorithm.

Transforms of other lengths dividing

may be reduced to these cases via the Cooley–Tukey algorithm. Finally, to handle arbitrary input lengths efficiently, we may use a mixed-radix generalisation of the truncated Fourier transform [19, 20].

We also reemphasise the fact that DFTs of this kind can be used as an evaluation-interpolation technique. This approach is for instance attractive for the multiplication of polynomial matrices over

We have not yet implemented any of these algorithms. We expect that much fine-tuning will be necessary to make them most effective in practice. We intend to report on this issue in a future work.

Bibliography

[1]: L. M. Adleman and H. W. Lenstra Jr. Finding irreducible polynomials over finite fields. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, STOC '86, pages 350–355. ACM Press, 1986.
[2]: L. M. Adleman, C. Pomerance, and R. S. Rumely. On distinguishing prime numbers from composite numbers. Annals of Math., 117:173–206, 1983.
[3]: A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The design and analysis of computer algorithms. Addison-Wesley, 1974.
[4]: R. C. Baker, G. Harman, and J. Pintz. The difference between consecutive primes. II. Proc. London Math. Soc. (3), 83(3):532–562, 2001.
[5]: L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
[6]: L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and real computation. Springer-Verlag, 1998.
[7]: A. Bostan, P. Gaudry, and É. Schost. Linear recurrences with polynomial coefficients and application to integer factorization and Cartier-Manin operator. SIAM J. Comput., 36:1777–1806, 2007.
[8]: R. P. Brent. Fast multiple-precision evaluation of elementary functions. J. Assoc. Comput. Mach., 23(2):242–251, 1976.
[9]: P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic complexity theory. Springer-Verlag, 1997.
[10]: D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Infor., 28:693–701, 1991.
[11]: H. Cohen, G. Frey, R. Avanzi, Ch. Doche, T. Lange, K. Nguyen, and F. Vercauteren, editors. Handbook of elliptic and hyperelliptic curve cryptography. Discrete Mathematics and its Applications. Chapman & Hall/CRC, Boca Raton, FL, 2006.
[12]: J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
[13]: R. Crandall and B. Fagin. Discrete weighted transforms and large-integer arithmetic. Math. Comp., 62(205):305–324, 1994.
[14]: M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93(2):216–231, 2005.
[15]: M. Fürer. Faster integer multiplication. In Proceedings of the Thirty-Ninth ACM Symposium on Theory of Computing, STOC 2007, pages 57–66, New York, NY, USA, 2007. ACM Press.
[16]: S. Gao and T. Mateer. Additive fast Fourier transforms over finite fields. IEEE Trans. Inform. Theory, 56(12):6265–6272, 2010.
[17]: J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, 2nd edition, 2002.
[18]: D. Harvey, J. van der Hoeven, and G. Lecerf. Even faster integer multiplication. Technical report, HAL, 2014. http://hal.archives-ouvertes.fr.
[19]: J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290–296, Univ. of Cantabria, Santander, Spain, July 4–7 2004.
[20]: J. van der Hoeven. Notes on the Truncated Fourier Transform. Technical Report 2005-5, Université Paris-Sud, Orsay, France, 2005.
[21]: J. van der Hoeven. Newton's method and FFT trading. J. Symbolic Comput., 45(8):857–878, 2010.
[22]: D. E. Knuth. The art of computer programming, volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1998.
[23]: H. W. Lenstra, Jr. On Artin's conjecture and Euclid's algorithm in global fields. Invent. Math., 42:201–224, 1977.
[24]: P. Moree. On primes in arithmetic progression having a prescribed primitive root. J. Number Theory, 78(1):85–98, 1999.
[25]: P. Moree. Artin's primitive root conjecture—a survey. Integers, 12(6):1305–1416, 2012.
[26]: C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[27]: J. Pieltant and H. Randriam. New uniform and asymptotic upper bounds on the tensor rank of multiplication in extensions of finite fields. Math. Comp., 2014. To appear.
[28]: C. Pomerance. Recent developments in primality testing. Math. Intelligencer, 3(3):97–105, 1980/81.
[29]: C. M. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56(6):1107–1108, June 1968.
[30]: K. R. Rao, D. N. Kim, and J. J. Hwang. Fast Fourier Transform - Algorithms and Applications. Signals and Communication Technology. Springer-Verlag, 2010.
[31]: A. Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. Acta Infor., 7:395–398, 1977.
[32]: A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[33]: V. Shoup. New algorithms for finding irreducible polynomials over finite fields. Math. Comp., 189:435–447, 1990.
[34]: V. Shoup. Searching for primitive roots in finite fields. Math. Comp., 58:369–380, 1992.
[35]: S. Wagstaff. Divisors of Mersenne numbers. Math. Comp., 40(161):385–397, 1983.
[36]: L. C. Washington. Introduction to cyclotomic fields, volume 83 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1997.

1.Introduction

1.1.Our contributions

1.2.Outline of the paper

2.Survey of classical tools

2.1.Arrays and sorting

2.2.Discrete Fourier transforms

2.3.The Cooley–Tukey FFT

2.4.Fast Fourier multiplication

2.5.Bluestein's chirp transform

2.6.Kronecker substitution

3.Arithmetic in finite fields

4.Preparing for DFTs of large smooth orders

5.Logarithmically slow recurrence inequalities

6.The Crandall–Fagin algorithm

6.1.The Crandall–Fagin reduction for polynomials

7.The main algorithm

8.Conjecturally faster multiplication

9.Notes and generalisations

9.1.Multiplication of polynomials over

9.2.Multiplication of polynomials over

9.3.Multiplication of polynomials over

9.4.Complexity bounds for straight-line programs

9.5.Other algebraic models

9.6.Implementations

Bibliography