Polynomial multiplication over finite fieldsin time O(n*log n)

Let

denote the bit complexity of multiplying two integers of at most

bits in the deterministic multitape Turing model [16]. Similarly, let

denote the bit complexity of multiplying two polynomials of degree

. Here

is the finite field with

elements, so that

for some prime number

and integer

. For computations over

, we tacitly assume that we are given a monic irreducible polynomial

of degree

, and that elements of

are represented by polynomials in

of degree

. Notice that multiplication in

can also be regarded as “carryless” integer multiplication in base

The best currently known bounds for

and

were obtained in [21] and [23]. For constants

and

, we proved there that

For the statement of our new results, we need to recall the concept of a “Linnik constant”. Given two integers

and

with

, we define

We say that a number

is a Linnik constant if

. The smallest currently known Linnik constant

is due to Xylouris [57]. It is widely believed that any number

is actually a Linnik constant; see section 5 for more details.

Theorem 1.2 is our main result. Notice that it implies

in terms of the bit-size

of the inputs. In the companion paper [22], we show that the bound

actually holds unconditionally, thereby superseding Theorem 1.1. Since the proof of Theorem 1.1 is a good introduction to the proof of Theorem 1.2 and also somewhat shorter than the proof of the unconditional bound

in [22], we decided to include it in the present paper. The assumption

being satisfied “in practice”, it is also conceivable that variants of the corresponding algorithm may lead to faster practical implementations.

This companion paper [22] uses some of the techniques from the present paper, as well as a new ingredient: Gaussian resampling. This technique makes essential use of properties of real numbers, explaining why we were only able to apply it to integer multiplication. By contrast, the techniques of the present paper can be used in combination with both complex and number theoretic Fourier transforms. For our presentation, we preferred the second option, which is also known to be suitable for practical implementations [17]. It remains an open question whether Theorem 1.2 can be replaced by an unconditional bound.

In the sequel, we focus on the study of

and

from the purely theoretical perspective of asymptotic complexity. In view of past experience [17, 26, 32], variants of our algorithms might actually be relevant for machine implementations, but these aspects will be left aside for now.

1.1.Brief history and related work

Integer multiplication.The development of efficient methods for integer multiplication can be traced back to the beginning of mathematical history. Multiplication algorithms of complexity

were already known in ancient civilizations, whereas descriptions of methods close to the ones that we learn at school appeared in Europe during the late Middle Ages. For historical references, we refer to [54, section II.5] and [41, 6].

The first more efficient method for integer multiplication was discovered in 1962, by Karatsuba [34, 33]. His method is also the first in a family of algorithms that are based on the technique of evaluation-interpolation. The input integers are first cut into pieces, which are taken to be the coefficients of two integer polynomials. These polynomials are evaluated at several well-chosen points. The algorithm next recursively multiplies the obtained integer values at each point. The desired result is finally obtained by pasting together the coefficients of the product polynomial. This way of reducing integer multiplication to polynomial multiplication is also known as Kronecker segmentation [27, section 2.6].

Karatsuba's original algorithm cuts each input integer in two pieces and then uses three evaluation points. This leads to an algorithm of complexity

. Shortly after the publication of Karatsuba's method, it was shown by Toom [56] that the use of more evaluation points allowed for even better complexity bounds, with further improvements by Schönhage [47] and Knuth [35].

The development of efficient algorithms for the required evaluations and interpolations at large sets of points then became a major bottleneck. Cooley and Tukey's rediscovery [9] of the fast Fourier transform (FFT) provided the technical tool to overcome this problem. The FFT, which was essentially already known to Gauss [29], can be regarded as a particularly efficient evaluation-interpolation method for special sets of evaluation points. Consider a ring

with a principal

-th root of unity

(see section 2.2 for detailed definitions; if

, then one may take

). Then the FFT permits evaluation at the points

using only

ring operations in

. The corresponding interpolations can be done with the same complexity, provided that

is invertible in

. Given

with

, it follows that the product

can be computed using

ring operations as well.

The idea to use fast Fourier transforms for integer multiplication is independently due to Pollard [45] and Schönhage–Strassen [49]. The simplest way to do this is to take

and

, while approximating elements of

with finite precision. Multiplications in

itself are handled recursively, through reduction to integer multiplications. This method corresponds to Schönhage and Strassen's first algorithm from [49] and they showed that it runs in time

for all

. Pollard's algorithm rather works with

, where

is a prime of the form

. The field

indeed contains primitive

-th roots of unity and FFTs over such fields are sometimes called number theoretic FFTs. Pollard did not analyze the asymptotic complexity of his method, but it can be shown that its recursive use for the arithmetic in

leads to a similar complexity bound as for Schönhage–Strassen's first algorithm.

The paper [49] also contains a second method that achieves the even better complexity bound

. This algorithm is commonly known as the Schönhage–Strassen algorithm. It works over

, where

is a Fermat number, and uses

as a principal

-th root of unity in

. The increased speed is due to the fact that multiplications by powers of

can be carried out in linear time, as they correspond to simple shifts and negations.

The Schönhage–Strassen algorithm remained the champion for more than thirty years, before being superseded by Fürer's algorithm [12]. In short, Fürer managed to combine the advantages of the two algorithms from [49], to achieve the bound (1.1) for some unspecified constant

. In [25, section 7], it has been shown that an optimized version of Fürer's original algorithm achieves

. In a succession of works [25, 18, 19, 21], new algorithms were proposed for which

, and

. In view of the companion paper [22], we now know that

Historically speaking, we also notice that improvements on the lowest possible values of

and

were often preceded by improvements modulo suitable number theoretic assumptions. In [25, section 9], we proved that one may take

under the assumption that there exist “sufficiently many” Mersenne primes [25, Conjecture 9.1]. The same value

was achieved in [10] modulo the existence of sufficiently many generalized Fermat primes. In [20], we again reached

under the assumption that

, where

is the Euler totient function.

Polynomial multiplication.To a large extent, the development of more efficient algorithms for polynomial multiplication went hand in hand with progress on integer multiplication. The early algorithms by Karatsuba and Toom [34, 56], as well as Schönhage–Strassen's second algorithm from [49], are all based on increasingly efficient algebraic methods for polynomial multiplication over more or less general rings

More precisely, let

be the number of ring operations in

that are required for the multiplication of two polynomials of degree

. A straightforward adaptation of Karatsuba's algorithm for integer multiplication gives rise to the algebraic complexity bound

The subsequent methods typically require mild hypotheses on

: since Toom's algorithm uses more than three evaluation points, its polynomial analogue requires

to contain sufficiently many points. Similarly, Schönhage–Strassen multiplication is based on “dyadic” FFTs and therefore requires

to be invertible in

. Schönhage subsequently developed a “triadic” analogue of their method that is suitable for polynomial multiplication over fields of characteristic two [48]. The bound

for general (not necessarily commutative) rings

is due to Cantor and Kaltofen [8].

Let us now turn to the bit complexity

of polynomial multiplication over a finite field

with

and where

is prime. Using Kronecker substitution, it is not hard to show that

Now the multiplication of polynomials in

of small degree can be reduced to integer multiplication using Kronecker substitution: the input polynomials are first lifted into polynomials with integers coefficients in

, and then evaluated at

. The desired result can finally be read off from the integer product of these evaluations. If

, this yields

On the other hand, for

, adaptation of the algebraic complexity bound

to the Turing model yields

where the first term corresponds to additions/subtractions in

and the second one to multiplications. Notice that the first term dominates for large

. The combination of (1.4) and (1.5) also implies

For almost forty years, the bound (1.5) remained unbeaten. Since Fürer's algorithm (alike Schönhage–Strassen's first algorithm from [49]) essentially relies on the availability of suitable roots of unity in the base field

, it admits no direct algebraic analogue. In particular, the existence of a Fürer-type bound of the form (1.2) remained open for several years. This problem got first solved in [27] for

, but using an algorithm that is very different from Fürer's algorithm for integer multiplication. Under suitable number theoretic assumptions, it was also shown in the preprint version [24] that one may take

, a result that was achieved subsequently without these assumptions [23].

Let us finally notice that it usually a matter of routine to derive better polynomial multiplication algorithms over

for integers

from better multiplication algorithms over

. These techniques are detailed in [27, sections 8.1–8.3]. Denoting by

the bit complexity of multiplying two polynomials of degree

over

, it is shown there that

uniformly in

, and for

. Exploiting the fact that finite fields only contain a finite number of elements, it is also a matter of routine to derive algebraic complexity bounds for

in the case when

is a ring of finite characteristic. We refer to [27, sections 8.4 and 8.5] for more details.

Related tools.In order to prove Theorems 1.1 and 1.2, we will make use of several other tools from the theory of discrete Fourier transforms (DFTs). First of all, given a ring

and a composite integer

such that the

are mutually coprime, the Chinese remainder theorem gives rise to an isomorphism

This was first observed in [1] and we will call the corresponding conversions CRT transforms. The above isomorphism also allows for the reduction of a univariate DFT of length

to a multivariate DFT of length

. This observation is older and due to Good [15] and independently to Thomas [55].

Another classical tool from the theory of discrete Fourier transforms is Rader reduction [46]: a DFT of prime length

over a ring

can be reduced to the computation of a cyclic convolution of length

, i.e. a product in the ring

. In section 4, we will also present a multivariate version of this reduction. In a similar way, one may use Bluestein's chirp transform [4] to reduce a DFT of general length

over

to a cyclic convolution of the same length. Whereas FFT-multiplication is a technique that reduces polynomial multiplications to the computation of DFTs, Bluestein's algorithm (as well as Rader's algorithm for prime lengths) can be used for reductions in the opposite direction. In particular, Theorem 1.2 implies that a DFT of length

over a finite field

can be computed in time

on a Turing machine.

Nussbaumer polynomial transforms [42, 43] constitute yet another essential tool for our new results. In a similar way as in Schönhage–Strassen's second algorithm from [49], the idea is to use DFTs over a ring

with

. This ring has the property that

is a

-th principal root of unity and that multiplications by powers of

can be computed in linear time. In particular, DFTs of length

can be computed very efficiently using only additions and subtractions. Nussbaumer's important observation is that such transforms are especially interesting for the computation of multidimensional DFTs (in [49] they are only used for univariate DFTs). Now the lengths of these multidimensional DFTs should divide

in each direction. Following a suggestion by Nussbaumer and Quandalle in [44, p. 141], this situation can be forced using Rader reduction and CRT transforms.

1.2.Overview of the new results

In order to prove Theorems 1.1 and 1.2, we make use of a combination of well-known techniques from the theory of fast Fourier transforms. We just surveyed these related tools; more details are provided in sections 2, 3 and 4. Notice that part of section 2 contains material that we adapted from previous papers [25, 27] and included for convenience of the reader. In section 5, the assumption on the existence of small Linnik constants

will also be discussed in more detail, together with a few consequences. Throughout this paper, all computations are done in the deterministic multitape Turing model [16] and execution times are analyzed in terms of bit complexity. The appendix covers technical details about the cost of data rearrangements when using this model.

Integer multiplication.We prove Theorem 1.1 in section 6. The main idea of our new algorithm is to reduce integer multiplication to the computation of multivariate cyclic convolutions in a suitable algebra of the form

Here

is a power of two and

are the first

prime numbers in the arithmetic progression

, where

is another power of two with

. Setting

, we choose

such that there exists a principal

-th root of unity in

that makes it possible to compute products in the algebra

using fast Fourier multiplication. Concerning the dimension

, it turns out that we may take

, although larger dimensions may allow for speed-ups by a constant factor as long as

can be kept small.

Using multivariate Rader transforms, the required discrete Fourier transforms in

essentially reduce to the computation of multivariate cyclic convolutions in the algebra

By construction, we may factor

, where

is an odd number that is small due the hypothesis on the Linnik constant. Since

, we notice that

contains a primitive

-th root of unity. CRT transforms allow us to rewrite the algebra

Now the important observation is that

can be considered as a “fast” principal

-th root of unity in both

and

. This means that products in

can be computed using multivariate Fourier multiplication with the special property that the discrete Fourier transforms become Nussbaumer polynomial transforms. Since

is a power of two, these transforms can be computed in time

. Moreover, for sufficiently small Linnik constants

, the cost of the “inner multiplications” in

can be mastered so that it only marginally contributes to the overall cost.

Polynomial multiplication.In order to prove Theorem 1.2, we proceed in a similar manner; see sections 7 and 8. It suffices to consider the case that

is prime. In section 7, we first focus on ground fields of characteristic

. This time, the ring

needs to be replaced by a suitable extension field

; in particular, we define

. More precisely, we take

, which ensures the existence of primitive

-th roots of unity in

and

The finite field case gives rise to a few additional technical difficulties. First of all, the bit size of an element of

is exponentially larger than the bit size of an element of

. This makes the FFT-approach for multiplying elements in

not efficient enough. Instead, we impose the additional requirement that

be pairwise coprime. This allows us to reduce multiplications in

to univariate multiplications in

, but at the expense of the much stronger hypothesis

A second technical difficulty concerns the computation of a defining polynomial for

(i.e. an irreducible polynomial

of degree

), together with a primitive

-th root of unity

. For our choice of

as a function of

, and thanks to work by Shoup [51, 52], it turns out that both

and

can be precomputed in linear time

The case when

finally comes with its own particular difficulties, so we treat it separately in section 8. Since

is no longer invertible, we cannot use DFT lengths

that are powers of two. Following Schönhage [48], we instead resort to “triadic” FFTs, for which

becomes a power of three. More annoyingly, since

is necessarily even for

, this also means that we now have to take

. In a addition to the multivariate cyclic convolutions of lengths

and

from before, this gives rise to multivariate cyclic convolutions of length

, which need to be handled separately. Although the proof of Theorem 1.2 in the case when

is very similar to the case when

, the above complications lead to subtle changes in the choices of parameters and coefficient rings. For convenience of the reader, we therefore reproduced most of the proof from section 7 in section 8, with the required adaptations.

Variants.The constants

and

in Theorems 1.1 and 1.2 are not optimal: we have rather attempted to keep the proofs as simple as possible. Nevertheless, our proofs do admit two properties that deserve to be highlighted:

We refer to Remarks 6.1 and 7.1 for some ideas on how to optimize the thresholds for acceptable Linnik constants.

Our main algorithms admit several variants. It should for instance be relatively straightforward to use approximate complex arithmetic in section 6 instead of modular arithmetic. Due to the fact that primitive roots of unity in

are particularly simple, the Linnik constants may also become somewhat better, but additional care is needed for the numerical error analysis. Using similar ideas as in [27, section 8], our multiplication algorithm for polynomials over finite fields can also be adapted to polynomials over finite rings and more general effective rings of positive characteristic.

Another interesting question is whether there exist practically efficient variants that might outperform existing implementations. Obviously, this would require further fine-tuning, since the “crazy” constants in our proofs were not motivated by practical applications. Nevertheless, our approach combines several ideas from existing algorithms that have proven to be efficient in practice. In the case of integer multiplication, this makes us more optimistic than for the previous post-Fürer algorithms [12, 13, 11, 25, 10, 18, 19, 21]. In the finite field case, the situation is even better, since similar ideas have already been validated in practice [26, 32].

Applications.Assume that there exist Linnik constants that are sufficiently close to

. Then Theorems 1.1 and 1.2 have many immediate consequences, as many computational problems may be reduced to integer or polynomial multiplication.

For example, Theorem 1.1 implies that the Euclidean division of two

-bit integers can be performed in time

, whereas the gcd can be computed in time

. Many transcendental functions and constants can also be approximated faster. For instance,

binary digits of

can be computed in time

and the result can be converted into decimal notation in time

. For details and more examples, we refer to [7, 31].

In a similar way, Theorem 1.2 implies that the Euclidean division of two polynomials in

with

of degree

can be performed in time

, whereas the extended gcd admits bit-complexity

. Since inversion of a non-zero element of

boils down to an extended gcd computation of degree

over

, this operation admits bit complexity

. The complexities of many operations on formal power series over

can also be expressed in terms of

. We refer to [14, 30] for details and more applications.

2.Univariate arithmetic

2.1.Basic notations

Elements of

and

will be called cyclic polynomials and negacyclic polynomials, respectively. For subsets

, it will be convenient to write

, and extend the above notations likewise. This is typically useful in the case when

for some other ring

and

for some

In our algorithms, we will only consider effective rings

, whose elements can be represented using data structures of a fixed bitsize

and such that algorithms are available for the ring operations. We will always assume that additions and subtractions can be computed in linear time

and we denote by

the bit complexity of multiplication in

. For some fixed invertible integer

, we sometimes need to divide elements of

; we define

to be the cost of such a division (assuming that a pre-inverse of

has been precomputed). If

with

and

prime, then we have

, and

, where

We write

for the bit complexity of multiplying two polynomials in

. We also define

and

. Multiplications in

and

are also called cyclic convolutions and negacyclic convolutions. Since reduction modulo

can be performed using

additions, we clearly have

In this paper we will frequently convert between various representations. Given a computable ring morphism

between two effective rings

and

, we will write

for the cost of applying

to an element of

. If

is an embedding of

into

that is clear from the context, then we will also write

instead of

. If

and

are isomorphic, with morphisms

and

for performing the conversions that are clear from context, then we define

2.2.Discrete Fourier transforms and fast Fourier multiplication

Let

be a ring with identity and let

be such that

for some

. Given a vector

, we may form the polynomial

, and we define the discrete Fourier transform of

with respect to

to be the vector

with

for all

Consider any

-th root of unity

. Given a cyclic polynomial

that is presented as the class of a polynomial

modulo

, the value

does not depend on the choice of

. It is therefore natural to consider

as the evaluation of

. It is now convenient to extend the definition of the discrete Fourier transform and set

. If

is invertible in

, then

becomes a ring isomorphism between the rings

and

. Indeed, for any

and

, we have

. Furthermore, the relation (2.6) provides us with a way to compute the inverse transform

In particular, denoting by

the cost of computing a discrete Fourier transform of length

with respect to

, we get

Since

and

coincide up to a simple permutation, we also notice that

. Setting

for some “privileged” principal

-th root of unity

(that will always be clear from the context), it follows that

A well known application of this isomorphism is to compute cyclic convolutions

for

using the formula

Computing cyclic convolutions in this way is called fast Fourier multiplication.

2.3.The Cooley–Tukey FFT

Let

be a principal

-th root of unity and let

where

. Then

is a principal

-th root of unity and

is a principal

-th root of unity. Moreover, for any

and

, we have

and

are algorithms for computing DFTs of length

and

, then we may use (2.7) to construct an algorithm for computing DFTs of length

as follows.

For each

, the sum inside the brackets corresponds to the

-th coefficient of a DFT of the

-tuple

with respect to

. Evaluating these inner DFTs requires

calls to

. Next, we multiply by the twiddle factors

, at a cost of

operations in

. Finally, for each

, the outer sum corresponds to the

-th coefficient of a DFT of an

-tuple in

with respect to

. These outer DFTs require

calls to

. On a Turing machine, the application of the

inner DFTs also requires us to reorganize

into

consecutive vectors of size

, and vice versa. This can be done in time

using matrix transpositions (see section A.1 in the appendix).

Let

denote the cost of performing a DFT of length

, while assuming that the twiddle factors

have been precomputed. Then computing the DFT using the above Cooley–Tukey algorithm yields the bound

2.4.Nussbaumer polynomial transforms

In the case that

for some effective ring

and power of two

, the indeterminate

is itself a principal

-th root of unity. Moreover, multiplications by powers of

can be computed in linear time

since

for all

and

. For this reason,

is sometimes called a fast principal root of unity.

Now assume that we wish to compute a discrete Fourier transform of size

over

. Then we may use the Cooley–Tukey algorithm from the previous subsection with

, and

. When using our faster algorithm for multiplications by twiddle factors, the term

in (2.8) can be replaced by

. Moreover, discrete Fourier transforms of length

just map pairs

to pairs

, so they can also be computed in linear time

. Altogether, the bound (2.8) thus simplifies into

Discrete Fourier transforms over rings

of the type considered in this subsection are sometimes qualified as Nussbaumer polynomial transforms.

2.5.Triadic Nussbaumer polynomial transforms

is invertible in the effective ring

from the previous subsection, then so is

, and we can compute the inverse DFT of length

using

However, if

has characteristic

(i.e.

), then this is no longer possible, and we need to adapt the method if we want to use it for polynomial multiplication.

Following Schönhage [48], the remedy is to use triadic DFTs. More precisely, assuming that

, we now work over the ring

instead of

, where

It is easily verified that

is a principal

-th root of unity in

. Moreover, multiplications of elements in

with powers of

can still be computed in linear time

. Indeed, given

and

, we have

Multiplications with

(resp.

) can be reduced to this case, by regarding them as two multiplications with

and

(resp. three multiplications with

, and

). Using similar formulas, polynomials in

of degree

can be reduced in linear time

modulo

Now assume that we wish to compute a discrete Fourier transform of size

over

. Discrete Fourier transforms of length

now map triples

to triples

, so they can be computed in linear time

. Using the Cooley–Tukey algorithm with

, and

, similar considerations as in the previous subsection again allows us to conclude that

We will qualify discrete Fourier transforms over rings

of this type as triadic Nussbaumer polynomial transforms.

2.6.Bluestein's chirp transform

We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs — of arbitrary length, not necessarily a power of two — to polynomial multiplication [4], as follows.

Let

be a principal

-th root of unity. If

is even, then we assume that there exists some

with

. If

is odd, then we take

, so that

and

. Consider the sequences

Now let

and

modulo

. Then (2.9) implies that

for all

. In other words, the computation of a DFT of even length

reduces to a cyclic convolution product of the same length, together with

additional operations in

Notice that the polynomial

is fixed and independent of

in this product. The

(and

) can be computed in time

via the identity

3.Multivariate arithmetic

3.1.Tensor products

It is well known that multivariate Fourier transforms can be interpreted as tensor products of univariate Fourier transforms. Since we will also need another type of multivariate “Rader transform” in the next section, it will be useful to recall this point of view. We refer to [37, Chapter XVI] for an abstract introduction to tensor products.

Let

be a commutative ring and let

and

be two

-modules. The tensor product

and

is an

-module together with a bilinear mapping

. This bilinear mapping satisfies the universal property that for any other

-module

together with a bilinear mapping

, there exists a unique linear mapping

with

. Up to natural isomorphisms, it can be shown that the tensor product operation

is associative and distributive with respect to the direct sum operation

Besides modules, it is also possible to tensor linear mappings. More precisely, given two linear mappings

and

, the mapping that sends

is clearly bilinear. This means that there exists a unique mapping

such that

for all

. We call

the tensor product of

and

Assume now that

and

are free

-modules of finite rank, say

and

. Then

is again a free

-module that can be identified with the set

of bidimensional arrays of size

and with coefficients in

. The tensor product of two vectors

and

is given by the array

with entries

Let

and

be two more free

-modules of finite rank. Assume that we are given two linear mappings

and

that can be computed by an algorithm. Then

can be computed as follows. Given

, we first apply

to each of the rows

. This yields a new array

with

for all

. We next apply

to each of the columns

. This yields an array

with

for all

. We claim that

. Indeed, if

, then

and

, whence the claim follows by linearity.

More generally, given

linear mappings

, a similar complexity analysis as in the case of the Cooley–Tukey algorithm yields

3.2.Vector notation and multivariate polynomials

For working with multivariate polynomials, it is convenient to introduce a few notations. Let

and

-tuples in

. We define

Notice that

. Elements in

can be regarded as

arrays. On a Turing tape, we recursively represent them as vectors of size

with entries in

The first tensor product is understood in the sense of

-modules and we regard

as a subset of

. The second tensor product is understood in the sense of

-algebras. Any polynomial

(or cyclic polynomial in

) can uniquely be written as a sum

where

is the vector of coefficients of

. We write

for the complexity of multiplying two polynomials in

and also define

3.3.Multivariate Fourier transforms

Assume now that

is such that

is a principal

-th root of unity for

and invertible positive integers

. As in the univariate case, cyclic polynomials

can naturally be evaluated at points of the form

. We now define the discrete Fourier transform of

to be the vector

with

for all

. Multivariate Fourier transforms again provide us with an isomorphism between

and

. Writing

for the cost to compute such a transform with respect to a most suitable

, we thus have

We claim that multivariate Fourier transforms can be regarded as the tensor product of univariate Fourier transforms with respect to each of the variables, that is,

where

for each

, and

is as above. The relation (3.5) follows from (3.6) by linearity, since the

of the form

span

In particular, in order to compute a multivariate Fourier transform, it suffices to compute univariate Fourier transforms with respect to each of the variables, as described in section 3.1. It follows that

The technique of FFT-multiplication from section 2.2 also naturally generalizes to multivariate polynomials in

. This yields the bound

3.4.CRT transforms

Let

be a tuple of pairwise coprime numbers and

. The Chinese remainder theorem provides us with an isomorphism of abelian groups

that boils down to a permutation of coefficients when using the usual power bases for

and

. For a suitable choice of

, we show in section A.3 of the appendix that this permutation can be computed efficiently on Turing machines, which yields

We will also need a multivariate generalization of this formula: let

and

be such that

and

are coprime for

. Then

3.5.Negacyclic CRT transforms

Assume now that

and let

, where

and

are coprime and

is odd. Then we have isomorphisms of

-algebras

Given

, the map

determines an isomorphism of abelian groups between

and

. As explained in the previous subsection, this gives rise to an isomorphism of

-algebras

In view of Lemma A.4, this isomorphism can be computed efficiently. Moreover,

, since

is odd and

. This yields a natural isomorphism

-algebras such that the following diagram commutes:

The projection

admits an obvious right inverse that sends a polynomial

modulo

. This lift can clearly be computed in linear time, so we again obtain

Alternatively, one may compute

in a similar way as the isomorphism of

-algebras between

from the previous subsection, while inverting signs whenever appropriate.

3.6.Tricyclic CRT transforms

In section 8, we will also need a triadic variant of the negacyclic CRT transforms from the previous subsection. More precisely, assume that

, where

and

are coprime,

is even, and

is not divisible by

. Then we have isomorphisms of

-algebras

Given

, the map

determines an isomorphism of abelian groups between

and

. In a similar way as in the previous subsection, this leads to an isomorphism of

-algebras

in which all maps and their right inverses can be computed efficiently. This again yields

We will qualify this kind of conversions between

and

as tricyclic CRT transforms.

4.Rader reduction

4.1.Univariate Rader reduction

Let

be a ring and let

be a principal

-th root of unity for some prime number

. The multiplicative group

is cyclic, of order

. Let

be such that

is a generator of this cyclic group. Given a polynomial

and

with

, we have

In summary, we have reduced the computation of a discrete Fourier transform of length

to a cyclic convolution of length

and

additions in

. This reduction is due to Rader [46], except for the minor twist that we subtract

in the definition of

. As far as we are aware, this twist is essentially due to Nussbaumer [43, section 5.2.2]. In the next subsection, it allows us to compute Rader transforms without increasing the dimension from

4.2.Univariate Rader transforms

Let us now show how to present Rader reduction using suitable linear transforms, called left, right and inverse Rader transforms (LRT, RRT and IRT). These transforms depend on the choice of

; from now on, we assume that

has been fixed once and for all. Given

and

, let

and

be as in section 4.1. Setting

The linear mappings

and

boil down to permutations and a linear number of additions and subtractions. Now permutations can be computed efficiently on Turing machines, as recalled in section A.1 of the appendix. More precisely, since

is a principal root of unity of order

, we necessarily have

. Then the permutation of

coefficients in

of bitsize

can be done using a merge sort in time

, whence

In this bound, the cost

corresponds to the trivial multiplication

. We included this term in anticipation of the multivariate generalization (4.3) below. We also recall that we assumed

to be precomputed. This precomputation can be done in time

: we first compute the powers

for

and then permute them appropriately.

4.3.Multivariate Rader transforms

Assume now that

are prime numbers, that we have fixed generators

for the cyclic groups

, and that

are such that

is a principal

-th root of unity for

. Let

and

. We define

Using the distributivity of

with respect to

, the algebra

is a direct sum of algebras of cyclic multivariate polynomials. More precisely, let

. For each index

, let

be such that

and

for

. Then

Elements in

correspond to sums

with

. Assuming the lexicographical ordering on

, such sums are represented on a Turing tape by concatenating the representations of the components

: see section A.5 in the appendix for details. We have

As in the case of (4.1), this bound does not include the cost of the precomputation of

. This precomputation can be performed in time

using the same “list and sort” technique as in the univariate case.

5.Prime numbers in arithmetic progressions

5.1.The smallest prime number in an arithmetic progression

Let

be the set of all prime numbers. Given two integers

and

with

, we define

and

A constant

with the property that

is called a Linnik constant (notice that we allow the implicit constants

and

with

for

to depend on

The existence of a Linnik constant

was shown by Linnik [39, 40]. The smallest currently known value

is due to Xylouris [57]. Assuming the Generalized Riemann Hypothesis (GRH), it is known that any

is a Linnik constant [28]. Unfortunately, these bounds are still too weak for our purposes.

Both heuristic probabilistic arguments and numerical evidence indicate that

satisfies the much sharper bound

, where

stands for Euler's totient function. More precisely, based on such evidence [38], Li, Pratt and Shakan conjecture that

Unfortunately, a proof of this conjecture currently seems to be out of reach. At any rate, this conjecture is much stronger than the assumption that

in Theorem 1.2.

5.2.Multiple prime numbers in arithmetic progressions

Given integers

, and

with

, let

denote the

-th prime number in the arithmetic progression

, so that

. Also define

Remark. Note that the implied big-Oh constant depends on

; a similar remark applies to Lemma 5.4 below.

Proof. Let

be an absolute constant with

for all

. Let

be given and let

be a fixed remainder class with

Let

be the smallest prime number such that

and

. The prime number theorem implies that

. For each

, let

be such that

and

. Such an integer exists since

and

are coprime. Notice also that

and

are coprime. Now consider the prime numbers

for

. By construction, we have

, so the numbers

are pairwise distinct. We also have

and

, whence

Remark 5.2. For numbers

with a bounded number of prime factors, we actually have

in the proof of Lemma 5.1, which leads to the sharper bound

. It is possible that the bound

always holds, but we were unable to prove this so far. If

is no longer fixed, then it is not hard to see that

, which yields

. Concerning a generalization of the Li–Pratt–Shakan conjecture, it is likely that there exist constants

and

with

It is actually even likely that

and

. At any rate, it would be interesting to empirically determine

and

via numerical experiments.

5.3.Forcing coprime multipliers

When searching for multiple prime numbers

in an arithmetic progression, as in the previous subsection, it will be useful in sections 7 and 8 to insist that the multipliers

be pairwise coprime. The simplest way to force this is to define the

by induction for

using

(In Lemma 5.3 below, we will show that such a value of

exists for all

, provided that

is even.) We then define

and

Proof. Modulo the replacement of

, we may assume without loss of generality that

. First we show that for each prime

, there exists an integer

such that

and

. Indeed, if

then

so we may take

; if

, then

, so at least one

satisfies

and

. Now, using the Chinese remainder theorem, we may construct an integer

such that

and

for all

. Then we have

and

. Since

, we also must have

Now

, since

and

. It follows that

is well defined. Let

be the integer with

and take

, so that

. We conclude by observing that

, whence

Remark 5.5. Our bound for

seems overly pessimistic. It is possible that a bound

always holds, but we were unable to prove this so far. It would also be interesting to investigate counterparts of the Li–Pratt–Shakan conjecture.

6.Proof of Theorem 1.1

Step 0: setting up the parameters. Assume that we wish to multiply two

-bit integers. For sufficiently large

and suitable parameters

, we will reduce this problem to a multiplication problem in

. If

, then any traditional multiplication algorithm can be used. The threshold

should be chosen such that various asymptotic relations encountered in the proof below hold for all

. The other parameters are chosen as follows:

We claim that

and the rest of these parameters can be computed in linear time

. Recall from [2] that we can test whether a

-bit integer is prime in time

. We clearly have

, whence

for

, by Lemma 5.1. For sufficiently large

, this means that

The brute force computation of

requires

primality tests of total cost

. From

, we also deduce

and

. In particular,

; for

sufficiently large, we thus get

In order to compute

, we need to perform

primality tests of total cost

. This completes the proof of our claim.

Step 1: Kronecker segmentation. For sufficiently large

, the last inequality implies that

. In order to multiply two

-bit integers

and

, we expand them in base

, i.e.

Let

be the reductions of these polynomials modulo

and

. Since

and

for

, we can read off the product

from the product

, after which

. Computing

from

and

clearly takes linear time

, and so does the retrieval of

and

from

. In other words, we have shown that

Step 2: CRT transforms. At a second stage, we use negacyclic CRT transforms (section 3.5) followed by traditional CRT transforms (section 3.4) in order to reduce multiplication in

to multiplication in

, where

, and

The computation of a generator of the cyclic group

can be done in time

, by [53]. The lift of the

-th power of this generator to

can be computed in time

and yields a primitive

-th root of unity

. We perform the multivariate cyclic convolutions in

using fast Fourier multiplication, where the discrete Fourier transforms are computed with respect to

. In view of (3.8), this leads to

using Schönhage–Strassen's algorithm for integer multiplication. Now we observe that

and

, whence

, again using Schönhage–Strassen multiplication. Performing multiplications in

using Kronecker multiplication, it also follows that

Step 3: multivariate Rader reduction. We compute the multivariate discrete Fourier transforms using Rader reduction. For each

, let

be such that

and

for

. Then (4.3) implies

The necessary precomputations for each of the

Rader reductions can be performed over

instead of

. Using Schönhage–Strassen multiplication, this can be done in time

Now for any

, we have

. Performing multiplications in

using

-fold Kronecker substitution and Cantor–Kaltofen's variant of Schönhage–Strassen multiplication over

, this yields

Step 4: second CRT transforms. For

, we may decompose

, where

is odd. Using multivariate CRT transforms (Corollary A.8), this yields

Step 5: Nussbaumer polynomial transforms. Since

, we have a fast root of unity

of order

and in

. We may thus compute polycyclic convolutions of order

over

using fast Fourier multiplication, with Nussbaumer polynomial transforms in the role of discrete Fourier transforms. From (3.10), it follows that

Here we used the bound

, which is shown in a similar way as above. Modulo a change of variables

where

is a principal

-th root of unity in

, negacyclic convolutions in

reduce to cyclic convolutions in

. Combining this with the above relations, this yields

Step 6: residual cyclic convolutions. By construction,

contains principal roots of unity of orders

, which can again be computed in time

using a similar method as above. We may thus compute cyclic convolutions of order

over

using fast Fourier multiplication. Since

and

, we obtain

Using Bluestein reduction (2.10), the univariate discrete Fourier transforms can be transformed back into cyclic products:

We perform the cyclic products using Kronecker substitution. Since

is increasing (by definition), this yields

Step 7: recurse. Combining the relations (6.1), (6.2), (6.3), (6.4) and (6.5), while using the fact that

for sufficiently large

, we obtain

By Lemma 5.1, we have

for all

. For sufficiently large

, this means that

, whence

and

. Using that

, the latest bound on

therefore further simplifies into

for some suitable universal constant

. For sufficiently large

, using our assumptions

, we have

and

, whence

Let us show by induction over

that this implies

for all

. This is clear for

, so assume that

. Then the above inequality yields

Remark 6.1. In our proof, we have not attempted to optimize the constant

. With a bit more work, it should be possible to achieve any constant larger than

, as follows:

We finally recall that we opted to present our algorithm for number theoretic FFTs. When switching to approximate complex FFTs, one advantage is that one may work more extensively with primitive roots of power of two orders. This should also allow for the use of Crandall–Fagin's reduction (in a similar way as in [25]) to further release the assumption on the Linnik constant to

7.Proof of Theorem 1.2 in characteristic

Recall that elements of the finite field

are represented as remainder classes of polynomials in

modulo

, for some given monic irreducible polynomial

of degree

. Using Kronecker substitution and segmentation, one has the following well known complexity bounds [27, section 3]:

In order to establish Theorem 1.2, it therefore suffices to consider the case when

is a prime number. In this section, we first prove the theorem in the case when

. The case

is treated in the next section. If

, then it is well known that

, again using Kronecker substitution. In this particular case, Theorem 1.2 therefore follows from Theorem 1.1. Let

be a Linnik constant with

. Our algorithm proceeds as follows.

Step 0: setting up the parameters. Assume that we wish to multiply two polynomials of degree

. For

with

and suitable parameters

and

, we will reduce this problem to a multiplication problem in

. If

, then we use Kronecker multiplication. The factor

is a sufficiently large absolute constant such that various asymptotic inequalities encountered during the proof hold for all

. The other parameters are determined as follows:

and

is a constant. As in the previous section, the prime numbers

and the rest of the parameters can be computed in linear time

. Let us proceed with a few useful bounds that hold for sufficiently large

. From the definitions of

and

, we clearly have

Step 0, continued: setting up the FFT field. Notice that

. Elements of the field

will be represented as polynomials in

modulo

, where

is a monic irreducible polynomial of degree

. By [51, Theorem 3.2], such a polynomial can be computed in time

, where we used the assumption that

Setting

, we claim that the field

admits a primitive

-th root of unity

, that can also be computed in time

. Indeed,

divides

for each

and

, whence

, so that

. We compute

by brute force: for all polynomials

, starting with those of smallest degree, we check whether

is a primitive

-th root. By [52, Theorem 1.1,

], this procedure succeeds after at most

checks. In order to check whether a given

is a primitive

-th root of unity, it suffices to verify that

for each

, which can be done in time

using binary powering. The total time to compute a suitable

is therefore bounded by

, using our assumption that

Step 1: Kronecker segmentation. Recall that

. In order to multiply two polynomials

of degree

, we cut them in chunks

of degree

, i.e.

Let

be the reductions of these polynomials modulo

and

. Since

and

, we can read off the product

from the product

, after which

. Computing

from

and

clearly takes linear time

, and so does the retrieval of

and

from

. In other words, we have shown that

Step 2: CRT transforms. At a second stage, we use negacyclic CRT transforms followed by traditional CRT transforms in order to reduce multiplication in

to multiplication in

, where

, and

We perform the multivariate cyclic convolutions in

using fast Fourier multiplication, where the discrete Fourier transforms are computed with respect to

. In view of (3.8) and

The reduction of a polynomial of degree

modulo

using Barrett's algorithm [3] essentially requires two polynomial multiplications in

of degree

, modulo the precomputation of a preinverse in time

. Performing multiplications in

using Kronecker substitution, we thus have

Step 3: multivariate Rader reduction. We compute the multivariate discrete Fourier transforms using Rader reduction. For each

, let

be such that

and

for

. Then

The necessary precomputations for each of the

Rader reductions can be performed over

instead of

, in time

Now for any

, we have

. Performing multiplications in

using

-fold Kronecker substitution and (1.6), this yields

Step 4: second CRT transforms. For

, we may decompose

, where

is odd. Using multivariate CRT transforms, this yields

Step 5: Nussbaumer polynomial transforms. Since

, we have a fast root of unity

of order

and in

. We may thus compute polycyclic convolutions of order

over

using fast Fourier multiplication, with Nussbaumer polynomial transforms in the role of discrete Fourier transforms:

Step 6: residual cyclic convolutions. This time, we convert the residual multivariate cyclic convolutions of length

back into univariate cyclic convolutions of length

using CRT transforms the other way around. This is possible since

are pairwise coprime, and we get

We perform the univariate cyclic products using Kronecker substitution. Since reductions of polynomials in

modulo

can be done in time

, this yields

Step 7: recurse. Combining the relations (7.1), (7.2), (7.3), (7.4) and (7.5), while using the fact that

, we obtain

where we used the fact that

. Recall that

and

for

, whence

, whereas

. Consequently, for large

, we have

In a similar way as in the previous section, this implies

for all

. Now recall that Kronecker substitution and Theorem 1.1 imply

for all

, whence

. We conclude that

Remark 7.1. The hypothesis

is very pessimistic and mainly due to the way we reduce the residual multivariate cyclic convolutions back to univariate ones. Our assumption that the

be pairwise coprime clearly makes this back-reduction very easy. Nevertheless, with more work, we expect that this assumption can actually be released. We intend to come back to this issue in a separate paper on the computation of multivariate cyclic convolution products.

8.Proof of Theorem 1.2 in characteristic

The proof of Theorem 1.2 in the case when

is similar to the one from the previous section, with two important changes:

We will also use the abbreviations

, and so on. Throughout this section, we assume that

is a Linnik constant with

. Our algorithm proceeds as follows.

Step 0: setting up the parameters. Let

and assume that we wish to multiply two polynomials of degree

. For

with

and suitable parameters

and

, we will reduce this problem to a multiplication problem in

. If

, then we use a naive multiplication algorithm. The threshold

is a sufficiently large absolute constant such that various asymptotic inequalities encountered during the proof hold for all

. The other parameters are determined as follows:

and

is a constant. As before, the prime numbers

and the rest of the parameters can be computed in linear time

. Let us proceed with a few useful bounds that hold for sufficiently large

. From the definitions of

and

, we clearly have

Step 0, continued: setting up the FFT field. As in the previous section, setting

, the defining polynomial

for

and a primitive

-th root of unity

can both be computed in time

Step 1: Kronecker segmentation. Recall that

. In order to multiply two polynomials

of degree

, we cut them in chunks

of degree

, i.e.

Let

be the reductions of these polynomials modulo

and

. Since

and

, we can read off the product

from the product

, after which

. Computing

from

and

clearly takes linear time

, and so does the retrieval of

and

from

. In other words, we have shown that

Step 2: CRT transforms. At a second stage, we use tricyclic CRT transforms followed by traditional CRT transforms in order to reduce multiplication in

to multiplication in

, where

, and

We perform the multivariate cyclic convolutions in

using fast Fourier multiplication, where the discrete Fourier transforms are computed with respect to

. Since the only possible divisions of elements in

by integers are divisions by one, they come for free, so (3.8) yields

The reduction of a polynomial of degree

modulo

using Barrett's algorithm [3] essentially requires two polynomial multiplications in

of degree

, modulo the precomputation of a preinverse in time

. Performing multiplications in

using Kronecker substitution, we thus have

Step 3: multivariate Rader reduction. This step is essentially the same as in section 7; setting

, we obtain the bound

Step 4: second CRT transforms. For

, we may decompose

, where

is coprime with

. Using multivariate CRT transforms, this yields

Step 5: Nussbaumer polynomial transforms. Since

, we have a fast root of unity

of order

and in

. We may thus compute polycyclic convolutions of order

over

using fast Fourier multiplication, with triadic Nussbaumer polynomial transforms in the role of discrete Fourier transforms. Using that divisions by integers are again for free in

, this yields

Step 6: residual cyclic convolutions. Using CRT transforms, the residual multivariate cyclic convolutions of length

are successively transformed into multivariate cyclic convolutions of lengths

and then

. Using that

are pairwise coprime, we get

We perform the univariate cyclic products using Kronecker substitution. Since reductions of polynomials in

modulo

can be done in time

, this yields

Step 7: recurse. Combining the relations (8.1), (8.2), (8.3), (8.4) and (8.5), while using the fact that

, we obtain

Remark 8.1. Since it suffices to consider the case when

is bounded in our proof, we can pay ourselves the luxury to use a naive algorithm for polynomial multiplications over

of bounded degree

. It is an interesting question how to do such multiplications more efficiently. First of all, we notice that

via the the change of variables

. The bound (3.2) implies that this change of variables can be performed in time

. In [50], Schost designed an efficient algorithm for multiplication in the truncated power series ring

. The idea is to embed this ring inside

and then to use multipoint evaluation and interpolation at

. It turns out that it suffices to compute with series of precision

in order to obtain the desired product. Altogether, this leads to an algorithm of bit complexity

9.Variants of Theorem 1.2

There are several variants and generalizations of Theorem 1.2 along similar lines as the results in [27, section 8]. First of all, we have the following:

Until now, we have systematically worked in the bit complexity model for implementations on a multitape Turing machine. But it is a matter of routine to adapt Theorem 1.2 to other complexity models. In the straight-line program (SLP) model, the most straightforward adaptation of [27, Theorem 8.4] goes as follows:

Proof. With respect to the proof of [27, Theorem 8.4], the only non-trivial thing to check concerns the number of non-scalar multiplications. These are due to the innermost multiplications of our recursive FFT-multiplications, when unrolling the recursion. Assume first that

. Then one level of our recursion reduces the multiplication of two polynomials in

multiplications in

, whence to

multiplications in

. By construction, we have

and

. Each recursive step therefore expands the data size by a factor at most

while passing from degree

to degree roughly

. The algorithm therefore requires

recursive steps and

non-scalar multiplications. In the case when

, the reasoning is similar: this time, the data size is expanded by a factor at most

at each recursive step (since

), whereas the degree in

descends from

to at most

. This leads to the announced complexity bound.

We notice that the number of non-scalar multiplications is far larger than

, as for the algorithm from [27, Theorem 8.4]. Ideally speaking, we would like to take

exponentially smaller than

, which requires a sharper bound on

than the one from Lemma 5.4. Assuming that this is possible (modulo stronger number theoretic assumptions), the number of recursive steps would go down to

, and the number of non-scalar multiplications to

. With more work, it is plausible that the constant

can be further reduced.

It is also interesting to observe that the bulk of the other

-vector space operations are additions and subtractions: for some constant

with

, the algorithm only uses

scalar multiplications. In a similar way as above, this number goes down to

modulo stronger number theoretic assumptions. For bounded

, we also notice that scalar multiplications theoretically reduce to

additions.

Appendix A.Turing machine implementations

Recall that all computations in this paper are done in the deterministic multitape Turing model [16]. In this appendix, we briefly review the cost of various basic types of data rearrangements when working in this model.

A.1.Arrays and sorting

Let

be a data type whose instances take

bits. An

array

of elements in

is stored as a linear array of

bits. We generally assume that the elements are ordered lexicographically by

What is significant from a complexity point of view is that occasionally we must switch representations, to access an array (say 2-dimensional) by “rows” or by “columns”. In the Turing model, we may transpose an

matrix of elements in

in time

using the algorithm of [5, appendix]. Briefly, the idea is to split the matrix into two halves along the “short” dimension, and transpose each half recursively.

We will also require more complex rearrangements of data, for which we resort to sorting. Suppose that we can compare elements of

in linear time

. Then an array of

elements of

may be sorted in time

using merge sort [36], which can be implemented efficiently on a Turing machine.

Let us now consider a general permutation

. We assume that

is represented by the array

, which takes space

on a Turing tape. Given a general data type

, the permutation naturally acts on arrays of length with entries in

. As long as

, this action can be computed in time

, by tagging the

-th entry of the array by

, sorting the tagged entries, and then removing the tags.

Unfortunately, for some of the data rearrangements in this paper, we do not have

. Nevertheless, certain special kinds of permutations can still be performed in time

. Transpositions are one important example. In the next subsections, we will consider a few other examples.

A.2.Basic multivariate rewritings

Proof. Let

be an absolute constant with

for all

. Let us prove by induction on

that

. For

, the result is clear, so assume that

. Let

be the mapping with

and

for

. Setting

and

, the natural isomorphism between

and

corresponds to the transposition of an

matrix with coefficients in

. Since conversions between

and

amount to

such transpositions, it follows that

A.3.CRT transforms

We start with two lemmas from [23, section 2.4]. For convenience we reproduce the proofs here.

denote the homomorphism that maps

, and acts as the identity on

. Suppose that we wish to compute

for some input polynomial

Interpreting the list

as an

array, the

-th entry corresponds to

. After transposing the array, which costs

bit operations, we have an

array, whose

-th entry is

. Now for each

, cyclically permute the

-th row by

slots; altogether this uses only

bit operations. The result is an

array whose

-th entry is

, which is exactly the coefficient of

the

-th of which may be computed in

bit operations for some universal constant

. The overall cost is

bit operations.

A.4.Multivariate CRT transforms

Proof. We prove the assertion by induction over

. For

, the result follows from Lemma A.5, so assume that

. Denote

, and

. Lemma A.5 yields

, whence

Given

, we recall that the classification of finite abelian groups implies the existence of a unique tuple

with

, and

Proof. Let

and

be the tuples obtained when performing the entry-wise prime power factorizations of

and

as in Lemma A.6. From Lemma A.6 it follows that

and

. Since

is a permutation of

, the result follows from Lemma A.1.

A.5.Tensor products of direct sums

Consider an

-algebra of the form

, where

is finite and totally ordered

. Elements

are of the form

with

, so we may naturally represent them on a Turing tape by concatenating the representations of

. In particular, if each

is an

-algebra of dimension

whose elements are represented by vectors of size

with entries in

, then elements in

are simply represented as vectors of size

with entries in

Now consider

such algebras

and let us form their tensor product

. Let

be the dimension of the

-subalgebra

for

and

. Then elements in

are recursively represented as vectors of size

with entries in

. Using the distributivity of

over

, we have a natural isomorphism

When endowing the index set

with the lexicographical ordering, this isomorphism requires some reordering of data on Turing tapes, but we can compute it reasonably efficiently as follows:

Proof. Let us prove the assertion by induction on

. For

, we have nothing to do, so assume that

. We may consider an element

as a vector of size

with coefficients in

. For each

, we may compute the projection

in time

. Doing this for all

entries of

, we obtain an element

This computation takes a time

. Using the induction hypothesis, we can convert

into an element

in time

. Doing this in order for all

and concatenating the resulting

, we obtain the conversion of

as an element of

. The total computation time is bounded by

The conversion in the opposite direction can be computed in the same time by reversing the algorithm.

Bibliography

[1]: R. Agarwal and J. Cooley. New algorithms for digital convolution. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(5):392–410, 1977.
[2]: M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Annals of Math., 160(2):781–793, 2004.
[3]: P. Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In A. M. Odlyzko, editor, Advances in Cryptology — CRYPTO' 86, pages 311–323. Berlin, Heidelberg, 1987. Springer Berlin Heidelberg.
[4]: L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
[5]: A. Bostan, P. Gaudry, and É. Schost. Linear recurrences with polynomial coefficients and application to integer factorization and Cartier-Manin operator. SIAM J. Comput., 36:1777–1806, 2007.
[6]: C. B. Boyer. A History of Mathematics. Princeton Univ. Press, First paperback edition, 1985.
[7]: R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010.
[8]: D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
[9]: J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
[10]: S. Covanov and E. Thomé. Fast integer multiplication using generalized Fermat primes. Math. Comp., 88:1449–1477, 2019.
[11]: A. De, P. P. Kurur, C. Saha, and R. Saptharishi. Fast integer multiplication using modular arithmetic. SIAM J. Comput., 42(2):685–699, 2013.
[12]: M. Fürer. Faster integer multiplication. In Proceedings of the Thirty-Ninth ACM Symposium on Theory of Computing, STOC 2007, pages 57–66. New York, NY, USA, 2007. ACM Press.
[13]: M. Fürer. Faster integer multiplication. SIAM J. Comput., 39(3):979–1005, 2009.
[14]: J. von zur Gathen and J. Gerhard. Modern computer algebra. Cambridge University Press, New York, 3rd edition, 2013.
[15]: I. J. Good. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society, Series B. 20(2):361–372, 1958.
[16]: C. H.Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[17]: D. Harvey. Faster arithmetic for number-theoretic transforms. J. Symbolic Comput., 60:113–119, 2014.
[18]: D. Harvey. Faster truncated integer multiplication. https://arxiv.org/abs/1703.00640, 2017.
[19]: D. Harvey and J. van der Hoeven. Faster integer and polynomial multiplication using cyclotomic coefficient rings. Technical Report, ArXiv, 2017. http://arxiv.org/abs/1712.03693.
[20]: D. Harvey and J. van der Hoeven. Faster integer multiplication using plain vanilla FFT primes. Math. Comp., 88(315):501–514, 2019.
[21]: D. Harvey and J. van der Hoeven. Faster integer multiplication using short lattice vectors. In R. Scheidler and J. Sorenson, editors, Proc. of the 13-th Algorithmic Number Theory Symposium, Open Book Series 2, pages 293–310. Mathematical Sciences Publishes, Berkeley, 2019.
[22]: D. Harvey and J. van der Hoeven. Integer multiplication in time . Technical Report, HAL, 2019. http://hal.archives-ouvertes.fr/hal-02070778.
[23]: D. Harvey and J. van der Hoeven. Faster polynomial multiplication over finite fields using cyclotomic coefficient rings. Shortened version of [19]. Accepted for publication in J. of Complexity.
[24]: D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. Technical Report, ArXiv, 2014. http://arxiv.org/abs/1407.3361. Preprint version of [27].
[25]: D. Harvey, J. van der Hoeven, and G. Lecerf. Even faster integer multiplication. Journal of Complexity, 36:1–30, 2016.
[26]: D. Harvey, J. van der Hoeven, and G. Lecerf. Fast polynomial multiplication over . In Proc. ISSAC '16, pages 255–262. New York, NY, USA, 2016. ACM.
[27]: D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. J. ACM, 63(6), 2017. Article 52.
[28]: D. R. Heath-Brown. Zero-free regions for Dirichlet L-functions, and the least prime in an arithmetic progression. Proc. London Math. Soc., 64(3):265–338, 1992.
[29]: M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. Arch. Hist. Exact Sci., 34(3):265–277, 1985.
[30]: J. van der Hoeven. Relax, but don't be too lazy. JSC, 34:479–542, 2002.
[31]: J. van der Hoeven. Faster Chinese remaindering. Technical Report, HAL, 2016. http://hal.archives-ouvertes.fr/hal-01403810.
[32]: J. van der Hoeven, R. Larrieu, and G. Lecerf. Implementing fast carryless multiplication. In J. Blömer, I. S. Kotsireas, T. Kutsia, and D. E. Simos, editors, Proc. MACIS 2017, Vienna, Austria, Lect. Notes in Computer Science, pages 121–136. Cham, 2017. Springer International Publishing.
[33]: A. A. Karatsuba. The complexity of computations. Proc. of the Steklov Inst. of Math., 211:169–183, 1995. English translation; Russian original at pages 186–202.
[34]: A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
[35]: D. E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, 1969.
[36]: D. E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1998.
[37]: S. Lang. Algebra. Springer Verlag, 2002.
[38]: J. Li, K. Pratt, and G. Shakan. A lower bound for the least prime in an arithmetic progression. The Quarterly Journal of Mathematics, 68(3):729–758, 2017.
[39]: Yu. V. Linnik. On the least prime in an arithmetic progression I. The basic theorem. Rec. Math. (Mat. Sbornik) N.S., 15(57):139–178, 1944.
[40]: Yu. V. Linnik. On the least prime in an arithmetic progression II. The Deuring-Heilbronn phenomenon. Rec. Math. (Mat. Sbornik) N.S., 15(57):347–368, 1944.
[41]: O. Neugebauer. The Exact Sciences in Antiquity. Brown Univ. Press, Providence, R.I., 1957.
[42]: H. J. Nussbaumer. Fast polynomial transform algorithms for digital convolution. IEEE Trans. Acoust. Speech Signal Process., 28(2):205–215, 1980.
[43]: H. J. Nussbaumer. Fast Fourier Transforms and Convolution Algorithms. Springer-Verlag, 2nd edition, 1982.
[44]: H. J. Nussbaumer and P. Quandalle. Computation of convolutions and discrete Fourier transforms by polynomial transforms. IBM J. Res. Develop., 22(2):134–144, 1978.
[45]: J. M. Pollard. The fast Fourier transform in a finite field. Mathematics of Computation, 25(114):365–374, 1971.
[46]: C. M. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56:1107–1108, 1968.
[47]: A. Schönhage. Multiplikation großer Zahlen. Computing, 1(3):182–196, 1966.
[48]: A. Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. Acta Infor., 7:395–398, 1977.
[49]: A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[50]: É. Schost. Multivariate power series multiplication. In Manuel Kauers, editor, ISSAC '05, pages 293–300. New York, NY, USA, 2005. ACM Press.
[51]: V. Shoup. New algorithms for finding irreducible polynomials over finite fields. Math. Comp., 189:435–447, 1990.
[52]: V. Shoup. Searching for primitive roots in finite fields. Math. Comp., 58:369–380, 1992.
[53]: I. Shparlinski. On finding primitive roots in finite fields. Theoret. Comput. Sci., 157(2):273–275, 1996.
[54]: D. E. Smith. History of Mathematics, volume 2. Dover, 1958.
[55]: L. H. Thomas. Using computers to solve problems in physics. Applications of digital computers, 458:42–57, 1963.
[56]: A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.
[57]: T. Xylouris. On the least prime in an arithmetic progression and estimates for the zeros of Dirichlet L-functions. Acta Arith., 1:65–91, 2011.

			(1.3)

			(2.1)
			(2.2)

			(2.3)
			(2.4)

			(3.3)
			(3.4)

1.Introduction

1.1.Brief history and related work

1.2.Overview of the new results

2.Univariate arithmetic

2.1.Basic notations

2.2.Discrete Fourier transforms and fast Fourier multiplication

2.3.The Cooley–Tukey FFT

2.4.Nussbaumer polynomial transforms

2.5.Triadic Nussbaumer polynomial transforms

2.6.Bluestein's chirp transform

3.Multivariate arithmetic

3.1.Tensor products

3.2.Vector notation and multivariate polynomials

3.3.Multivariate Fourier transforms

3.4.CRT transforms

3.5.Negacyclic CRT transforms

3.6.Tricyclic CRT transforms

4.Rader reduction

4.1.Univariate Rader reduction

4.2.Univariate Rader transforms

4.3.Multivariate Rader transforms

5.Prime numbers in arithmetic progressions

5.1.The smallest prime number in an arithmetic progression

5.2.Multiple prime numbers in arithmetic progressions

5.3.Forcing coprime multipliers

6.Proof of Theorem 1.1

7.Proof of Theorem 1.2 in characteristic

8.Proof of Theorem 1.2 in characteristic

9.Variants of Theorem 1.2

Appendix A.Turing machine implementations

A.1.Arrays and sorting

A.2.Basic multivariate rewritings

A.3.CRT transforms

A.4.Multivariate CRT transforms

A.5.Tensor products of direct sums

Bibliography