Polynomial multiplication over finite fieldsin time

 David Harvey

 School of Mathematics and Statistics University of New South Wales Sydney NSW 2052 Australia

 Email: d.harvey@unsw.edu.au

 Joris van der Hoeven

 CNRS, Laboratoire d'informatique École polytechnique 91128 Palaiseau Cedex France

 Email: vdhoeven@lix.polytechnique.fr

 June 27, 2022
 Assuming a widely-believed hypothesis concerning the least prime in an arithmetic progression, we show that two -bit integers can be multiplied in time on a Turing machine with a finite number of tapes; we also show that polynomials of degree less than over a finite field with elements can be multiplied in time , uniformly in . Keywords: Integer multiplication, polynomial multiplication, algorithm, complexity bound, FFT, finite field A.C.M. subject classification: G.1.0 Computer-arithmetic A.M.S. subject classification: 68W30, 68Q17, 68W40

## 1.Introduction

Let denote the bit complexity of multiplying two integers of at most bits in the deterministic multitape Turing model [16]. Similarly, let denote the bit complexity of multiplying two polynomials of degree in . Here is the finite field with elements, so that for some prime number and integer . For computations over , we tacitly assume that we are given a monic irreducible polynomial of degree , and that elements of are represented by polynomials in of degree . Notice that multiplication in can also be regarded as “carryless” integer multiplication in base .

The best currently known bounds for and were obtained in [21] and [23]. For constants and , we proved there that

 (1.1) (1.2)

where denotes the natural logarithm of ,

 (1.3)

and the bound (1.2) holds uniformly in .

For the statement of our new results, we need to recall the concept of a “Linnik constant”. Given two integers and with , we define

We say that a number is a Linnik constant if . The smallest currently known Linnik constant is due to Xylouris [57]. It is widely believed that any number is actually a Linnik constant; see section 5 for more details.

In this paper, we prove the following two results:

Theorem 1.1. Assume that there exists a Linnik constant with . Then

Theorem 1.2. Assume that there exists a Linnik constant with . Then

Theorem 1.2 is our main result. Notice that it implies in terms of the bit-size of the inputs. In the companion paper [22], we show that the bound actually holds unconditionally, thereby superseding Theorem 1.1. Since the proof of Theorem 1.1 is a good introduction to the proof of Theorem 1.2 and also somewhat shorter than the proof of the unconditional bound in [22], we decided to include it in the present paper. The assumption being satisfied “in practice”, it is also conceivable that variants of the corresponding algorithm may lead to faster practical implementations.

This companion paper [22] uses some of the techniques from the present paper, as well as a new ingredient: Gaussian resampling. This technique makes essential use of properties of real numbers, explaining why we were only able to apply it to integer multiplication. By contrast, the techniques of the present paper can be used in combination with both complex and number theoretic Fourier transforms. For our presentation, we preferred the second option, which is also known to be suitable for practical implementations [17]. It remains an open question whether Theorem 1.2 can be replaced by an unconditional bound.

In the sequel, we focus on the study of and from the purely theoretical perspective of asymptotic complexity. In view of past experience [17, 26, 32], variants of our algorithms might actually be relevant for machine implementations, but these aspects will be left aside for now.

### 1.1.Brief history and related work

Integer multiplication.The development of efficient methods for integer multiplication can be traced back to the beginning of mathematical history. Multiplication algorithms of complexity were already known in ancient civilizations, whereas descriptions of methods close to the ones that we learn at school appeared in Europe during the late Middle Ages. For historical references, we refer to [54, section II.5] and [41, 6].

The first more efficient method for integer multiplication was discovered in 1962, by Karatsuba [34, 33]. His method is also the first in a family of algorithms that are based on the technique of evaluation-interpolation. The input integers are first cut into pieces, which are taken to be the coefficients of two integer polynomials. These polynomials are evaluated at several well-chosen points. The algorithm next recursively multiplies the obtained integer values at each point. The desired result is finally obtained by pasting together the coefficients of the product polynomial. This way of reducing integer multiplication to polynomial multiplication is also known as Kronecker segmentation [27, section 2.6].

Karatsuba's original algorithm cuts each input integer in two pieces and then uses three evaluation points. This leads to an algorithm of complexity . Shortly after the publication of Karatsuba's method, it was shown by Toom [56] that the use of more evaluation points allowed for even better complexity bounds, with further improvements by Schönhage [47] and Knuth [35].

The development of efficient algorithms for the required evaluations and interpolations at large sets of points then became a major bottleneck. Cooley and Tukey's rediscovery [9] of the fast Fourier transform (FFT) provided the technical tool to overcome this problem. The FFT, which was essentially already known to Gauss [29], can be regarded as a particularly efficient evaluation-interpolation method for special sets of evaluation points. Consider a ring with a principal -th root of unity (see section 2.2 for detailed definitions; if , then one may take ). Then the FFT permits evaluation at the points using only ring operations in . The corresponding interpolations can be done with the same complexity, provided that is invertible in . Given with , it follows that the product can be computed using ring operations as well.

The idea to use fast Fourier transforms for integer multiplication is independently due to Pollard [45] and Schönhage–Strassen [49]. The simplest way to do this is to take and , while approximating elements of with finite precision. Multiplications in itself are handled recursively, through reduction to integer multiplications. This method corresponds to Schönhage and Strassen's first algorithm from [49] and they showed that it runs in time for all . Pollard's algorithm rather works with , where is a prime of the form . The field indeed contains primitive -th roots of unity and FFTs over such fields are sometimes called number theoretic FFTs. Pollard did not analyze the asymptotic complexity of his method, but it can be shown that its recursive use for the arithmetic in leads to a similar complexity bound as for Schönhage–Strassen's first algorithm.

The paper [49] also contains a second method that achieves the even better complexity bound . This algorithm is commonly known as the Schönhage–Strassen algorithm. It works over , where is a Fermat number, and uses as a principal -th root of unity in . The increased speed is due to the fact that multiplications by powers of can be carried out in linear time, as they correspond to simple shifts and negations.

The Schönhage–Strassen algorithm remained the champion for more than thirty years, before being superseded by Fürer's algorithm [12]. In short, Fürer managed to combine the advantages of the two algorithms from [49], to achieve the bound (1.1) for some unspecified constant . In [25, section 7], it has been shown that an optimized version of Fürer's original algorithm achieves . In a succession of works [25, 18, 19, 21], new algorithms were proposed for which , , , and . In view of the companion paper [22], we now know that .

Historically speaking, we also notice that improvements on the lowest possible values of and were often preceded by improvements modulo suitable number theoretic assumptions. In [25, section 9], we proved that one may take under the assumption that there exist “sufficiently many” Mersenne primes [25, Conjecture 9.1]. The same value was achieved in [10] modulo the existence of sufficiently many generalized Fermat primes. In [20], we again reached under the assumption that , where is the Euler totient function.

Polynomial multiplication.To a large extent, the development of more efficient algorithms for polynomial multiplication went hand in hand with progress on integer multiplication. The early algorithms by Karatsuba and Toom [34, 56], as well as Schönhage–Strassen's second algorithm from [49], are all based on increasingly efficient algebraic methods for polynomial multiplication over more or less general rings .

More precisely, let be the number of ring operations in that are required for the multiplication of two polynomials of degree in . A straightforward adaptation of Karatsuba's algorithm for integer multiplication gives rise to the algebraic complexity bound .

The subsequent methods typically require mild hypotheses on : since Toom's algorithm uses more than three evaluation points, its polynomial analogue requires to contain sufficiently many points. Similarly, Schönhage–Strassen multiplication is based on “dyadic” FFTs and therefore requires to be invertible in . Schönhage subsequently developed a “triadic” analogue of their method that is suitable for polynomial multiplication over fields of characteristic two [48]. The bound for general (not necessarily commutative) rings is due to Cantor and Kaltofen [8].

Let us now turn to the bit complexity of polynomial multiplication over a finite field with and where is prime. Using Kronecker substitution, it is not hard to show that

which allows us to reduce our study to the particular case when and .

Now the multiplication of polynomials in of small degree can be reduced to integer multiplication using Kronecker substitution: the input polynomials are first lifted into polynomials with integers coefficients in , and then evaluated at . The desired result can finally be read off from the integer product of these evaluations. If , this yields

 (1.4)

On the other hand, for , adaptation of the algebraic complexity bound to the Turing model yields

 (1.5)

where the first term corresponds to additions/subtractions in and the second one to multiplications. Notice that the first term dominates for large . The combination of (1.4) and (1.5) also implies

 (1.6)

For almost forty years, the bound (1.5) remained unbeaten. Since Fürer's algorithm (alike Schönhage–Strassen's first algorithm from [49]) essentially relies on the availability of suitable roots of unity in the base field , it admits no direct algebraic analogue. In particular, the existence of a Fürer-type bound of the form (1.2) remained open for several years. This problem got first solved in [27] for , but using an algorithm that is very different from Fürer's algorithm for integer multiplication. Under suitable number theoretic assumptions, it was also shown in the preprint version [24] that one may take , a result that was achieved subsequently without these assumptions [23].

Let us finally notice that it usually a matter of routine to derive better polynomial multiplication algorithms over for integers from better multiplication algorithms over . These techniques are detailed in [27, sections 8.1–8.3]. Denoting by the bit complexity of multiplying two polynomials of degree over , it is shown there that

 (1.7)

uniformly in , and for . Exploiting the fact that finite fields only contain a finite number of elements, it is also a matter of routine to derive algebraic complexity bounds for in the case when is a ring of finite characteristic. We refer to [27, sections 8.4 and 8.5] for more details.

Related tools.In order to prove Theorems 1.1 and 1.2, we will make use of several other tools from the theory of discrete Fourier transforms (DFTs). First of all, given a ring and a composite integer such that the are mutually coprime, the Chinese remainder theorem gives rise to an isomorphism

This was first observed in [1] and we will call the corresponding conversions CRT transforms. The above isomorphism also allows for the reduction of a univariate DFT of length to a multivariate DFT of length . This observation is older and due to Good [15] and independently to Thomas [55].

Another classical tool from the theory of discrete Fourier transforms is Rader reduction [46]: a DFT of prime length over a ring can be reduced to the computation of a cyclic convolution of length , i.e. a product in the ring . In section 4, we will also present a multivariate version of this reduction. In a similar way, one may use Bluestein's chirp transform [4] to reduce a DFT of general length over to a cyclic convolution of the same length. Whereas FFT-multiplication is a technique that reduces polynomial multiplications to the computation of DFTs, Bluestein's algorithm (as well as Rader's algorithm for prime lengths) can be used for reductions in the opposite direction. In particular, Theorem 1.2 implies that a DFT of length over a finite field can be computed in time on a Turing machine.

Nussbaumer polynomial transforms [42, 43] constitute yet another essential tool for our new results. In a similar way as in Schönhage–Strassen's second algorithm from [49], the idea is to use DFTs over a ring with . This ring has the property that is a -th principal root of unity and that multiplications by powers of can be computed in linear time. In particular, DFTs of length can be computed very efficiently using only additions and subtractions. Nussbaumer's important observation is that such transforms are especially interesting for the computation of multidimensional DFTs (in [49] they are only used for univariate DFTs). Now the lengths of these multidimensional DFTs should divide in each direction. Following a suggestion by Nussbaumer and Quandalle in [44, p. 141], this situation can be forced using Rader reduction and CRT transforms.

### 1.2.Overview of the new results

In order to prove Theorems 1.1 and 1.2, we make use of a combination of well-known techniques from the theory of fast Fourier transforms. We just surveyed these related tools; more details are provided in sections 2, 3 and 4. Notice that part of section 2 contains material that we adapted from previous papers [25, 27] and included for convenience of the reader. In section 5, the assumption on the existence of small Linnik constants will also be discussed in more detail, together with a few consequences. Throughout this paper, all computations are done in the deterministic multitape Turing model [16] and execution times are analyzed in terms of bit complexity. The appendix covers technical details about the cost of data rearrangements when using this model.

Integer multiplication.We prove Theorem 1.1 in section 6. The main idea of our new algorithm is to reduce integer multiplication to the computation of multivariate cyclic convolutions in a suitable algebra of the form

Here is a power of two and are the first prime numbers in the arithmetic progression , where is another power of two with . Setting , we choose such that there exists a principal -th root of unity in that makes it possible to compute products in the algebra using fast Fourier multiplication. Concerning the dimension , it turns out that we may take , although larger dimensions may allow for speed-ups by a constant factor as long as can be kept small.

Using multivariate Rader transforms, the required discrete Fourier transforms in essentially reduce to the computation of multivariate cyclic convolutions in the algebra

By construction, we may factor , where is an odd number that is small due the hypothesis on the Linnik constant. Since , we notice that contains a primitive -th root of unity. CRT transforms allow us to rewrite the algebra as

Now the important observation is that can be considered as a “fast” principal -th root of unity in both and . This means that products in can be computed using multivariate Fourier multiplication with the special property that the discrete Fourier transforms become Nussbaumer polynomial transforms. Since is a power of two, these transforms can be computed in time . Moreover, for sufficiently small Linnik constants , the cost of the “inner multiplications” in can be mastered so that it only marginally contributes to the overall cost.

Polynomial multiplication.In order to prove Theorem 1.2, we proceed in a similar manner; see sections 7 and 8. It suffices to consider the case that is prime. In section 7, we first focus on ground fields of characteristic . This time, the ring needs to be replaced by a suitable extension field of ; in particular, we define . More precisely, we take , which ensures the existence of primitive -th roots of unity in and .

The finite field case gives rise to a few additional technical difficulties. First of all, the bit size of an element of is exponentially larger than the bit size of an element of . This makes the FFT-approach for multiplying elements in not efficient enough. Instead, we impose the additional requirement that be pairwise coprime. This allows us to reduce multiplications in to univariate multiplications in , but at the expense of the much stronger hypothesis on .

A second technical difficulty concerns the computation of a defining polynomial for (i.e. an irreducible polynomial in of degree ), together with a primitive -th root of unity . For our choice of as a function of , and thanks to work by Shoup [51, 52], it turns out that both and can be precomputed in linear time .

The case when finally comes with its own particular difficulties, so we treat it separately in section 8. Since is no longer invertible, we cannot use DFT lengths that are powers of two. Following Schönhage [48], we instead resort to “triadic” FFTs, for which becomes a power of three. More annoyingly, since is necessarily even for , this also means that we now have to take . In a addition to the multivariate cyclic convolutions of lengths and from before, this gives rise to multivariate cyclic convolutions of length , which need to be handled separately. Although the proof of Theorem 1.2 in the case when is very similar to the case when , the above complications lead to subtle changes in the choices of parameters and coefficient rings. For convenience of the reader, we therefore reproduced most of the proof from section 7 in section 8, with the required adaptations.

Variants.The constants and in Theorems 1.1 and 1.2 are not optimal: we have rather attempted to keep the proofs as simple as possible. Nevertheless, our proofs do admit two properties that deserve to be highlighted:

• The hypothesis that there exists a Linnik constant that is sufficiently close to one is enough for obtaining our main complexity bounds.

• The dimension of the multivariate Fourier transforms can be kept bounded and does not need to grow with .

We refer to Remarks 6.1 and 7.1 for some ideas on how to optimize the thresholds for acceptable Linnik constants.

Our main algorithms admit several variants. It should for instance be relatively straightforward to use approximate complex arithmetic in section 6 instead of modular arithmetic. Due to the fact that primitive roots of unity in are particularly simple, the Linnik constants may also become somewhat better, but additional care is needed for the numerical error analysis. Using similar ideas as in [27, section 8], our multiplication algorithm for polynomials over finite fields can also be adapted to polynomials over finite rings and more general effective rings of positive characteristic.

Another interesting question is whether there exist practically efficient variants that might outperform existing implementations. Obviously, this would require further fine-tuning, since the “crazy” constants in our proofs were not motivated by practical applications. Nevertheless, our approach combines several ideas from existing algorithms that have proven to be efficient in practice. In the case of integer multiplication, this makes us more optimistic than for the previous post-Fürer algorithms [12, 13, 11, 25, 10, 18, 19, 21]. In the finite field case, the situation is even better, since similar ideas have already been validated in practice [26, 32].

Applications.Assume that there exist Linnik constants that are sufficiently close to . Then Theorems 1.1 and 1.2 have many immediate consequences, as many computational problems may be reduced to integer or polynomial multiplication.

For example, Theorem 1.1 implies that the Euclidean division of two -bit integers can be performed in time , whereas the gcd can be computed in time . Many transcendental functions and constants can also be approximated faster. For instance, binary digits of can be computed in time and the result can be converted into decimal notation in time . For details and more examples, we refer to [7, 31].

In a similar way, Theorem 1.2 implies that the Euclidean division of two polynomials in with of degree can be performed in time , whereas the extended gcd admits bit-complexity . Since inversion of a non-zero element of boils down to an extended gcd computation of degree over , this operation admits bit complexity . The complexities of many operations on formal power series over can also be expressed in terms of . We refer to [14, 30] for details and more applications.

## 2.Univariate arithmetic

### 2.1.Basic notations

Let be a commutative ring with identity. Given , we define

Elements of and will be called cyclic polynomials and negacyclic polynomials, respectively. For subsets , it will be convenient to write , and extend the above notations likewise. This is typically useful in the case when for some other ring and for some .

In our algorithms, we will only consider effective rings , whose elements can be represented using data structures of a fixed bitsize and such that algorithms are available for the ring operations. We will always assume that additions and subtractions can be computed in linear time and we denote by the bit complexity of multiplication in . For some fixed invertible integer , we sometimes need to divide elements of by ; we define to be the cost of such a division (assuming that a pre-inverse of has been precomputed). If with and prime, then we have , , and , where .

We write for the bit complexity of multiplying two polynomials in . We also define and . Multiplications in and are also called cyclic convolutions and negacyclic convolutions. Since reduction modulo can be performed using additions, we clearly have

 (2.1) (2.2)

We also have

 (2.3) (2.4)

since for all .

In this paper we will frequently convert between various representations. Given a computable ring morphism between two effective rings and , we will write for the cost of applying to an element of . If is an embedding of into that is clear from the context, then we will also write instead of . If and are isomorphic, with morphisms and for performing the conversions that are clear from context, then we define .

### 2.2.Discrete Fourier transforms and fast Fourier multiplication

Let be a ring with identity and let be such that for some . Given a vector , we may form the polynomial , and we define the discrete Fourier transform of with respect to to be the vector with for all .

Assume now that is a principal -th root of unity, meaning that

 (2.5)

for all . Then is again a principal -th root of unity, and we have

 (2.6)

Indeed, writing , the relation (2.5) implies that

where if and otherwise.

Consider any -th root of unity . Given a cyclic polynomial that is presented as the class of a polynomial modulo , the value does not depend on the choice of . It is therefore natural to consider as the evaluation of at . It is now convenient to extend the definition of the discrete Fourier transform and set . If is invertible in , then becomes a ring isomorphism between the rings and . Indeed, for any and , we have . Furthermore, the relation (2.6) provides us with a way to compute the inverse transform .

In particular, denoting by the cost of computing a discrete Fourier transform of length with respect to , we get

Since and coincide up to a simple permutation, we also notice that . Setting for some “privileged” principal -th root of unity (that will always be clear from the context), it follows that .

A well known application of this isomorphism is to compute cyclic convolutions for in using the formula

It follows that

Computing cyclic convolutions in this way is called fast Fourier multiplication.

### 2.3.The Cooley–Tukey FFT

Let be a principal -th root of unity and let where . Then is a principal -th root of unity and is a principal -th root of unity. Moreover, for any and , we have

 (2.7)

If and are algorithms for computing DFTs of length and , then we may use (2.7) to construct an algorithm for computing DFTs of length as follows.

For each , the sum inside the brackets corresponds to the -th coefficient of a DFT of the -tuple with respect to . Evaluating these inner DFTs requires calls to . Next, we multiply by the twiddle factors , at a cost of operations in . Finally, for each , the outer sum corresponds to the -th coefficient of a DFT of an -tuple in with respect to . These outer DFTs require calls to . On a Turing machine, the application of the inner DFTs also requires us to reorganize into consecutive vectors of size , and vice versa. This can be done in time using matrix transpositions (see section A.1 in the appendix).

Let denote the cost of performing a DFT of length , while assuming that the twiddle factors have been precomputed. Then computing the DFT using the above Cooley–Tukey algorithm yields the bound

For a factorization , an easy induction over then yields

The twiddle factors can obviously be computed in time . Altogether, we obtain

 (2.8)

### 2.4.Nussbaumer polynomial transforms

In the case that for some effective ring and power of two , the indeterminate is itself a principal -th root of unity. Moreover, multiplications by powers of can be computed in linear time since

for all and . For this reason, is sometimes called a fast principal root of unity.

Now assume that we wish to compute a discrete Fourier transform of size over . Then we may use the Cooley–Tukey algorithm from the previous subsection with , , and . When using our faster algorithm for multiplications by twiddle factors, the term in (2.8) can be replaced by . Moreover, discrete Fourier transforms of length just map pairs to pairs , so they can also be computed in linear time . Altogether, the bound (2.8) thus simplifies into

In terms of scalar operations in , this yields

Discrete Fourier transforms over rings of the type considered in this subsection are sometimes qualified as Nussbaumer polynomial transforms.

If is invertible in the effective ring from the previous subsection, then so is , and we can compute the inverse DFT of length using

However, if has characteristic (i.e. in ), then this is no longer possible, and we need to adapt the method if we want to use it for polynomial multiplication.

Following Schönhage [48], the remedy is to use triadic DFTs. More precisely, assuming that , we now work over the ring instead of , where

It is easily verified that is a principal -th root of unity in . Moreover, multiplications of elements in with powers of can still be computed in linear time . Indeed, given and , we have

Multiplications with (resp. ) can be reduced to this case, by regarding them as two multiplications with and (resp. three multiplications with , , and ). Using similar formulas, polynomials in of degree can be reduced in linear time modulo .

Now assume that we wish to compute a discrete Fourier transform of size over . Discrete Fourier transforms of length now map triples to triples , so they can be computed in linear time . Using the Cooley–Tukey algorithm with , , and , similar considerations as in the previous subsection again allows us to conclude that

We will qualify discrete Fourier transforms over rings of this type as triadic Nussbaumer polynomial transforms.

### 2.6.Bluestein's chirp transform

We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs — of arbitrary length, not necessarily a power of two — to polynomial multiplication [4], as follows.

Let be a principal -th root of unity. If is even, then we assume that there exists some with . If is odd, then we take , so that and . Consider the sequences

Then , so for any we have

 (2.9)

If is even, then we have

If is odd, then we also have

Now let , and modulo . Then (2.9) implies that for all . In other words, the computation of a DFT of even length reduces to a cyclic convolution product of the same length, together with additional operations in :

 (2.10)

Notice that the polynomial is fixed and independent of in this product. The (and ) can be computed in time via the identity .

## 3.Multivariate arithmetic

### 3.1.Tensor products

It is well known that multivariate Fourier transforms can be interpreted as tensor products of univariate Fourier transforms. Since we will also need another type of multivariate “Rader transform” in the next section, it will be useful to recall this point of view. We refer to [37, Chapter XVI] for an abstract introduction to tensor products.

Let be a commutative ring and let and be two -modules. The tensor product of and is an -module together with a bilinear mapping . This bilinear mapping satisfies the universal property that for any other -module together with a bilinear mapping , there exists a unique linear mapping with . Up to natural isomorphisms, it can be shown that the tensor product operation is associative and distributive with respect to the direct sum operation .

Besides modules, it is also possible to tensor linear mappings. More precisely, given two linear mappings and , the mapping that sends to is clearly bilinear. This means that there exists a unique mapping such that for all . We call the tensor product of and .

Assume now that and are free -modules of finite rank, say and . Then is again a free -module that can be identified with the set of bidimensional arrays of size and with coefficients in . The tensor product of two vectors and is given by the array with entries .

Let and be two more free -modules of finite rank. Assume that we are given two linear mappings and that can be computed by an algorithm. Then can be computed as follows. Given , we first apply to each of the rows . This yields a new array with for all . We next apply to each of the columns . This yields an array with for all . We claim that . Indeed, if , then and , whence the claim follows by linearity.

Given , the above algorithm then allows us to compute in time

 (3.1)

More generally, given linear mappings , a similar complexity analysis as in the case of the Cooley–Tukey algorithm yields

 (3.2)

where for .

### 3.2.Vector notation and multivariate polynomials

For working with multivariate polynomials, it is convenient to introduce a few notations. Let and be -tuples in . We define

Notice that . Elements in can be regarded as arrays. On a Turing tape, we recursively represent them as vectors of size with entries in .

Given a -tuple of indeterminates , we define

 (3.3) (3.4)

The first tensor product is understood in the sense of -modules and we regard as a subset of . The second tensor product is understood in the sense of -algebras. Any polynomial in (or cyclic polynomial in ) can uniquely be written as a sum

where is the vector of coefficients of . We write for the complexity of multiplying two polynomials in and also define .

### 3.3.Multivariate Fourier transforms

Assume now that is such that is a principal -th root of unity for and invertible positive integers . As in the univariate case, cyclic polynomials can naturally be evaluated at points of the form . We now define the discrete Fourier transform of to be the vector with

for all . Multivariate Fourier transforms again provide us with an isomorphism between and . Writing for the cost to compute such a transform with respect to a most suitable , we thus have .

We claim that multivariate Fourier transforms can be regarded as the tensor product of univariate Fourier transforms with respect to each of the variables, that is,

 (3.5)

Indeed, given a vector of the form with for each , we have

 (3.6)

where for each , and is as above. The relation (3.5) follows from (3.6) by linearity, since the of the form span .

In particular, in order to compute a multivariate Fourier transform, it suffices to compute univariate Fourier transforms with respect to each of the variables, as described in section 3.1. It follows that

 (3.7)

The technique of FFT-multiplication from section 2.2 also naturally generalizes to multivariate polynomials in . This yields the bound

 (3.8)

If , as in section 2.4, and all divide , then (3.7) and (3.8) become simply

 (3.9) (3.10)

### 3.4.CRT transforms

Let be a tuple of pairwise coprime numbers and . The Chinese remainder theorem provides us with an isomorphism of abelian groups

Any such isomorphism induces an isomorphism of -algebras

that boils down to a permutation of coefficients when using the usual power bases for and . For a suitable choice of , we show in section A.3 of the appendix that this permutation can be computed efficiently on Turing machines, which yields

We will also need a multivariate generalization of this formula: let , and be such that and are coprime for . Then

This is a result of Corollary A.8 in the appendix.

### 3.5.Negacyclic CRT transforms

Assume now that and let , where and are coprime and is odd. Then we have isomorphisms of -algebras

Given , the map determines an isomorphism of abelian groups between and . As explained in the previous subsection, this gives rise to an isomorphism of -algebras

In view of Lemma A.4, this isomorphism can be computed efficiently. Moreover, , since is odd and . This yields a natural isomorphism of -algebras such that the following diagram commutes:

The projection admits an obvious right inverse that sends a polynomial modulo to . This lift can clearly be computed in linear time, so we again obtain

Alternatively, one may compute in a similar way as the isomorphism of -algebras between from the previous subsection, while inverting signs whenever appropriate.

### 3.6.Tricyclic CRT transforms

In section 8, we will also need a triadic variant of the negacyclic CRT transforms from the previous subsection. More precisely, assume that , where and are coprime, is even, and is not divisible by . Then we have isomorphisms of -algebras

Given , the map determines an isomorphism of abelian groups between and . In a similar way as in the previous subsection, this leads to an isomorphism of -algebras

and a commutative diagram

in which all maps and their right inverses can be computed efficiently. This again yields

We will qualify this kind of conversions between and as tricyclic CRT transforms.

Let be a ring and let be a principal -th root of unity for some prime number . The multiplicative group is cyclic, of order . Let be such that is a generator of this cyclic group. Given a polynomial and with , we have

Setting

it follows that , when regarding , and as elements of , since

Notice that the vector can be precomputed since it does not depend on .

In summary, we have reduced the computation of a discrete Fourier transform of length to a cyclic convolution of length and additions in . This reduction is due to Rader [46], except for the minor twist that we subtract in the definition of . As far as we are aware, this twist is essentially due to Nussbaumer [43, section 5.2.2]. In the next subsection, it allows us to compute Rader transforms without increasing the dimension from to .

Let us now show how to present Rader reduction using suitable linear transforms, called left, right and inverse Rader transforms (LRT, RRT and IRT). These transforms depend on the choice of ; from now on, we assume that has been fixed once and for all. Given and , let , and be as in section 4.1. Setting

we first introduce the distinguished element by

We also define by

so that is a linear mapping. We can retrieve from the product using

We denote by the linear mapping with . Altogether, we have

The linear mappings and boil down to permutations and a linear number of additions and subtractions. Now permutations can be computed efficiently on Turing machines, as recalled in section A.1 of the appendix. More precisely, since is a principal root of unity of order , we necessarily have . Then the permutation of coefficients in of bitsize can be done using a merge sort in time , whence

It follows that

 (4.1)

In this bound, the cost corresponds to the trivial multiplication . We included this term in anticipation of the multivariate generalization (4.3) below. We also recall that we assumed to be precomputed. This precomputation can be done in time : we first compute the powers for and then permute them appropriately.

Assume now that are prime numbers, that we have fixed generators for the cyclic groups , and that are such that is a principal -th root of unity for . Let , , and . We define

Using the distributivity of with respect to , the algebra is a direct sum of algebras of cyclic multivariate polynomials. More precisely, let . For each index , let be such that if and if for . Then

Notice that

Elements in correspond to sums with . Assuming the lexicographical ordering on , such sums are represented on a Turing tape by concatenating the representations of the components : see section A.5 in the appendix for details. We have

 (4.2)

by Lemma A.9 (we may take ).

The tensor product of the mappings provides us with a mapping

Similarly, we obtain a mapping

and a distinguished element

For any and , we now have

By linearity, it follows that

for all .

From the complexity point of view, the relation (3.2) implies

Using (4.2) and , this leads to the following multivariate analogue of (4.1):

 (4.3)

As in the case of (4.1), this bound does not include the cost of the precomputation of . This precomputation can be performed in time using the same “list and sort” technique as in the univariate case.

## 5.Prime numbers in arithmetic progressions

### 5.1.The smallest prime number in an arithmetic progression

Let be the set of all prime numbers. Given two integers and with , we define and by

A constant with the property that is called a Linnik constant (notice that we allow the implicit constants and with for to depend on ).

The existence of a Linnik constant was shown by Linnik [39, 40]. The smallest currently known value is due to Xylouris [57]. Assuming the Generalized Riemann Hypothesis (GRH), it is known that any is a Linnik constant [28]. Unfortunately, these bounds are still too weak for our purposes.

Both heuristic probabilistic arguments and numerical evidence indicate that satisfies the much sharper bound , where stands for Euler's totient function. More precisely, based on such evidence [38], Li, Pratt and Shakan conjecture that

Unfortunately, a proof of this conjecture currently seems to be out of reach. At any rate, this conjecture is much stronger than the assumption that in Theorem 1.2.

### 5.2.Multiple prime numbers in arithmetic progressions

Given integers , , and with , let denote the -th prime number in the arithmetic progression , so that . Also define

Lemma 5.1. Let be a Linnik constant. Then for any fixed integer , we have

Remark. Note that the implied big-Oh constant depends on ; a similar remark applies to Lemma 5.4 below.

Proof. Let be an absolute constant with for all . Let be given and let be a fixed remainder class with .

Let be the smallest prime number such that and . The prime number theorem implies that . For each , let be such that and . Such an integer exists since and are coprime. Notice also that and are coprime. Now consider the prime numbers for . By construction, we have , so the numbers are pairwise distinct. We also have and , whence .

Remark 5.2. For numbers with a bounded number of prime factors, we actually have in the proof of Lemma 5.1, which leads to the sharper bound . It is possible that the bound always holds, but we were unable to prove this so far. If is no longer fixed, then it is not hard to see that , which yields . Concerning a generalization of the Li–Pratt–Shakan conjecture, it is likely that there exist constants and with

It is actually even likely that and . At any rate, it would be interesting to empirically determine and via numerical experiments.

### 5.3.Forcing coprime multipliers

When searching for multiple prime numbers in an arithmetic progression, as in the previous subsection, it will be useful in sections 7 and 8 to insist that the multipliers be pairwise coprime. The simplest way to force this is to define the by induction for using

 (5.1)

(In Lemma 5.3 below, we will show that such a value of exists for all , provided that is even.) We then define and

Lemma 5.3. Let be a Linnik constant and assume that is even. Given an integer , there exists an integer such that is prime, , and

Proof. Modulo the replacement of by , we may assume without loss of generality that . First we show that for each prime , there exists an integer such that and . Indeed, if then so we may take ; if , then , so at least one satisfies and . Now, using the Chinese remainder theorem, we may construct an integer such that and for all . Then we have and . Since or , we also must have .

Now , since and . It follows that is well defined. Let be the integer with and take , so that . We conclude by observing that , whence .

Lemma 5.4. Let be a Linnik constant and assume that is even. Then for any fixed integer , we have

Proof. Let us prove by induction on that defined as in (5.1) satisfies

Indeed, setting , the induction hypothesis implies

Lemma 5.3 then yields

whence the result.

Remark 5.5. Our bound for seems overly pessimistic. It is possible that a bound or always holds, but we were unable to prove this so far. It would also be interesting to investigate counterparts of the Li–Pratt–Shakan conjecture.

## 6.Proof of Theorem 1.1

Let be a Linnik constant with . Our algorithm proceeds as follows.

Step 0: setting up the parameters. Assume that we wish to multiply two -bit integers. For sufficiently large and suitable parameters , , , we will reduce this problem to a multiplication problem in . If , then any traditional multiplication algorithm can be used. The threshold should be chosen such that various asymptotic relations encountered in the proof below hold for all . The other parameters are chosen as follows:

• .

• is the smallest power of two with .

• for

• .

• , where .

• and .

• .

• is maximal with .

We claim that and the rest of these parameters can be computed in linear time . Recall from [2] that we can test whether a -bit integer is prime in time . We clearly have , whence for , by Lemma 5.1. For sufficiently large , this means that

The brute force computation of requires primality tests of total cost . From , we also deduce and . In particular, ; for sufficiently large, we thus get

In order to compute , we need to perform primality tests of total cost . This completes the proof of our claim.

Let us proceed with a few useful bounds. From we get

which in turn yields

Since , we also have

and , for sufficiently large . Inversely, implies

Notice finally that

Step 1: Kronecker segmentation. For sufficiently large , the last inequality implies that . In order to multiply two -bit integers and , we expand them in base , i.e.

and form the polynomials

Let be the reductions of these polynomials modulo and . Since and for , we can read off the product from the product , after which . Computing from and clearly takes linear time , and so does the retrieval of and from . In other words, we have shown that

 (6.1)

Step 2: CRT transforms. At a second stage, we use negacyclic CRT transforms (section 3.5) followed by traditional CRT transforms (section 3.4) in order to reduce multiplication in to multiplication in , where , , and :

Since , this yields

The computation of a generator of the cyclic group can be done in time , by [53]. The lift of the -th power of this generator to can be computed in time and yields a primitive -th root of unity . We perform the multivariate cyclic convolutions in using fast Fourier multiplication, where the discrete Fourier transforms are computed with respect to . In view of (3.8), this leads to

Indeed, scalar divisions by integers in can be performed in time

using Schönhage–Strassen's algorithm for integer multiplication. Now we observe that and , whence , again using Schönhage–Strassen multiplication. Performing multiplications in using Kronecker multiplication, it also follows that

for sufficiently large . We conclude that

 (6.2)

Step 3: multivariate Rader reduction. We compute the multivariate discrete Fourier transforms using Rader reduction. For each , let be such that if and if for . Then (4.3) implies

The necessary precomputations for each of the Rader reductions can be performed over instead of . Using Schönhage–Strassen multiplication, this can be done in time

since

Now for any , we have . Performing multiplications in using -fold Kronecker substitution and Cantor–Kaltofen's variant of Schönhage–Strassen multiplication over , this yields

whence

Since and , the right hand side is bounded by , whence

 (6.3)

Step 4: second CRT transforms. For , we may decompose , where is odd. Using multivariate CRT transforms (Corollary A.8), this yields

Setting and , this relation can be rewritten as

 (6.4)

Step 5: Nussbaumer polynomial transforms. Since , we have a fast root of unity of order in and in . We may thus compute polycyclic convolutions of order over using fast Fourier multiplication, with Nussbaumer polynomial transforms in the role of discrete Fourier transforms. From (3.10), it follows that

Here we used the bound , which is shown in a similar way as above. Modulo a change of variables where is a principal -th root of unity in , negacyclic convolutions in reduce to cyclic convolutions in . Combining this with the above relations, this yields

The bound follows from (1.1).

Step 6: residual cyclic convolutions. By construction, contains principal roots of unity of orders , which can again be computed in time using a similar method as above. We may thus compute cyclic convolutions of order over using fast Fourier multiplication. Since and , we obtain

Using Bluestein reduction (2.10), the univariate discrete Fourier transforms can be transformed back into cyclic products:

We perform the cyclic products using Kronecker substitution. Since is increasing (by definition), this yields

 (6.5)

Step 7: recurse. Combining the relations (6.1), (6.2), (6.3), (6.4) and (6.5), while using the fact that for sufficiently large , we obtain

Using that , this relation simplifies to

In terms of the normalized cost function

and , this relation becomes

By Lemma 5.1, we have for all . For sufficiently large , this means that , whence and . Using that , the latest bound on therefore further simplifies into

for some suitable universal constant . For sufficiently large , using our assumptions , we have and , whence

Let us show by induction over that this implies for all . This is clear for , so assume that . Then the above inequality yields

Consequently, . This completes the induction and we conclude that , whence .

Remark 6.1. In our proof, we have not attempted to optimize the constant . With a bit more work, it should be possible to achieve any constant larger than , as follows:

• On three occasions, we used the crude estimate for large ; this explains the increase from to .

• Taking , the cost of the pointwise multiplications in (6.2) becomes negligible, so there is no further increase from to in the last step. Since this also means that the dimension is no longer constant, a few adaptations are necessary. For instance, we need Remark 5.2 that and at the end of step 3, we only have .

• Taking one gains a factor .

• In recursive FFT-multiplications, one argument is always fixed (due to the fact that one of the multipliers in the Bluestein and Rader reductions is fixed). In a similar way as in [25], this makes it possible to save one fast Fourier transform out of three. Since this observation can be used twice in steps 2 and 6, a further factor of can be gained in this way.

• In step 6, we used the crude bound . Replacing the right-hand side by , we save another factor .

• By allowing to contain some small odd factors, we may further improve the inequality in step 7 to become . This saves another factor of two.

We finally recall that we opted to present our algorithm for number theoretic FFTs. When switching to approximate complex FFTs, one advantage is that one may work more extensively with primitive roots of power of two orders. This should also allow for the use of Crandall–Fagin's reduction (in a similar way as in [25]) to further release the assumption on the Linnik constant to .

## 7.Proof of Theorem 1.2 in characteristic

Recall that elements of the finite field are represented as remainder classes of polynomials in modulo , for some given monic irreducible polynomial of degree . Using Kronecker substitution and segmentation, one has the following well known complexity bounds [27, section 3]:

In order to establish Theorem 1.2, it therefore suffices to consider the case when is a prime number. In this section, we first prove the theorem in the case when . The case is treated in the next section. If , then it is well known that , again using Kronecker substitution. In this particular case, Theorem 1.2 therefore follows from Theorem 1.1. Let be a Linnik constant with . Our algorithm proceeds as follows.

Step 0: setting up the parameters. Assume that we wish to multiply two polynomials of degree in . For with and suitable parameters and , we will reduce this problem to a multiplication problem in . If , then we use Kronecker multiplication. The factor is a sufficiently large absolute constant such that various asymptotic inequalities encountered during the proof hold for all . The other parameters are determined as follows:

• .

• is the smallest power of two with .

• With the notations from section 5.3 and for each , let be the -th element in the list with and .

• .

• .

• .

• .

Since , we must have . From Lemma 5.4, we get

for sufficiently large , where

and is a constant. As in the previous section, the prime numbers and the rest of the parameters can be computed in linear time . Let us proceed with a few useful bounds that hold for sufficiently large . From the definitions of and , we clearly have

From and , we get

and

Step 0, continued: setting up the FFT field. Notice that . Elements of the field will be represented as polynomials in modulo , where is a monic irreducible polynomial of degree . By [51, Theorem 3.2], such a polynomial can be computed in time , where we used the assumption that .

Setting , we claim that the field admits a primitive -th root of unity , that can also be computed in time . Indeed, divides for each and , whence , so that . We compute by brute force: for all polynomials , starting with those of smallest degree, we check whether is a primitive -th root. By [52, Theorem 1.1, ], this procedure succeeds after at most checks. In order to check whether a given is a primitive -th root of unity, it suffices to verify that for each , which can be done in time using binary powering. The total time to compute a suitable is therefore bounded by , using our assumption that .

Step 1: Kronecker segmentation. Recall that . In order to multiply two polynomials of degree , we cut them in chunks of degree , i.e.

and form the polynomials

Let be the reductions of these polynomials modulo and . Since and , we can read off the product from the product , after which . Computing from and clearly takes linear time , and so does the retrieval of and from . In other words, we have shown that

 (7.1)

Step 2: CRT transforms. At a second stage, we use negacyclic CRT transforms followed by traditional CRT transforms in order to reduce multiplication in to multiplication in , where , , and :

Since , this yields

We perform the multivariate cyclic convolutions in using fast Fourier multiplication, where the discrete Fourier transforms are computed with respect to . In view of (3.8) and

this yields

The reduction of a polynomial of degree in modulo using Barrett's algorithm [3] essentially requires two polynomial multiplications in of degree , modulo the precomputation of a preinverse in time . Performing multiplications in using Kronecker substitution, we thus have

For sufficiently large , we conclude that

 (7.2)

Step 3: multivariate Rader reduction. We compute the multivariate discrete Fourier transforms using Rader reduction. For each , let be such that if and if for . Then

The necessary precomputations for each of the Rader reductions can be performed over instead of , in time

Now for any , we have . Performing multiplications in using -fold Kronecker substitution and (1.6), this yields

whence

Since , , and , the right hand side is bounded by , whence

 (7.3)

Step 4: second CRT transforms. For , we may decompose , where is odd. Using multivariate CRT transforms, this yields

Setting and , this relation can be rewritten as

 (7.4)

Step 5: Nussbaumer polynomial transforms. Since , we have a fast root of unity of order in and in . We may thus compute polycyclic convolutions of order over using fast Fourier multiplication, with Nussbaumer polynomial transforms in the role of discrete Fourier transforms:

Step 6: residual cyclic convolutions. This time, we convert the residual multivariate cyclic convolutions of length back into univariate cyclic convolutions of length using CRT transforms the other way around. This is possible since are pairwise coprime, and we get

whence

We perform the univariate cyclic products using Kronecker substitution. Since reductions of polynomials in modulo can be done in time , this yields

 (7.5)

Step 7: recurse. Combining the relations (7.1), (7.2), (7.3), (7.4) and (7.5), while using the fact that , we obtain

In terms of the normalized cost function

and , this relation becomes

where we used the fact that . Recall that and for , whence , whereas . Consequently, for large , we have

For some suitable universal constant , this yields

For sufficiently large , we have and , whence

In a similar way as in the previous section, this implies for all . Now recall that Kronecker substitution and Theorem 1.1 imply for all , whence . We conclude that .

Remark 7.1. The hypothesis is very pessimistic and mainly due to the way we reduce the residual multivariate cyclic convolutions back to univariate ones. Our assumption that the be pairwise coprime clearly makes this back-reduction very easy. Nevertheless, with more work, we expect that this assumption can actually be released. We intend to come back to this issue in a separate paper on the computation of multivariate cyclic convolution products.

## 8.Proof of Theorem 1.2 in characteristic

The proof of Theorem 1.2 in the case when is similar to the one from the previous section, with two important changes:

• Following Schönhage [48], we need to use triadic DFTs instead of dyadic ones. In particular, will now be a power of three.

• The transform lengths are unavoidably even. This time, we will therefore have , where the are pairwise coprime. As a consequence, we will need a separate mechanism to deal with residual polycyclic convolution products of length in each variable.

Given a ring and an integer , it will be convenient to define

We will also use the abbreviations , , , and so on. Throughout this section, we assume that is a Linnik constant with . Our algorithm proceeds as follows.

Step 0: setting up the parameters. Let and assume that we wish to multiply two polynomials of degree in . For with and suitable parameters and , we will reduce this problem to a multiplication problem in . If , then we use a naive multiplication algorithm. The threshold is a sufficiently large absolute constant such that various asymptotic inequalities encountered during the proof hold for all . The other parameters are determined as follows:

• .

• is the smallest power of three with .

• With the notations from section 5.3 and for each , let be the -th element in the list with .

• .

• .

• is smallest in with .

• .

Since , we must have . From Lemma 5.4, we get

for sufficiently large , where

and is a constant. As before, the prime numbers and the rest of the parameters can be computed in linear time . Let us proceed with a few useful bounds that hold for sufficiently large . From the definitions of and , we clearly have

From and , we get

and

Step 0, continued: setting up the FFT field. As in the previous section, setting , the defining polynomial for and a primitive -th root of unity can both be computed in time .

Step 1: Kronecker segmentation. Recall that . In order to multiply two polynomials of degree , we cut them in chunks of degree , i.e.

and form the polynomials

Let be the reductions of these polynomials modulo and . Since and , we can read off the product from the product , after which . Computing from and clearly takes linear time , and so does the retrieval of and from . In other words, we have shown that

 (8.1)

Step 2: CRT transforms. At a second stage, we use tricyclic CRT transforms followed by traditional CRT transforms in order to reduce multiplication in to multiplication in , where , , and :

Since , this yields

We perform the multivariate cyclic convolutions in using fast Fourier multiplication, where the discrete Fourier transforms are computed with respect to . Since the only possible divisions of elements in by integers are divisions by one, they come for free, so (3.8) yields

The multivariate DFTs can be performed on the coefficients over in parallel:

The reduction of a polynomial of degree in modulo using Barrett's algorithm [3] essentially requires two polynomial multiplications in of degree , modulo the precomputation of a preinverse in time . Performing multiplications in using Kronecker substitution, we thus have

For sufficiently large , we conclude that

 (8.2)

Step 3: multivariate Rader reduction. This step is essentially the same as in section 7; setting , we obtain the bound

 (8.3)

Step 4: second CRT transforms. For , we may decompose , where is coprime with . Using multivariate CRT transforms, this yields

Setting and , this relation can be rewritten as

 (8.4)

Step 5: Nussbaumer polynomial transforms. Since , we have a fast root of unity of order in and in . We may thus compute polycyclic convolutions of order over using fast Fourier multiplication, with triadic Nussbaumer polynomial transforms in the role of discrete Fourier transforms. Using that divisions by integers are again for free in , this yields

Step 6: residual cyclic convolutions. Using CRT transforms, the residual multivariate cyclic convolutions of length are successively transformed into multivariate cyclic convolutions of lengths and then . Using that are pairwise coprime, we get

whence

We perform the univariate cyclic products using Kronecker substitution. Since reductions of polynomials in modulo can be done in time , this yields

 (8.5)

Step 7: recurse. Combining the relations (8.1), (8.2), (8.3), (8.4) and (8.5), while using the fact that , we obtain

In terms of the normalized cost functions

and , the above relation implies

where we used the fact that . Taking , so that , this yields

Recall that and for , whence , whereas . Consequently,

for large . For some suitable universal constant , this yields

For sufficiently large , we have and , whence

In a similar way as in the previous sections, this implies for all , whence .

Remark 8.1. Since it suffices to consider the case when is bounded in our proof, we can pay ourselves the luxury to use a naive algorithm for polynomial multiplications over of bounded degree . It is an interesting question how to do such multiplications more efficiently. First of all, we notice that

via the the change of variables . The bound (3.2) implies that this change of variables can be performed in time . In [50], Schost designed an efficient algorithm for multiplication in the truncated power series ring . The idea is to embed this ring inside

and then to use multipoint evaluation and interpolation at . It turns out that it suffices to compute with series of precision in order to obtain the desired product. Altogether, this leads to an algorithm of bit complexity .

## 9.Variants of Theorem 1.2

There are several variants and generalizations of Theorem 1.2 along similar lines as the results in [27, section 8]. First of all, we have the following:

Theorem 9.1. Assume that there exists a Linnik constant with and let . Then the bit complexity of multiplying two polynomials in satisfies

Proof. Routine adaptation of the proofs of [27, Theorems 8.2 and 8.3].

Until now, we have systematically worked in the bit complexity model for implementations on a multitape Turing machine. But it is a matter of routine to adapt Theorem 1.2 to other complexity models. In the straight-line program (SLP) model, the most straightforward adaptation of [27, Theorem 8.4] goes as follows:

Theorem 9.2. Assume that there exists a Linnik constant with . Let be an -algebra, where is prime. We may multiply two polynomials in of degree less than using additions, subtractions, and scalar multiplications, and non-scalar multiplications. These bounds are uniform over all primes and all -algebras .

Proof. With respect to the proof of [27, Theorem 8.4], the only non-trivial thing to check concerns the number of non-scalar multiplications. These are due to the innermost multiplications of our recursive FFT-multiplications, when unrolling the recursion. Assume first that . Then one level of our recursion reduces the multiplication of two polynomials in to multiplications in , whence to multiplications in . By construction, we have and . Each recursive step therefore expands the data size by a factor at most while passing from degree to degree roughly . The algorithm therefore requires recursive steps and non-scalar multiplications. In the case when , the reasoning is similar: this time, the data size is expanded by a factor at most at each recursive step (since ), whereas the degree in descends from to at most . This leads to the announced complexity bound.

We notice that the number of non-scalar multiplications is far larger than , as for the algorithm from [27, Theorem 8.4]. Ideally speaking, we would like to take exponentially smaller than , which requires a sharper bound on than the one from Lemma 5.4. Assuming that this is possible (modulo stronger number theoretic assumptions), the number of recursive steps would go down to , and the number of non-scalar multiplications to . With more work, it is plausible that the constant can be further reduced.

It is also interesting to observe that the bulk of the other -vector space operations are additions and subtractions: for some constant with , the algorithm only uses scalar multiplications. In a similar way as above, this number goes down to modulo stronger number theoretic assumptions. For bounded , we also notice that scalar multiplications theoretically reduce to additions.

## Appendix A.Turing machine implementations

Recall that all computations in this paper are done in the deterministic multitape Turing model [16]. In this appendix, we briefly review the cost of various basic types of data rearrangements when working in this model.

### A.1.Arrays and sorting

Let be a data type whose instances take bits. An array of elements in is stored as a linear array of bits. We generally assume that the elements are ordered lexicographically by .

What is significant from a complexity point of view is that occasionally we must switch representations, to access an array (say 2-dimensional) by “rows” or by “columns”. In the Turing model, we may transpose an matrix of elements in in time

using the algorithm of [5, appendix]. Briefly, the idea is to split the matrix into two halves along the “short” dimension, and transpose each half recursively.

We will also require more complex rearrangements of data, for which we resort to sorting. Suppose that we can compare elements of in linear time . Then an array of elements of may be sorted in time

using merge sort [36], which can be implemented efficiently on a Turing machine.

Let us now consider a general permutation . We assume that is represented by the array , which takes space on a Turing tape. Given a general data type , the permutation naturally acts on arrays of length with entries in . As long as , this action can be computed in time , by tagging the -th entry of the array by , sorting the tagged entries, and then removing the tags.

Unfortunately, for some of the data rearrangements in this paper, we do not have . Nevertheless, certain special kinds of permutations can still be performed in time . Transpositions are one important example. In the next subsections, we will consider a few other examples.

### A.2.Basic multivariate rewritings

Lemma A.1. Consider a permutation . Let and . Then

Proof. Let be an absolute constant with for all . Let us prove by induction on that . For , the result is clear, so assume that . Let be the mapping with , and for . Setting , and , the natural isomorphism between and corresponds to the transposition of an matrix with coefficients in . Since conversions between and amount to such transpositions, it follows that

By the induction hypothesis, we also have

whence

as desired.

Corollary A.2. For any monic polynomials , the conversions

can be performed in time , where .

Proof. Setting , we may naturally represent elements in

by -variate arrays in . Similarly, elements of

are represented by arrays in . We now apply the lemma.

Corollary A.3. We have

Proof. Take for .

### A.3.CRT transforms

We start with two lemmas from [23, section 2.4]. For convenience we reproduce the proofs here.

Lemma A.4. Let be relatively prime and . Then

Proof. Let , and let

denote the homomorphism that maps to , and acts as the identity on . Suppose that we wish to compute for some input polynomial

Interpreting the list as an array, the -th entry corresponds to . After transposing the array, which costs bit operations, we have an array, whose -th entry is . Now for each , cyclically permute the -th row by slots; altogether this uses only bit operations. The result is an array whose -th entry is , which is exactly the coefficient of

in . The inverse map may be computed by reversing this procedure.

Lemma A.5. Let where the are pairwise coprime. Then

Proof. Using Lemma A.4, we construct a sequence of isomorphisms

the -th of which may be computed in bit operations for some universal constant . The overall cost is bit operations.

### A.4.Multivariate CRT transforms

Lemma A.6. Let and let be the prime power factorization of each . Setting , we have

Proof. We prove the assertion by induction over . For , the result follows from Lemma A.5, so assume that . Denote , , and . Lemma A.5 yields , whence

The induction hypothesis also yields

Adding up the these two bounds, the result follows.

Given , we recall that the classification of finite abelian groups implies the existence of a unique tuple with , , , and

We call the normal form of and say that is in normal form if .

Lemma A.7. If is the normal form of , then

Proof. Let and be the tuples obtained when performing the entry-wise prime power factorizations of and as in Lemma A.6. From Lemma A.6 it follows that and . Since is a permutation of , the result follows from Lemma A.1.

We say that two tuples and are equivalent if they admit the same normal form.

Corollary A.8. If and are equivalent, then

### A.5.Tensor products of direct sums

Consider an -algebra of the form , where is finite and totally ordered . Elements are of the form with , so we may naturally represent them on a Turing tape by concatenating the representations of . In particular, if each is an -algebra of dimension whose elements are represented by vectors of size with entries in , then elements in are simply represented as vectors of size with entries in .

Now consider such algebras and let us form their tensor product . Let be the dimension of the -subalgebra for and . Then elements in are recursively represented as vectors of size with entries in . Using the distributivity of over , we have a natural isomorphism

where

When endowing the index set with the lexicographical ordering, this isomorphism requires some reordering of data on Turing tapes, but we can compute it reasonably efficiently as follows:

Lemma A.9. With the above notations, , and , we have

Proof. Let us prove the assertion by induction on . For , we have nothing to do, so assume that . We may consider an element of as a vector of size with coefficients in . For each , we may compute the projection of on in time . Doing this for all entries of , we obtain an element of

This computation takes a time . Using the induction hypothesis, we can convert into an element

in time . Doing this in order for all and concatenating the resulting , we obtain the conversion of as an element of . The total computation time is bounded by

The conversion in the opposite direction can be computed in the same time by reversing the algorithm.

## Bibliography

[1]

R. Agarwal and J. Cooley. New algorithms for digital convolution. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(5):392–410, 1977.

[2]

M. Agrawal, N. Kayal, and N. Saxena. PRIMES is in P. Annals of Math., 160(2):781–793, 2004.

[3]

P. Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In A. M. Odlyzko, editor, Advances in Cryptology — CRYPTO' 86, pages 311–323. Berlin, Heidelberg, 1987. Springer Berlin Heidelberg.

[4]

L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.

[5]

A. Bostan, P. Gaudry, and É. Schost. Linear recurrences with polynomial coefficients and application to integer factorization and Cartier-Manin operator. SIAM J. Comput., 36:1777–1806, 2007.

[6]

C. B. Boyer. A History of Mathematics. Princeton Univ. Press, First paperback edition, 1985.

[7]

R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010.

[8]

D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.

[9]

J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.

[10]

S. Covanov and E. Thomé. Fast integer multiplication using generalized Fermat primes. Math. Comp., 88:1449–1477, 2019.

[11]

A. De, P. P. Kurur, C. Saha, and R. Saptharishi. Fast integer multiplication using modular arithmetic. SIAM J. Comput., 42(2):685–699, 2013.

[12]

M. Fürer. Faster integer multiplication. In Proceedings of the Thirty-Ninth ACM Symposium on Theory of Computing, STOC 2007, pages 57–66. New York, NY, USA, 2007. ACM Press.

[13]

M. Fürer. Faster integer multiplication. SIAM J. Comput., 39(3):979–1005, 2009.

[14]

J. von zur Gathen and J. Gerhard. Modern computer algebra. Cambridge University Press, New York, 3rd edition, 2013.

[15]

I. J. Good. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society, Series B. 20(2):361–372, 1958.

[16]

[17]

D. Harvey. Faster arithmetic for number-theoretic transforms. J. Symbolic Comput., 60:113–119, 2014.

[18]

D. Harvey. Faster truncated integer multiplication. https://arxiv.org/abs/1703.00640, 2017.

[19]

D. Harvey and J. van der Hoeven. Faster integer and polynomial multiplication using cyclotomic coefficient rings. Technical Report, ArXiv, 2017. http://arxiv.org/abs/1712.03693.

[20]

D. Harvey and J. van der Hoeven. Faster integer multiplication using plain vanilla FFT primes. Math. Comp., 88(315):501–514, 2019.

[21]

D. Harvey and J. van der Hoeven. Faster integer multiplication using short lattice vectors. In R. Scheidler and J. Sorenson, editors, Proc. of the 13-th Algorithmic Number Theory Symposium, Open Book Series 2, pages 293–310. Mathematical Sciences Publishes, Berkeley, 2019.

[22]

D. Harvey and J. van der Hoeven. Integer multiplication in time . Technical Report, HAL, 2019. http://hal.archives-ouvertes.fr/hal-02070778.

[23]

D. Harvey and J. van der Hoeven. Faster polynomial multiplication over finite fields using cyclotomic coefficient rings. Shortened version of [19]. Accepted for publication in J. of Complexity.

[24]

D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. Technical Report, ArXiv, 2014. http://arxiv.org/abs/1407.3361. Preprint version of [27].

[25]

D. Harvey, J. van der Hoeven, and G. Lecerf. Even faster integer multiplication. Journal of Complexity, 36:1–30, 2016.

[26]

D. Harvey, J. van der Hoeven, and G. Lecerf. Fast polynomial multiplication over . In Proc. ISSAC '16, pages 255–262. New York, NY, USA, 2016. ACM.

[27]

D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. J. ACM, 63(6), 2017. Article 52.

[28]

D. R. Heath-Brown. Zero-free regions for Dirichlet L-functions, and the least prime in an arithmetic progression. Proc. London Math. Soc., 64(3):265–338, 1992.

[29]

M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. Arch. Hist. Exact Sci., 34(3):265–277, 1985.

[30]

J. van der Hoeven. Relax, but don't be too lazy. JSC, 34:479–542, 2002.

[31]

J. van der Hoeven. Faster Chinese remaindering. Technical Report, HAL, 2016. http://hal.archives-ouvertes.fr/hal-01403810.

[32]

J. van der Hoeven, R. Larrieu, and G. Lecerf. Implementing fast carryless multiplication. In J. Blömer, I. S. Kotsireas, T. Kutsia, and D. E. Simos, editors, Proc. MACIS 2017, Vienna, Austria, Lect. Notes in Computer Science, pages 121–136. Cham, 2017. Springer International Publishing.

[33]

A. A. Karatsuba. The complexity of computations. Proc. of the Steklov Inst. of Math., 211:169–183, 1995. English translation; Russian original at pages 186–202.

[34]

A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.

[35]

D. E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, 1969.

[36]

D. E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1998.

[37]

S. Lang. Algebra. Springer Verlag, 2002.

[38]

J. Li, K. Pratt, and G. Shakan. A lower bound for the least prime in an arithmetic progression. The Quarterly Journal of Mathematics, 68(3):729–758, 2017.

[39]

Yu. V. Linnik. On the least prime in an arithmetic progression I. The basic theorem. Rec. Math. (Mat. Sbornik) N.S., 15(57):139–178, 1944.

[40]

Yu. V. Linnik. On the least prime in an arithmetic progression II. The Deuring-Heilbronn phenomenon. Rec. Math. (Mat. Sbornik) N.S., 15(57):347–368, 1944.

[41]

O. Neugebauer. The Exact Sciences in Antiquity. Brown Univ. Press, Providence, R.I., 1957.

[42]

H. J. Nussbaumer. Fast polynomial transform algorithms for digital convolution. IEEE Trans. Acoust. Speech Signal Process., 28(2):205–215, 1980.

[43]

H. J. Nussbaumer. Fast Fourier Transforms and Convolution Algorithms. Springer-Verlag, 2nd edition, 1982.

[44]

H. J. Nussbaumer and P. Quandalle. Computation of convolutions and discrete Fourier transforms by polynomial transforms. IBM J. Res. Develop., 22(2):134–144, 1978.

[45]

J. M. Pollard. The fast Fourier transform in a finite field. Mathematics of Computation, 25(114):365–374, 1971.

[46]

C. M. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56:1107–1108, 1968.

[47]

A. Schönhage. Multiplikation großer Zahlen. Computing, 1(3):182–196, 1966.

[48]

A. Schönhage. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. Acta Infor., 7:395–398, 1977.

[49]

A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.

[50]

É. Schost. Multivariate power series multiplication. In Manuel Kauers, editor, ISSAC '05, pages 293–300. New York, NY, USA, 2005. ACM Press.

[51]

V. Shoup. New algorithms for finding irreducible polynomials over finite fields. Math. Comp., 189:435–447, 1990.

[52]

V. Shoup. Searching for primitive roots in finite fields. Math. Comp., 58:369–380, 1992.

[53]

I. Shparlinski. On finding primitive roots in finite fields. Theoret. Comput. Sci., 157(2):273–275, 1996.

[54]

D. E. Smith. History of Mathematics, volume 2. Dover, 1958.

[55]

L. H. Thomas. Using computers to solve problems in physics. Applications of digital computers, 458:42–57, 1963.

[56]

A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.

[57]

T. Xylouris. On the least prime in an arithmetic progression and estimates for the zeros of Dirichlet L-functions. Acta Arith., 1:65–91, 2011.