Laboratoire d'informatique, UMR 7161 CNRS
Campus de l'École polytechnique
1, rue Honoré d'Estienne d'Orves
Bâtiment Alan Turing, CS35003
91120 Palaiseau

Abstract. The Chinese remainder theorem is a key tool for the design of efficient multi-modular algorithms. In this paper, we study the case when the moduli $m_1, \ldots, m_\ell$ are fixed and can even be chosen by the user. Through an appropriate use of the technique of FFT-trading, we will show that this assumption allows for the gain of an asymptotic factor $O(\log \log \ell)$ in the complexity of "Chinese remaindering". For small $\ell$, we will also show how to choose "gentle moduli" that allow for further gains at the other end. The multiplication of integer matrices is one typical application where we expect practical gains for various common matrix dimensions and integer bitsizes.

Modular reduction is an important tool in computer algebra and elsewhere for speeding up computations. The technique makes it possible to reduce a problem that involves large integer or polynomial coefficients to one or more similar problems that only involve small modular coefficients. Depending on the application, the solution to the initial problem is reconstructed via the Chinese remainder theorem or Hensel's lemma. We refer to [?, chapter 5] for a gentle introduction to this topic.

In this paper, we will mainly be concerned with multi-modular algorithms over the integers that rely on the Chinese remainder theorem. The archetype of such an algorithm works as follows. We start with a polynomial function $f : \mathbb{Z}^r \rightarrow \mathbb{Z}^s$. For any modulus $m$, reduction of $f$ modulo $m$ yields a new function $f_m : (\mathbb{Z}/m\mathbb{Z})^r \rightarrow (\mathbb{Z}/m\mathbb{Z})^s$ such that
$$ f(x_1, \ldots, x_r) \bmod m \;=\; f_m(x_1 \bmod m, \ldots, x_r \bmod m) $$
for all $x_1, \ldots, x_r \in \mathbb{Z}$. Given an algorithm to compute $f$ that only uses ring operations on integers, it suffices to replace each ring operation by its reduction modulo $m$ in order to obtain an algorithm that computes $f_m$.
Now given integers $x_1, \ldots, x_r \in \mathbb{Z}$ and $(y_1, \ldots, y_s) = f(x_1, \ldots, x_r)$, assume that we know a bound $B \in \mathbb{N}$ with $|x_i| \leq B$ for $i = 1, \ldots, r$ and $|y_j| \leq B$ for $j = 1, \ldots, s$. Then the following multi-modular algorithm provides us with an alternative way to compute $(y_1, \ldots, y_s) = f(x_1, \ldots, x_r)$:

0. Select moduli $m_1, \ldots, m_\ell$ with $m_1 \cdots m_\ell > 2 B$ that are mutually coprime.
1. For $i = 1, \ldots, r$, compute $x_{i,k} := x_i \bmod m_k$ for $k = 1, \ldots, \ell$.
2. For $k = 1, \ldots, \ell$, compute $(y_{1,k}, \ldots, y_{s,k}) := f_{m_k}(x_{1,k}, \ldots, x_{r,k})$.
3. For $j = 1, \ldots, s$, reconstruct $y_j$ from the values $y_{j,k} = y_j \bmod m_k$ with $k = 1, \ldots, \ell$.

Step 1 consists of multi-modular reductions (finding the $x_{i,k}$ as a function of $x_i$) and step 3 of multi-modular reconstructions (finding $y_j$ as a function of the $y_{j,k}$); this is where the Chinese remainder theorem comes in. For a more detailed example with an application to integer matrix multiplication, we refer to section ??.

In favourable cases, the cost of steps 0, 1 and 3 is negligible with respect to the cost of step 2. In such situations, the multi-modular algorithm to compute $f$ is usually much faster than the original algorithm. In less favourable cases, the cost of steps 1 and 3 can no longer be neglected. This raises the question of how to reduce the cost of these steps as much as possible.

Two observations are crucial here. First of all, the moduli $m_1, \ldots, m_\ell$ are the same for all $r$ multi-modular reductions and all $s$ multi-modular reconstructions. If $r + s$ is large, then this means that we can essentially assume that $m_1, \ldots, m_\ell$ were fixed once and for all. Secondly, we are free to choose $m_1, \ldots, m_\ell$ in any way that suits us. By making each $m_i$ fit into a machine word, one may ensure that every modular operation only takes a few cycles. Special "FFT-moduli" are often used as well for speeding up polynomial arithmetic.

In this paper, we will show how to exploit both of the above observations. For fixed moduli, we will show in section ?? how to make Chinese remaindering asymptotically more efficient by a factor $O(\log \log \ell)$ when $\ell$ gets large. In section ??, we show that it is possible to construct "gentle moduli" that allow for speed-ups when $\ell$ is small (say $\ell \leq 64$).
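The archetype algorithm described earlier (steps 0 through 3) can be sketched in Python. This is a minimal illustration with a naive Chinese-remainder reconstruction; `f_mod` is a hypothetical callback that evaluates the reduced map $f_m$ (a single output is assumed, for simplicity), and a symmetric lift is used at the end to recover signed results.

```python
from itertools import combinations
from math import gcd, prod

def crt(residues, moduli):
    """Naive Chinese remaindering: the x in [0, M) with x = r_k mod m_k."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        c = (M // m) * pow(M // m, -1, m) % M   # cofactor: 1 mod m, 0 mod M/m
        x = (x + r * c) % M
    return x

def multi_modular(f_mod, xs, moduli, bound):
    """Steps 0-3 of the archetype algorithm.  f_mod(m, xs) must return
    f(xs) mod m; `bound` bounds the absolute values of inputs and outputs."""
    assert all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
    M = prod(moduli)
    assert M > 2 * bound                        # step 0: moduli large enough
    ys = [f_mod(m, [x % m for x in xs]) for m in moduli]   # steps 1-2
    y = crt(ys, moduli)                         # step 3: reconstruction
    return y - M if y > M // 2 else y           # symmetric lift recovers signs
```

For instance, with $f(x_1, x_2) = x_1 x_2$, four word-size prime moduli suffice to recover a signed product of two medium-size integers.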
Both results can be combined in order to accelerate Chinese remaindering for all possible values of $\ell$. The new asymptotic complexity bounds make heavy use of discrete Fourier transforms. For our purposes, it is crucial to avoid "synthetic" FFT schemes that require the adjunction of artificial roots of unity, as in Schönhage–Strassen multiplication. Instead, one should use "inborn" FFT schemes that work with approximate roots of unity in $\mathbb{C}$ or roots of unity with high smooth orders in finite fields; see [?] and [?]. Basic complexity properties of integer multiplication and division based on fast Fourier techniques are recalled in section ??.

Let $M(n)$ denote the bit complexity of multiplying two $n$-bit numbers. Given pairwise coprime moduli $m_1, \ldots, m_\ell$ of bit-size $n$, it is well known that multi-modular reduction and reconstruction can be carried out in time $O(M(\ell n) \log \ell)$ using so-called remainder trees. Recent improvements of this technique can be found in [?]. The main goal of section ?? is to show that this complexity essentially drops down to $O(M(\ell n) \log \ell / \log \log \ell)$ in the case when all moduli $m_1, \ldots, m_\ell$ are fixed; see Theorems ?? and ?? for more precise statements. The main idea is to increase the arities of nodes in the remainder tree, while performing the bulk of the computations at each node on Fourier representations. This technique of trading faster algorithms against faster representations was also used in [?], where we called it FFT-trading; see also [?]. The same approach can also be applied to the problem of base conversion (see section ??) and to univariate polynomials instead of integers (see section ??).

Having obtained a non-trivial asymptotic speed-up for large $\ell$, we next turn our attention to the case when $\ell$ is small (say $\ell \leq 64$). The main goal of section ?? is to exhibit the existence of gentle moduli $m_1, \ldots, m_\ell$ for which Chinese remaindering becomes more efficient than usual. The first idea is to pick moduli $m_i$ of the form $2^{s w} - \epsilon_i^2$, where $w$ is somewhat smaller than the hardware word size, $s$ is even, and $\epsilon_i < 2^{w/2}$.
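A toy search for moduli of this special shape can be sketched as follows. This is only an illustration under simplified assumptions: the smoothness criterion ("all prime factors below $2^w$") and the parameters are chosen for readability, not the actual requirements on gentle moduli (which also constrain the number and coprimality of the word-size factors).

```python
from math import gcd

def factor(n):
    """Trial-division factorization; fine at these toy sizes."""
    fs, p = [], 2
    while p * p <= n:
        while n % p == 0:
            fs.append(p)
            n //= p
        p += 1
    if n > 1:
        fs.append(n)
    return fs

def gentle_candidates(s, w):
    """Toy search for moduli 2^(s*w) - eps^2: every candidate splits as
    (2^(s*w/2) - eps) * (2^(s*w/2) + eps); keep eps when both halves are
    fairly smooth (simplified criterion: all prime factors below 2^w)."""
    assert s % 2 == 0
    half = 1 << (s * w // 2)
    hits = []
    for eps in range(1, 1 << (w // 2), 2):      # odd eps => coprime halves
        lo, hi = half - eps, half + eps
        assert lo * hi == (1 << (s * w)) - eps * eps
        assert gcd(lo, hi) == 1
        if all(p < (1 << w) for p in factor(lo) + factor(hi)):
            hits.append(eps)
    return hits
```

For example, `gentle_candidates(2, 16)` reports `eps = 5`, since $2^{16} - 5 = 19 \cdot 3449$ and $2^{16} + 5 = 3 \cdot 7 \cdot 3121$, whereas `eps = 1` fails because $2^{16} + 1 = 65537$ is prime.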
In section, we will show that multi-modular reduction and reconstruction both become a lot simpler for such moduli. Secondly, each> can be factored as =-\|)>*+\|)>> and, if we are lucky, then both -\> and +\> can be factored into moduli that fit into machine words. If we are very lucky, then this allows us to obtain > moduli > of bitsize w> that are mutually coprime and for which Chinese remaindering can be implemented efficiently. Gentle moduli can be regarded as the integer analogue of \Pspecial sets of points\Q that allowed for speed-ups of multi-point evaluation and interpolation in. <\acknowledgments*> We would like to thank Grégoire for pointing us to Bernstein's work on the topic of this paper. Throughout this paper we will assuming the deterministic multitape Turing model in order to analyze the \Pbit complexity\Q of our algorithms. We will denote by > the cost ofbit integer multiplication. The best current bound for > is <\eqnarray*> >||> n>|)>,>>>> where > n\min \:|k\>\log|)>\1|}>> is called the iterator of the logarithm. For large , it is well known that the fastest algorithms for integer multiplication are all based on the discrete Fourier transform: denoting by > the cost of a\Psuitable Fourier transform\Q of bitsize and by > the cost of the \Pinner multiplications\Q for this bitsize, one has <\eqnarray*> >||+.>>>> For the best current algorithm from, we have <\eqnarray*> >||> n>|)>>>|*>||> n>|)>.>>>> One always has =o|)>>. The actual size of Fourier transforms is usually somewhat restricted: for efficiency reasons, it should be the product of powers of small prime numbers only, such as , and. Fortunately, for large numbers , it is always possible to find \2>*3>*5>> with \n> and/n=1+o>. It is also well known that fast Fourier transforms allow for several tricks. For instance, if one of the multiplicands of an -bit integer product is fixed, then its Fourier transform can be precomputed. 
This means that the cost of the multiplication drops down to <\eqnarray*> >||+\*.>>>> In particular, the complexity > of multiplying an -bit integer with an -bit one (for n>) satisfies <\eqnarray*> >|||*++o|>*.>>>> Squares of-bit numbers can be computed in time |)>*\*> for the same reason. Yet another example is the multiplication of two 2> matrices with >bit integer entries: such multiplications can be done in time |)>*\4*> by transforming the input matrices, multiplying the transformed matrices in the \PFourier model\Q, and then transforming the result back. In the remainder of this paper, we will systematically assume that asymptotically fast integer multiplication is based on fast Fourier transforms. In particular, we have() for certain functions > and >. We will also assume that the functions /*log n|)>> and /n> are (not necessarily strictly) increasing and that =o*log n*log log n|)>>. These additional conditions are satisfied for() and(). The first condition is violated whenever the FFT scheme requires the adjunction of artificial roots of unity. This happens for Schönhage\UStrassen multiplication, in which case we have =O> and =O>). We will say that an FFT-scheme is if it satisfies our additional requirements. For a hypothetical integer multiplication that runs in time =o> n>|)>>, but for which /> is (not necessarily strictly) increasing, we also notice that it is possible to design an inborn FFT-based integer multiplication method that runs in time |)>>; this is for instance used in. Let > denote the cost of euclidean division with remainder of an -bit integer by an -bit one. In 3.2>, we gave an algorithm for the euclidean division of a polynomial of degree 2*n> by another polynomial of degree n>. This algorithm is based on , a technique that consists of doing as much work as possible in the FFTmodel even at the expense of using naive, suboptimal algorithms in the FFT-model. 
The straightforward adaptation of this division to integer arithmetic yields the asymptotic bound <\eqnarray*> >|>|+o|)>*.>>>> Furthermore, the discrete Fourier transforms for the dividend make up for roughly one fifth of the total amount of transforms. For 2*n>, the cost of the transforms for the dividend does not grow with , which leads to the refinement <\eqnarray*> >|>||*-1+o|>*2*n|)>.>>>> Similarly, if >, then <\eqnarray*> >|>|+o|)>**|)>,>>>> since the bulk of the computation consists of multiplying the approximate bit quotient with the >-bit dividend. If the dividend is fixed, then we even get <\eqnarray*> >|>|+o|)>**|)>,>>>> since the Fourier transforms for the dividend can be precomputed. Let us start with a few definitions and notations. Given \> and \>, we define <\eqnarray*> >||:|0\k\2|\>|}>>>>> be the set of dyadic fixed point numbers of bitsize and with exponent . Given \> and \>>, we denote by <\eqnarray*> ||||\>*m\>>>> the remainder of the euclidean division of by . Given \> and \\>>, we say that \\> is an >-approximation> of if -x|\|>\\>. We also say that > is a >approximation> of if -x|\|>>\\>. Here -x|\|>>\min\> -x-k|\|>> denotes the between > and . Let \\>, \\> and \>. , Bernstein observed in that we may compute a circular >approximation \\> for as follows. We first compute the product -1|)>> of and modulo -1> and then take =|2*m|\>*2>. Let us show that > indeed satisfies the required property. By construction, there exists an \> with 2> such that -1|)>>. Therefore, m*2-\2> and =a*-1|)>*2=a-a*2>, whence \x*y-\a+2>. More generally, if we only have a circular *2|]>>-approximation =k*2\\> of a number \> (instead of a number k*2\\> as above), then the algorithm computes acircular *\|)>*2|]>>-approximation > of . Bernstein's trick is illustrated in Figure: we are only interested in the highlighted portion of the product. 
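Bernstein's observation can be checked exactly with rational arithmetic. The helper names below are ours; `u` is the integer numerator of the fixed-point number $x = u/2^n$.

```python
from fractions import Fraction

def mod1_approx(u, y, n):
    """Bernstein's trick: for x = u/2^n in [0,1) and a nonnegative integer
    y, let b = (u*y) mod (2^n - 1).  Then b/2^n is a circular approximation
    of (x*y) mod 1 with error below y/2^n, at the price of one product
    modulo 2^n - 1 (three FFTs of bitsize n) instead of a 2n-bit product."""
    b = (u * y) % ((1 << n) - 1)
    return Fraction(b, 1 << n)

def circ_err(a, b):
    """Distance between a and b on the circle R/Z."""
    d = (a - b) % 1
    return min(d, 1 - d)
```

Writing $u y = q (2^n - 1) + b$, one has $x y = q + (b - q)/2^n$ with $q < y$, which is exactly the stated circular error bound.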
We notice that these kinds of truncated products are reminiscent of the \Pmiddle product\Q in the case of polynomials; in our setting, we also have to cope with carries and rounding errors. When using FFT-multiplication, products modulo -1> can be computed using three discrete Fourier transforms of bitsize , so the cost is essentially the same as the cost of an -bit integer multiplication. If one of the arguments is fixed, then the cost becomes asymptotic to *>. <\big-figure||gr-frame|>|gr-geometry||gr-arrow-end|\|gr-text-at-halign|center|gr-auto-crop|true|magnify|0.840896415||||>>||||>>|||>>|||>>|>>|>>|>>|>>|>>|>>|>>>>> Product modulo one of \> and \> with significant bits. More generally, for \|}>>, let \\*n,-\*n>> be a circular *2*n>>-approximation of anumber \> and let \-1|)>*n,0>>. Then we may compute a circular approximation \\> of as follows. Consider the expansions <\equation> =>*2,y=-2>y*2, with >,\,,y,\,y-2>\\>; see Figure. By what precedes, for ,\-2>, we may compute circular >-approximations> for <\eqnarray*> >||*2+*2|)>*y|]> rem 1.>>>> Setting <\eqnarray*> >||>*2+1-i|)>*n>+\+*2|)>*y|]> rem 1,>>>> it follows that -v|\|>>\2\2>, whereas *y rem 1=+\+v-2>|)>> rem 1>. Taking =+\+u-2>|)> rem 1>, it follows that -v|\|>>\-2|)>*2> and <\eqnarray*> -z|\|>>>|>|-2+y*2-1|)>*n>*\|)>*2.>>>> When using FFT-multiplication, we notice that the sum +\+v-2>> can be computed in the FFT-model, before being transformed back. In that way, we only need +1> instead of > transforms of size , for a total cost of *\++o|)>*>. <\big-figure||gr-frame|>|gr-geometry||gr-text-at-halign|center|gr-auto-crop|true|gr-grid||gr-grid-old||1>|gr-edit-grid-aspect|||>|gr-edit-grid||gr-edit-grid-old||1>|gr-text-at-valign|center|magnify|0.840896415||||>>||||>>|>>|>>|>>|>>|>>|>>|||>>||>>||>>||>>||>>||||>>||>>||||>>||>>||>>|*n|>>||>>|-1|)>*n|>>|>>||>>|||>>>>> Product modulo one of \*n,-\*n>> and \-1|)>*n,0>> with significant bits. 
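The chunked variant above admits a direct (non-FFT) transcription; this is a sketch with exact rationals for error checking, and the error bound asserted below is deliberately loose.

```python
from fractions import Fraction

def mod1_approx_chunked(u, lam, y_chunks, n):
    """Approximate (x*y) mod 1 for x = u/2^(lam*n) and
    y = sum_i y_chunks[i] * 2^(i*n) with n-bit chunks.  Each chunk costs
    one product modulo 2^(lam*n) - 1, and the per-chunk contributions are
    accumulated before a single final reduction; in the FFT model, only
    the accumulated sum needs to be transformed back."""
    M = (1 << (lam * n)) - 1
    b = 0
    for i, y_i in enumerate(y_chunks):
        rot = (u << (i * n)) % M    # cyclic shift of u: ~ x * 2^(i*n) mod 1
        b = (b + rot * y_i) % M
    return Fraction(b, 1 << (lam * n))

def circ_err(a, b):
    """Distance between a and b on the circle R/Z."""
    d = (a - b) % 1
    return min(d, 1 - d)
```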
For actual machine implementations of large integer arithmetic, it is customary to choose a base of the form > and to perform all computations with respect to that base. We will call the . For processors that have good hardware support for integer arithmetic, taking or is usually most efficient. The GMP package uses thisapproach. However, modern processors are often better at floating point arithmetic. General purpose processors generally provide double precision IEEE-768 compliant instructions, whereas GPUs are frequently limited to single precision. The respective choices50> and22> are most adequate in these cases. It is good to pick slightly below the maximum bitsize of the mantissa in order to accelerate carry handling. We refer to for more details on the implementation of multiple precision arithmetic based on floating point arithmetic. Another particularity of recent processors is the availability of ever wider SIMD (Single Instruction Multiple Data) instructions. For modern implementations of large integer arithmetic, we therefore recommend to focus on the problem of multiplying (>) large integers of a given bitsize instead of a single one. We again refer to for more details. In what follows, when using integer arithmetic, we will denote by the maximal bitsize such that we have a hardware instruction to multiply two integers of bits ( or). When using floating point arithmetic, we let be the bitsize of amantissa ( or ). We will call the . For implementations of multiple precision arithmetic, we always have W>, but it can be interesting to take W>. For moduli that fit into a machine word, arithmetic modulo can be implemented efficiently using hardware instructions. In this case, the available algorithms again tend to be somewhat faster if the size of is a few bit smaller than . We refer to for asurvey of the best currently available algorithms in various cases and how to exploit SIMD instructions. 
When using arithmetic modulo , it is often possible to delay the reductions modulo as much as possible. One typical example is modular matrix multiplication. Assume that we have two r> matrices with coefficients modulo (represented by integers between and , say). If > fits into a machine word, then we may multiply the matrices using integer or floating point arithmetic and reduce the result modulo . This has the advantage that we may use standard, highly optimized implementations of matrix multiplication. One drawback is that the intermediate results before reduction require at least twice as much space. Also, the bitsize of the modulus is at least twice as small as . For any integer 1>, we will write =,m-1|}>>. We recall: <\render-theorem|Chinese Remainder Theorem> Let ,\,m>> be positive integers that are mutually coprime and denote *\*m>. Given \\>,\,a>\\>>>, there exists a unique \> with a |)>> for ,\>. We will prove the following more constructive version of this theorem. <\theorem> Let ,\,m>> be positive integers that are mutually coprime and denote *\*m>. There exist ,\,c>\\> such that for any \\>,\,a>\\>>>, the number <\eqnarray*> ||*a+\+c>*a>|)> rem M>>>> satisfies a |)>> for ,\>. <\notation*> We call ,\,c>> the cofactors for ,\,m>> in and also denote these numbers by ;M>=c,\,c>;M>=c>>. <\proof> If =1>, then it suffices to take =1>. If =2>, then the extended Euclidean algorithm allows us to compute a Bezout relation <\eqnarray*> *m+k*m>||>>>> where \\>> and \\>>. Let us show that we may take <\eqnarray*> >||*m\\*m>>>|>||*m\\*m>.>>>> Indeed, given \\>> and \\>>, we have <\equation*> k*m*x+k*m*x\x\c*a+c*a\k*m*a+k*m*a *m|)>. In particular, > divides *m*|)>>. Since() implies *m,m|)>=1>, it follows that a |)>>. Similarly, a |)>>. For \2>, we will use induction. Let |\/2|\>>, =m*\*m> and =m*\*m>. By induction, we may compute ;M>>, ;M>>, ;M>,\,c;M>> and ;M>,\,c>;M>>. 
We claim that we may take <\eqnarray*> >>||;M>*c;M>,h|)>>>|>>||;M>*c;M>,\|)>.>>>> Indeed, for ,h>, we get <\eqnarray*> |>|;M>*;M>*a+\+c;M>*a|)>+c;M>*;M>*a+\+c>;M>*a>|)>>>||>|;M>*a+\+c;M>*a |)>,>>>> whence a |)>>. For ,\> we obtain a |)>> in a similar way. Let ,\,m>>, *\*m>>, \\>,\,a>\\>>> and \> be as in the Chinese remainder theorem. We will refer to the computation of ,\,a>> as a function of as the problem of . The inverse problem is called . In what follows, we assume that ,\,m>> have been fixed once and for all. The simplest way to perform multi-modular reduction is to simply take <\eqnarray*> >|>|,\|)>.>>>> Inversely, Theorem provides us with a formula for multi-modular reconstruction: <\eqnarray*> |>|;M>*a+\+c>;M>*a>|)> rem M.>>>> Since ,\,m>> are fixed, the computation of the cofactors ;M>> can be regarded as aprecomputation. Let us analyze the cost of the above methods in terms of the complexity > of bit integer multiplication. Assume that \2> for ,\>. Then multi-modular reduction can clearly be done in time **n,n|)>>=*\-1+o|)>*\*>. As to multi-modular reconstruction, assume that \2>> for ,\>, where \n-|log \|\>> is such that *2>\2>. Cutting the cofactors in chunks of bits as in(), we precompute the Fourier transform of all obtained chunks. The Fourier transforms of ,\,a>> can be computed in time \*>. The sum ;M>*a+\+c>;M>*a>> can be computed in the Fourier model in time *\> and transformed back in time *\+O|)>>. Our assumption that *m\2> for ,\> ensures that the computed sum is correct. The remainder can be computed in time +1|)>*n,\*n|)>\*\+5*+O*\|)>>. The total time of the reconstruction is therefore bounded by <\eqnarray*> ,naive>>|)>>||*\+2**\+5*+O*\|)>|)>.>>>> If we only assume that \2>, then we need to increase the bitsize by |log \|\>>. If *log \=O>, then this means that we have to multiply the right-hand side of() by /n|)>=1+O|)>>. 
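The recursive construction of the cofactors in the proof can be transcribed directly. This is a sketch; the base case $\ell = 2$ uses a modular inverse rather than an explicit extended gcd, which yields the same Bezout cofactors.

```python
from math import prod

def bezout_cofactors(m1, m2):
    """Base case l = 2: cofactors of coprime m1, m2."""
    M = m1 * m2
    c2 = m1 * pow(m1, -1, m2) % M    # 1 mod m2, 0 mod m1
    c1 = (1 - c2) % M                # 1 mod m1, 0 mod m2
    return c1, c2

def cofactors(moduli):
    """Divide-and-conquer construction of the cofactors c_{m_i;M} from
    the proof: split into two halves M', M'', recurse, and multiply the
    recursive cofactors by the Bezout cofactors of (M', M'')."""
    l = len(moduli)
    if l == 1:
        return [1]
    h = (l + 1) // 2
    M1, M2 = prod(moduli[:h]), prod(moduli[h:])
    c1, c2 = bezout_cofactors(M1, M2)
    M = M1 * M2
    return ([c * c1 % M for c in cofactors(moduli[:h])] +
            [c * c2 % M for c in cofactors(moduli[h:])])
```

Reconstruction is then the single formula $x = (c_1 a_1 + \cdots + c_\ell a_\ell) \,\mathrm{rem}\, M$.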
The above complexity analysis shows that naive multi-modular recomposition can be done faster than naive multi-modular reduction. In order to make the latter operation more efficient, one may work with that were introduced in. The idea is that each remainder of the form is replaced by rem 1>. The quotient > is regarded as a real number and its remainder modulo as anumber in the interval>. If we allow ourselves to compute with exact real numbers, then this leads us to replace the relation() by <\eqnarray*> > rem 1>||>* rem 1|)>|]> rem 1,\|)>>>>> and() by <\eqnarray*> rem 1>||;M>*m|M>*|m> rem 1|)>+\+>;M>*m>|M>*>|m>> rem 1|)>|]> rem 1.>>>> For actual machine computations, we rather work with fixed point approximations of the scaled remainders. In order to retrieve the actual remainder from the scaled one rem 1>, we need a circular >-approximation of rem 1>. Now assume that ,\,m\\>>> with <\eqnarray*> >|>||log |)>|\>.>>>> Given a circular *2*n>|]>>approximation of rem 1> in *n,-\*n>> with <\eqnarray*> >|>||)>*-1|)>>,>>>> the algorithm at the end of section allows us to compute a circular *2|]>>approximation modulo of > rem 1>, by applying the formula(). Since *2\2-1>>, we may then recover the number > using one final bit multiplication. Moreover, in the FFT-model, the transforms for rem 1> need to be computed only once and the transforms for the numbers >> can be precomputed. In summary, given an approximation for the scaled remainder rem 1>, we may thus compute approximations for the scaled remainders > rem 1> in time <\eqnarray*> ,scaled>|)>>||*\+2**\+O*\|)>.>>>> From this, we may recover the actual remainders > in time *>. Scaled remainders can also be used for multi-modular reconstruction, but carry handling requires more care and the overall payoff is less when compared to the algorithm from the previous subsection. It is well-known that Chinese remaindering can be accelerated using a similar dichotomic technique as in the proof of Theorem. 
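With exact rationals, the scaled-remainder identities are easy to check; machine implementations replace the exact fractions by fixed-point circular approximations, which is where the error analysis above comes in.

```python
from fractions import Fraction
from math import prod

def scaled_rem(x, m):
    """The scaled remainder (x/m) rem 1, an exact rational in [0,1)."""
    return Fraction(x, m) % 1

moduli = [10007, 10009, 10037]
M = prod(moduli)
x = 123456789
for m in moduli:
    # reduction acts on scaled remainders by multiplication with M/m:
    #   (x/m) rem 1 = [(M/m) * ((x/M) rem 1)] rem 1
    assert ((M // m) * scaled_rem(x, M)) % 1 == scaled_rem(x, m)
    # the ordinary remainder is recovered by one final multiplication by m
    assert m * scaled_rem(x, m) == x % m
```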
This time, we subdivide =,\,m>|}>> into parts =+1>,\,m>|}>,\,\=+1>+1,\,m>|}>> with =||)>/k|\>> for ,k>. We denote =m+1>*\*m>> and assume that \k> (if \k>, then we apply the native algorithms from the previous subsections). Fast multi-modular reduction proceeds as follows. We first compute <\eqnarray*> >||,k|)>>>>> using the algorithm from the previous subsection. Next, we recursively apply fast multi-modular reduction to obtain <\eqnarray*> >|| rem m+1,\,\|)>.>>>> The computation process can be represented in the form of a so called ; seeFigure. The root of the tree is labeled by . The children of the root are the remainder trees for > modulo >, where ,k>. If needed, then the arity can be adjusted as afunction of the bitsize of the moduli and >. Fast multi-modular reconstruction is done in a similar way, following the opposite direction. We first reconstruct <\eqnarray*> >||+1>;M>*a+1>+\+c>;M>*a>|)> rem M,k|)>,>>>> followed by <\eqnarray*> ||;M>*X+\+c;M>*X|)> rem M.>>>> The computation flow can again be represented using a tree, but this time the computations are done in a bottom-up way. Following the approach from subsection, it is also possible to systematically work with fixed point approximations of scaled remainders rem 1> instead of usual remainders . In that case, the computation process gives rise to a as in Figure. Of course, the precision of the approximations has to be chosen with care. Before we come to this, let us first show how to choose . <\big-figure> ||||>||||>> <|big-figure> Example of a remainder tree with arities and at the two recursion levels. In the case of a reduction, the remainders are computed top-down. In the case of a reconstruction, they are reconstructed in a bottom-up fashion. <\big-figure> 2>|2>|2>|2>|2>>|2>|2>|2>|2>>> <|big-figure> The scaled remainder tree corresponding to Example, while using fixed point approximations for the scaled remainders. Let us first focus on multi-modular reconstruction. 
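The top-down remainder-tree process of the figure can be sketched as follows; the arity is a fixed parameter here for simplicity, whereas the algorithm in the text chooses it per level.

```python
from math import prod

def remainder_tree(x, moduli, k=2):
    """Top-down multi-modular reduction with a remainder tree of arity k:
    reduce x modulo the product M_j of each group of moduli, then recurse
    inside the groups."""
    if len(moduli) <= k:
        return [x % m for m in moduli]
    size = -(-len(moduli) // k)                   # ceil(len(moduli) / k)
    groups = [moduli[i:i + size] for i in range(0, len(moduli), size)]
    out = []
    for g in groups:
        out += remainder_tree(x % prod(g), g, k)  # X_j = x rem M_j
    return out
```

Reconstruction follows the same tree bottom-up, combining residues at each node with the precomputed cofactors of that node's moduli.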
In order to make this process as efficient as possible, the arity should be selected as a function of and> so as to make *\> as large as possible in(), while remaining negligible with respect to *\>. Let <\eqnarray*> |)>>||*n|)>|*n|)>*log log *n|)>>=o*n|)>|*n|)>>|)>.>>>> For inborn FFT schemes, we notice that <\eqnarray*> ||)>>||*n|)>|log log *n|)>>|)>>>||)>>|| log *n|)>|log *n|)>>|)>=O log n|log n>|)>.>>>> For the root of the remainder tree, we take <\eqnarray*> =\|)>>||>|\\\|)>>>||>|\>>|\\\|)>\\>>||)>>|>>>>>>>>> Using the same formula recursively for the other levels of the remainder tree, it is natural to define the sequences ,\,\> and ,\,k> by =\>, =\|)>> and =|\/k|\>> for ,r>>; the construction stops as soon as =1>. Notice that we always have >\\>>. The precise choice of |)>> is motivated by the following lemmas. <\lemma> If 1>, then =O|)>>. <\proof> We clearly cannot have \\|)>>, since otherwise =1>. If \\|)>\\>, then <\equation*> \=|\/|>|\>|\>=O|)>\O|)>|)>=O|)>. If |)>\\>, then <\equation*> \=|\/\|)>|\>\\\\|)>=O|)>. This proves the result in all cases. <\lemma> If 1>, then we have *\*k*\\\> for ,r+1>. <\proof> For ,r>-1>, we have \\|)>=O log n/log n|)>>, so that \3> and \\/2> whenever is sufficiently large. By construction, we also have <\eqnarray*> *\>|>|>|)>*\.>>>> By induction, it follows that <\eqnarray*> *\*k*\>|>|>|)>*\*>|)>*\>>||>|>+\+>|)>*\>>||>|>|)>*\>>||>|,>>>> for ,r>. We also have *\*k*\=k*\*k*\\\>. <\lemma> We have |)>*|log log *n|)>>>. <\proof> Let \0> and let be smallest such that |)>\|)>*log log |)>-1>. For all s>, we have \|)>*log log |)>+log \>, whence <\eqnarray*> |>|||)>*log log *n|)>>.>>>> Let 0> be a constant such that |)>\c**n|)>|log log *n|)>>> for all >. 
We also have <\equation*> log*n|)>|log log *n|)>>|)>\log \|)>\|)>*log log |)>, so that <\eqnarray*> *n|)>>|| log *n|)>**n|)>|)>>|)>>>||| log *n|)>**n|)>|)>>|)>.>>>> Since \2*\> for all r>, it follows that <\eqnarray*> |>||log 2>=O log *n|)>**n|)>|)>>|)>.>>>> Adding up() and() while letting > tend to zero, the result follows. <\lemma> We have =O>. <\proof> By construction, \\|)>=O*n|)>|)>>. For large , this means that \n>, since otherwise =O|)>|)>=O|)>>, which is impossible. Consequently, =O|)>|)>=O>. Let >|)>> be the complexity multi-modular reconstruction for fixed moduli ,\,m>> with \2> for ,\>. <\theorem> If integer multiplication is based on an inborn FFT scheme, then <\eqnarray*> >|)>>|>|+o|)>**n|)>*max|log log |)>>,1+O|)>|)>.>>>> This bound holds uniformly in > for \>. <\proof> In the special case when , the bound() yields <\eqnarray*> >|)>>|>||)>|)>**\+2**\+5*|)>+O*\|)>>>||||)>|)>**\+|)>|)>**\>>||>||)>|)>**\+|)>|)>**\|)>*\>>||||)>+o|)>**\>>|||+o|)>**\+O|)>,>>>> and we are done. If 1>, then() implies <\eqnarray*> >|)>>|>||)>|)>*|)>*k+|)>|)>*|)>*k+>|)>*k>>||>||)>*|)>*k+>|)>*k,>>>> for ,r>. By induction, and using the fact that >=0>, we get <\eqnarray*> |>|)>>|>||)>*|)>*k*\*k.>>||||)>*|)>*|\>.>>||>||)>*|)>>>||||)>*r*|)>.>>>> The result now follows from Lemma and(). <\remark> For =o> and =O>, the bound() simplifies into <\eqnarray*> >|)>>|>|+o|)>**n|)>.>>>> If |)>>, then the bound becomes <\eqnarray*> >|)>>|>|+o|)>**n|)>*|log log \>.>>>> <\remark> It is interesting to examine the cost of the precomputations as a function of the parameters , > and ,\,m>>. For a node of the remainder tree at level , we essentially need to compute > cofactors and their transforms. This can be done in time *|)>|)>>. Since we have *\*k> nodes at level , the total precomputation time at level is therefore bounded by *\*k*|)>|)>=O*|)>|)>>. Now =o|)>/log log |)>|)>=o|)>/log log |)>|)>>. 
Consequently, the total precomputation time is bounded by <\eqnarray*> >|)>>|||)>*|)>|log log |)>>|)>>>||||)>*|)>*log \|log log |)>>|)>.>>>> Let us now consider the complexity |)>> of multi-modular reduction for fixed moduli ,\,m>> with \2> for ,\>. In this case, it is most efficient to work with scaled remainders, so the algorithm contains three main steps: <\enumerate> The initial conversion of into (an approximation of) rem 1>. The computation of (approximations of) the scaled remainders > rem 1>. The final conversions of (approximations of) > rem 1> into >. At a first stage, we will assume that ,\,m>\2>>, where \n> is sufficiently small such that the final approximations of the scaled remainders > rem 1> allow us to recover the usual remainders >. Let |~>|)>> denote the cost of step 2. The conversions in steps 1 and 3 boil down to multiplications with fixed arguments, so that <\eqnarray*> >|)>>|>||~>|)>++o|)>*|)>.>>>> For step 2, we use the scaled remainder tree algorithm from subsection, while taking the arities as in subsection. Our next task is to show that \n-|log|)>|\>> is small enough in order to recover all remainders >. <\lemma> There exists a constant > such that for all n> and ,r>, we have <\eqnarray*> >|>||log|)>|\>*\>=2|)>*\>.>>>> <\proof> For the result clearly holds, so assume that r>. In particular, if is sufficiently large, then it follows that \|>|\>>. Now assume for contradiction that \2|log|)>|\>*\>\2\2>>. Then we would get \k*\\k*log k\|>|\>*log |>|\>>. This is impossible for \2>. Now starting with a circular *n>>-approximation of rem 1>, the scaled reduction algorithm from subsection yields circular *2>|]>>-approximations for the scaled remainders |M> rem 1> at level . Lemma now shows that =2*k> is sufficiently small for a continued application of the same algorithm for the next level. 
By induction over , the same reasoning shows that the scaled remainders at the >-th level are computed with an error below <\equation*> 2*k*2*n>\2|)>*\>\2*n>\2|)>*\*-1|)>>\2*n>. At the very end, we obtain *2|)>>-approximations for the scaled remainders > rem 1>. Since =k>, this allows us to reconstruct each remainder > using a multiplication by >. This shows that > is indeed small enough for the algorithm to work. <\theorem> If integer multiplication is based on an inborn FFT scheme, then <\eqnarray*> |)>>|>|+o|)>**n|)>*|log log |)>>,1|)>+2|]>.>>>> This bound holds uniformly in > for \>. <\proof> A similar cost analysis as in the proof of Theorem yields <\eqnarray*> |~>|)>>|>|*\+2**\+O*\|)>=+o|)>**\>>>> when and <\eqnarray*> |~>|)>>|>||)>*r*|)>>>>> when 1>. In both cases, combination with() yields <\eqnarray*> >|)>>|>|+o|)>**n|)>*|log log |)>>,1|)>+2|]>.>>>> Notice also that =n-|log|)>|\>\n-O>, by Lemma. Given a number >\n>, we may construct a similar sequence >,\,\>+1>>> when using>> in the role of . Taking >> minimal such that >-|log>>>|)>|\>\n>, we have <\eqnarray*> |)>>|>|+o|)>**n>|)>*|log log >*\|)>>,1|)>+2|]>.>>>> Moreover, n>-O>|)>>, which implies >\n+O>. Plugging this into(), the result follows, since |)>*\|)>\log log |)>> and the assumption that /> is increasing implies *|)>|)>\*n|)>>. <\remark> For =o> and =O>, the bound() simplifies into <\eqnarray*> |)>>|>||)>**n|)>.>>>> If |)>>, then the bound() becomes <\eqnarray*> |)>>|>|+o|)>**n|)>*|log log |)>>.>>>> For very large > with |)>>, this yields <\eqnarray*> |)>>|>|+o|)>**n|)>*|log log \>.>>>> <\remark> Using a similar analysis as in Remark, the cost of all precomputations as a function of , > and ,\,m>> is again bounded by <\eqnarray*> |)>>|||)>*|)>*log \|log log |)>>|)>.>>>> The approach of this section can also be used for the related problem such as base conversion. Let \>> and > be a fixed base and order. 
Given a number \>>>, the problem is to compute ,\,x-1>\\> with <\eqnarray*> ||+x*b+\+x-1>*b-1>.>>>> Inversely, one may wish to reconstruct from ,\,x-1>>. It is well known that both problems can be solved using asimilar remainder tree process as in the case of Chinese remainders. The analogues for the formulas() and() are() and <\eqnarray*> |b> rem 1>||-1-i>*>> rem 1|)>|]> rem 1,\-1|)>.>>>> The analogue of the recursive process of subsection reduces a problem of size > to similar problems of size |\/k|\>> and one similar problem of size but for the base |\/k|\>>>. Aroutine verification shows that the complexity bounds() and() also apply in thiscontext. Moreover, for nodes of the remainder tree at the same level, the analogues of the cofactors and the multiplicands > in() do not vary as a function of the node. For this reason, the required precomputations as a function of and > can actually be done much faster. This makes it possible to drop the hypothesis that and > are fixed and consider these parameters as part of the input. Let us denote by |)>> and >|)>> the complexities of computing ,\,x-1>> as a function of and . <\theorem> If integer multiplication is based on an inborn FFT scheme, then <\eqnarray*> >|)>>|>|+o|)>**n|)>*|log log |)>>+O|)>|)>>>||)>>|>|+o|)>**n|)>*|log log |)>>+O|)>|)>.>>>> These bound holds uniformly in > for \>. <\proof> Let us estimate the cost of the precomputations as a function of and >. The analysis is similar as in Remark except that we only have to do the precomputations for a single node of the tree at each level . Consequently, the precomputation time is now bounded by*|)>+\+k*|)>|)>>. Since the ,\,\> decrease with at least geometric speed, this cost is dominated by *|)>|)>=o|)>*|)>|log log |)>>|)>>. This proves the result under the condition that |)>>. If =o>, then we need to construct ,\,k> in a slightly different way. 
Assuming that is sufficiently large, let be maximal such that <\eqnarray*> >>|>|||2>>|\>|)>.>>>> Notice that <\eqnarray*> |>||log log \|\>.>>>> We again set =\> and =|\/k|\>>. This time, we take =2>> for t> and proceed with the usual construction =k|)>> for t>. It follows that <\eqnarray*> >||||2>>|\>,t+1|)>>>|>|>||)>,t|)>>>|>|||)>\2>>>|*\>|||)>>>>> and <\eqnarray*> |>||log log |)>>+o.>>>> Using the new bound for , a complexity analysis similar to the ones in the proofs of Theorems and yields the bounds() and() if we disregard the cost of the precomputations. Now the cost > of the precomputations for the first levels is bounded by <\eqnarray*> >||*|)>+k*|)>+\+k*|)>|)>>>||||2>|)>+4*|4>|)>+\+2>*|2>>|)>|)>>>||||)>|)>>>>> and the cost > for the remaining levels by <\eqnarray*> >||*|)>+\+k*|)>|)>>>|||*|)>|)>>>|||*|\>*|)>|)>>>||||)>|)>.>>>> We conclude that the right-hand sides of() and() absorb the cost +> of the precomputations. It is quite straightforward to adapt the theory of this section to univariate polynomials instead of integers. An example of this kind of adaptation can be found in. In particular, this yields efficient algorithms for multi-point evaluation and interpolation in the case when the evaluation points are fixed. The analogues of our algorithms for base conversion yield efficient methods for -adic expansions and reconstruction. More precisely, let be an effective commutative ring and let > be the cost of multiplying two polynomials of degree n> in >. Assume that allows for inborn FFT multiplication. Then =3*+>, where > and > satisfy properties similar to those in the integer case. Let ,\,Q>> be > monic polynomials of degree .
Given a polynomial of degree \*n> in >, we may then compute the remainders > for ,\> in time <\eqnarray*> |)>>|>|+o|)>**n|)>*|log log |)>>,1|)>+2|]>.>>>> The reconstruction of from these remainders can be done in time <\eqnarray*> >|)>>|>|+o|)>**n|)>*max|log log |)>>,1+O|)>|)>.>>>> The assumption that admits a suitable inborn FFT scheme is in particular satisfied if is a finite field. When working in an algebraic complexity model, this is still the case if is any field of positive characteristic. For general fields of characteristic zero, the best known FFT schemes rely on the adjunction of artificial roots of unity. In that case, our techniques only result in an asymptotic speed-up by a factor |)>> instead of |)>>. Nevertheless, the field of complex numbers does admit roots of unity of any order, and our algorithms can be used for fixed point approximations of complex numbers at any precision. Multi-point evaluation has several interesting applications, but it is important to keep in mind that our speed-ups only apply when the moduli are fixed. For instance, assume that we computed approximate zeros ,\,z>> to a complex polynomial of degree >, using a bit-precision . Then we may use multi-point evaluation in order to apply Newton's method simultaneously to all roots and find approximations of bit-precision 2*n>. Unfortunately, our speed-up does not work in this case, since the approximate zeros ,\,z>> are not fixed. On the other hand, if the polynomial has degree > instead of > and we are still given > approximate zeros ,\,z>> (say all zeros in some disk), then the same moduli are used times, and one may hope for some speed-up when using our methods. At the other extreme, it is instructive to consider the approximate evaluation of a fixed polynomial +\+P-1>*x-1>> with fixed point coefficients ,\,P-1>\\> at a single point \>. We may thus assume that suitable Fourier transforms of the > have been precomputed. Now we rewrite >+\+P>*|)>> with |>|\>>, |\/k|\>> and >=P+\+P*x>.
In order to evaluate each of the >> at , it suffices to transform the numbers ,x>, to perform the evaluations in the Fourier representation and then transform the results back. This can be achieved in time |)>+>+d*>. We may finally compute =P>+\+P>*|)>> using Horner's method, in time |)>>. For large >, the dominant term of the computation time is \\*>. Let us now reconsider the naive algorithms from section, but in the case when the moduli ,\,m>> are all close to a specific power of two. More precisely, we assume that <\eqnarray*> >||+\,\|)>,>>>> where |\|>\2> and 2> is a small number. As usual, we assume that the > are pairwise coprime and we let *\*m>>. For such moduli, the naive algorithm for the euclidean division of a number \*s*w>>> by > becomes particularly simple and essentially boils down to the multiplication of > with the quotient of this division. In other words, the remainder can be computed in time *s*+O*s*w|)>> instead of *s*w,s*w|)>>. For small values of >, and , this gives rise to a speed-up by a factor at least. More generally, the computation of > remainders =x rem m,\,a>=x rem m>> can be done in time *s*+O*s*w|)>>. Multi-modular reconstruction can also be done faster, as follows, using a technique similar to the one in. Let \>. Besides the usual binary representation of and the multi-modular representation ,\,a>|)>>=,\,x rem m>|)>>, it is also possible to use the Newton representation <\eqnarray*> ||+b*m+b*m*m+\+b>*m*\*m-1>,>>>> where \\>>. Let us now show how to obtain ,\,b>|)>> efficiently from ,\,a>|)>>. Since =b=a>, we must take =a>. Assume that ,\,b> have been computed. For ,1>, we next compute <\eqnarray*> >||+b*m+\+b*m*\*m|)> rem m>>>> using =b> and <\eqnarray*> >||+u*m|)> rem m>>|||+u\-\|)>|)> rem m,1|)>.>>>> Notice that ,\,u> can be computed in time **+O>. We have <\eqnarray*> >||+b*m*\*m|)> rem m=a.>>>> Now the inverse > of *\*m> modulo > can be precomputed. We finally compute <\eqnarray*> >||*-u|)> rem m,>>>> which can be done in time +O>.
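The two conversions just described, from the residues to the Newton (mixed-radix) representation and back, can be sketched in Python. This is a plain sketch of the arithmetic only, ignoring the word-size optimizations that this section is about; the function names are ours:

```python
from math import prod

def to_newton(residues, moduli):
    """Newton coefficients b with x = b[0] + b[1]*m[0] + b[2]*m[0]*m[1] + ...,
    given the residues x rem m[i] for pairwise coprime moduli m[i]."""
    b = [residues[0] % moduli[0]]
    for i in range(1, len(moduli)):
        m = moduli[i]
        # u = (b[0] + b[1]*m[0] + ... + b[i-1]*m[0]*...*m[i-2]) rem m,
        # evaluated by Horner's rule
        u = 0
        for j in reversed(range(i)):
            u = (u * moduli[j] + b[j]) % m
        # precomputable inverse of m[0]*...*m[i-1] modulo m[i]
        c = pow(prod(moduli[:i]), -1, m)
        b.append((residues[i] - u) * c % m)
    return b

def from_newton(b, moduli):
    """Reconstruct x from its Newton representation, bottom-up."""
    x = b[-1]
    for i in reversed(range(len(b) - 1)):
        x = b[i] + moduli[i] * x
    return x
```

In an optimized implementation the inverses `c` are precomputed once, since the moduli are fixed.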
For small values of , we notice that it may be faster to divide successively by ,\,m> modulo > instead of multiplying by >. In total, the computation of the Newton representation ,\,b>|)>> can be done in time |2>**+*\+O*s*w|)>>. Having computed the Newton representation, we next compute <\eqnarray*> >||+b*m+\+b>*m*\*m-1>>>>> for ,\,1>, using the recurrence relation <\eqnarray*> >||+x*m.>>>> Since \\-i|)>*s*w>>>, the computation of > takes time *s*+O*s*w|)>>. Altogether, the computation of > from ,\,a>|)>> can therefore be done in time |2>**+*\+O*s*w|)>\\*s*>. For practical applications, we usually wish to work with moduli that fit into one word or half a word. Now the algorithm from the previous subsection is particularly efficient if the numbers > also fit into one word or half a word. This means that we need to impose the further requirement that each modulus > can be factored <\eqnarray*> >||*\*m,>>>> with ,\,m\2>. If this is possible, then the > are called -gentle moduli>. For given bitsizes and 2>, the main questions are now: do such moduli indeed exist? If so, how can we find them? If , then it is easy to construct -gentle moduli =2+\> by taking =-\>, where \\2/2>> is odd. Indeed, <\eqnarray*> -\>||+\|)>*-\|)>>>>> and +\,2-\|)>=gcd+\,2*\|)>=gcd+\,\|)>=gcd,\|)>=1>. Unfortunately, this trick does not generalize to higher values 3>. Indeed, consider a product <\eqnarray*> +\|)>*\*+\|)>>||++\+\|)>*2*w>+\>>|||+\+\|)>-+\+\|)>|)>*2*w-1>+\,>>>> where ,\,\> are small compared to >. If the coefficient +\+\> of *w>> vanishes, then the coefficient of *w-1>> becomes the opposite +\+\|)>> of a sum of squares. In particular, both coefficients cannot vanish simultaneously, unless =\=\=0>>. If 2>, then we are left with the option to search for -gentle moduli by brute force. As long as is \Preasonably small\Q (say 8>), the probability of hitting an -gentle modulus for a randomly chosen > often remains significantly larger than >. We may then use sieving to find such moduli.
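The s = 2 construction above can be checked directly in a few lines of Python; the parameter values w and delta below are illustrative choices of ours, not taken from the paper:

```python
from math import gcd

def two_gentle(w, delta):
    """m = 2**(2*w) - delta**2 = (2**w - delta) * (2**w + delta),
    whose two factors are coprime whenever delta is odd."""
    assert delta % 2 == 1 and 0 < delta < 2 ** (w // 2)
    m1, m2 = 2 ** w - delta, 2 ** w + delta
    assert m1 * m2 == 2 ** (2 * w) - delta ** 2
    # gcd(2**w + delta, 2**w - delta) divides 2*delta, hence equals 1
    assert gcd(m1, m2) == 1
    return m1, m2
```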
In view of the above, it is also desirable to systematically take =-\> for \\2/2>>. This has the additional benefit that we \Ponly\Q have to consider /2>> possibilities for >. We implemented a sieving procedure in that uses the package with an interface to . Given parameters > and >, the goal of our procedure is to find -gentle moduli of the form <\eqnarray*> ||-\|)>*-\|)>=m*\*m>>>> with the constraints that <\eqnarray*> >|>|>>>|,2>!|)>>||>>> for ,s>, and \\\m>. The parameter is small and even. One should interpret and > as the intended and maximal bitsize of the small moduli >. The parameter > stands for the minimal bitsize of a prime factor of >. The parameter > should be such that > fits into a machine word. Table below shows some experimental results for this sieving procedure in the case when s=6, w=22, w′=25 and =4>. For \1000000>, the table provides us with >, the moduli ,\,m>, as well as the smallest prime power factors of the product. Many hits admit small prime factors, which increases the risk that different hits are not coprime. For instance, the number divides both -311385> and -376563>, whence these -gentle moduli cannot be selected simultaneously (unless one is willing to sacrifice a few bits by working modulo -311385,2-376563|)>> instead of -311385|)>\-376563|)>>). When using multi-modular arithmetic for computations with rational numbers instead of integers (see5 and, more particularly, section5.10>), small prime factors should be prohibited altogether, since they increase the probability of divisions by zero. For such applications, it is therefore desirable that ,\,m> are all prime. In our table, this occurs for =57267> (we indicated this by highlighting the list of prime factors of ). In order to make multi-modular reduction and reconstruction as efficient as possible, a desirable property of the moduli > is that they either divide -\> or +\>. In our table, we highlighted the > for which this happens.
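The sieving procedure can be illustrated at toy scale. The sketch below is a simplified variant of ours, not the paper's implementation: it relaxes the constraint on the size of delta, uses trial division instead of Pari, and packs the prime power factors greedily into s pairwise coprime moduli (a smarter packing could find more hits):

```python
from collections import Counter

def factorize(n):
    """Prime factors of n by trial division (fine at toy scale)."""
    fs, p = [], 2
    while p * p <= n:
        while n % p == 0:
            fs.append(p)
            n //= p
        p += 1
    if n > 1:
        fs.append(n)
    return fs

def pack(units, s, bound):
    """Greedily pack full prime powers into s coprime factors <= bound."""
    buckets = [1] * s
    for u in sorted(units, reverse=True):
        for i in range(s):
            if buckets[i] * u <= bound:
                buckets[i] *= u
                break
        else:
            return None
    return buckets

def sieve(s, w, w2, dmax):
    """Odd delta for which 2**(s*w) - delta**2 splits into s
    pairwise coprime moduli of at most w2 bits."""
    hits = []
    for delta in range(1, dmax, 2):
        m = 2 ** (s * w) - delta ** 2
        units = [p ** e for p, e in Counter(factorize(m)).items()]
        buckets = pack(units, s, 2 ** w2)
        if buckets is not None and all(b > 1 for b in buckets):
            hits.append((delta, sorted(buckets)))
    return hits
```

For instance, `sieve(4, 6, 8, 200)` searches for products 2**24 - delta**2 that split into four coprime moduli of at most 8 bits.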
We notice that this is automatically the case if ,\,m> are all prime. If only a small number of> (say a single one) do not divide either -\> or +\>, then we remark that it should still be possible to design reasonably efficient algorithms for multi-modular reduction and reconstruction. Another desirable property of the moduli \\\m> is that > is as small as possible: the spare bits can for instance be used to speed up matrix multiplication modulo >. Notice however that one \Poccasional\Q large modulus > only impacts one out of modular matrix products; this alleviates the negative impact of such moduli. We refer to section below for more details. For actual applications, one should select gentle moduli that combine all desirable properties mentioned above. If not enough such moduli can be found, then it depends on the application which criteria are most important and which ones can be relaxed. <\big-table|||||||||>>|>>|>>|>>|>>|>>|>>|>,p>,\>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||,151,1879,\>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>>>>>> List of 6-gentle moduli for w=22, w′=25, =4> and \1000000>. Influence of the parameters , and >> Ideally speaking, we want to be as large as possible. Furthermore, in order to waste as few bits as possible, > should be close to the word size (or half of it) and -w> should be minimized. When using double precision floating point arithmetic, this means that we wish to take \>. Whenever we have efficient hardware support for integer arithmetic, then we might prefer >. Let us start by studying the influence of -w> on the number of hits. In Table, we have increased w by one with respect to Table. This results in an approximate increase of the \Pcapacity\Q of the modulus . On the one hand, we observe that the hit rate of the sieve procedure roughly decreases by a factor of thirty.
On the other hand, we notice that the rare gentle moduli that we do find are often of high quality (on four occasions the moduli ,\,m> are all prime in Table). <\big-table||||||||||>>|>>|>>|>>|>>|>>|>>|>,p>,\>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>>>>>> List of 6-gentle moduli for w=23, w′=25, =4> and \16000000>. Unsurprisingly, the hit rate also decreases sharply if we attempt to increase . The results for s=8 and w=22 are shown in Table. A further unfortunate side effect is that the quality of the gentle moduli that we do find also decreases. Indeed, on the one hand, tends to systematically admit at least one small prime factor. On the other hand, it is rarely the case that each > divides either -\> or +\> (this might nevertheless be the case for other recombinations of the prime factors of , but only at the expense of a further increase of >). <\big-table|||>>|>>|>>|>>|>>|>>|>>|>>|>>|>,p>,\>>>||||||||||>>>||||||||||>>>||||||||||>>>||||||||||>>>||||||||||>>>||||||||||>>>||||||||||>>>||||||||||>>>||||||||||>>>||||||||||>>>>>>>> List of 8-gentle moduli for w=22, w′=25, =4> and \10000000>. An increase of > while maintaining and -w> fixed also results in a decrease of the hit rate. Nevertheless, when going from w′=25 (floating point arithmetic) to w′=31 (integer arithmetic), this is counterbalanced by the fact that > can also be taken larger (namely \2>>); see Table for a concrete example. When doubling and > while keeping the same upper bound for >, the hit rate remains more or less unchanged, but the rate of high-quality hits tends to decrease somewhat: see Table. It should be possible to analyze the hit rate as a function of the parameters , , > and > from a probabilistic point of view, using the idea that a random number is prime with probability >. However, from a practical point of view, the priority is to focus on the case when \64>.
For the most significant choices of parameters \w\w\64> and, it seems possible in principle to compile full tables of -gentle moduli. Unfortunately, our current implementation is still somewhat inefficient for \32>. A helpful feature for upcoming versions of would be a function to find all prime factors of an integer below a specified maximum >> (the current version only does this for prime factors that can be tabulated). <\big-table|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||>>|>>|>>|>>|>>|>>|>>|>,p>,\>>>||||||||,1480933,\>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>||||||||>>>>>>>> List of 6-gentle moduli for w=28, w′=31, =4> and \1600000>. Followed by some of the next gentle moduli for which each > divides either -\> or +\>. <\big-table|||||||||||||||||||||||||||||||||||||||||||||>>|>>|>>|>>|>>|>>|>,p>,\>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>|||||||>||||>>|||>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>||||>>|||>>>>>>>> List of 6-gentle moduli for w=44, w′=50, =4> and \200000>. Followed by some of the next gentle moduli for which each > divides either -\> or +\>. One of our favourite applications of multi-modular arithmetic is the multiplication of integer matrices \r>>. We proceed as follows: <\enumerate> Compute > and > for ,\>, using > multi-modular reductions. Multiply \|)>*|)> rem m> for ,\>. Reconstruct using > multi-modular reconstructions. If is larger than |\|>> for all and , then can be read off from . From a practical point of view, the second step can be implemented very efficiently if > fits into a machine word. When using floating point arithmetic, this means that we should have \2> for all . For large values of , this is unrealistic; in that case, we subdivide the r> matrices into smaller \r> matrices with *m\2>. The fact that > may depend on is very significant.
First of all, the larger we can take >, the faster we can multiply matrices modulo >. Secondly, the > in the tables from the previous sections often vary in bitsize. It frequently happens that we may take all > large except for the last modulus >>. The fact that matrix multiplications modulo the worst modulus >> are somewhat slower is compensated by the fact that they only account for one out of every > modular matrix products. Several of the tables in the previous subsections were made with the application to integer matrix multiplication in mind. Consider for instance the modulus *\*m=2-656997> from Table. When using floating point arithmetic, we obtain \82713>, \1939>, \140>, \61>, \14> and \13>. Clearly, there is a trade-off between the efficiency of the modular matrix multiplications (high values of > are better) and the bitsize \*w> of (larger capacities are better). If is large with respect to M>, then the modular matrix multiplication step is the main bottleneck, so it is important to take all > of approximately the same size ( -w> should be small) and in such a way that the corresponding > lead to the best complexity ( r\6> tends to work well). This can often only be achieved by lowering to or . For closer to M>, the Chinese remaindering steps become more and more expensive, which makes it interesting to consider larger values of and to increase the difference -w>. For significantly below M>, we resort to FFT-based matrix multiplication. This corresponds to taking roughly twice as many moduli, but the transformations become approximately times less expensive. We have not yet implemented any of the algorithms in this paper. The implementation that we envision selects appropriate code as a function of the parameters , > and >. The parameters and > highly depend on the application: see the above discussion in the case of integer matrix multiplication. Generally speaking, > is bounded by the bitsize of a machine word or by .
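The multi-modular matrix multiplication scheme described above can be sketched as follows. The reconstruction here is the naive Chinese remaindering formula; speeding up precisely this step is the subject of the paper, and the function names are ours:

```python
from math import prod

def matmul_mod(A, B, m):
    """Schoolbook product of square matrices over Z/mZ."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) % m
             for j in range(n)] for i in range(n)]

def crt(residues, moduli):
    """Naive Chinese remainder reconstruction for coprime moduli."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * (Mi * pow(Mi, -1, m))
    return x % M

def multi_modular_matmul(A, B, moduli):
    """C = A * B over Z, assuming prod(moduli) exceeds twice the
    largest possible absolute value of an entry of C."""
    n = len(A)
    M = prod(moduli)
    # one modular matrix product per modulus
    Cs = [matmul_mod([[a % m for a in row] for row in A],
                     [[b % m for b in row] for row in B], m)
          for m in moduli]
    # entry-wise reconstruction
    C = [[crt([Cm[i][j] for Cm in Cs], moduli) for j in range(n)]
         for i in range(n)]
    # map back to signed representatives
    return [[c - M if c > M // 2 else c for c in row] for row in C]
```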
So far, we have described four main strategies for solving Chinese remaindering problems for moduli *\*m>>: <\description> The \Pgentle modulus strategy\Q requires -\> to be an -gentle modulus with >. Conversions between and -\|)>,x rem +\|)>|)>> can be done fast. The further conversions from and to ,\,x rem m|)>> are done in a naive manner. The \Pgentle product strategy\Q requires ,\,m>> to be gentle moduli. We now perform the multi-modular reductions and recombinations using the algorithms from subsection. The \Pnaive strategy\Q for Chinese remaindering has been described in subsections and. The asymptotically fast \PFFT-based strategy\Q from Theorems and, which attempts to do as much work as possible using Fourier representations. The idea is now to apply each of these strategies at the appropriate levels of the remainder tree. At the very bottom, we use , followed by and . At the top levels, we use . As a function of >, we need to decide how much work we wish to perform at each of these levels. For small \64>, it suffices to combine the strategies and . As soon as /2> exceeds , some of the modular reductions in the strategy may become expensive. For this reason, it is generally better to let do a bit more work and to take \\>. It is also a good practice to use as the \Psoft word size\Q for multiple integer arithmetic (see subsection). As soon as > starts to get somewhat larger, say \\256>, then some intermediate levels may be necessary before FFT multiplication becomes fully efficient. For these levels, we use strategy with arity two. It remains to be determined exactly how large this intermediate region is. Indeed, the ability to do more work using the Fourier representation often lowers the threshold at which such methods become efficient.
If one manages to design extremely efficient implementations for the strategies and (i.e. if multi-modular reduction and reconstruction can be done approximately as fast as multiplication itself), then one may also consider the use of Chinese remaindering instead of Fourier transforms in. For large \256>, the FFT-based algorithms should become the fastest, and we expect that the theoretical |)>> speed-up should result in practical gains of a factor three at least. As emphasized before, it is crucial to rely on inborn FFT strategies. For very large >, we may also run out of -gentle moduli. In that case, we may need to resort to lower values of , with the consequence that some of the lower levels may become somewhat more expensive. A very intriguing question is whether it is possible to select moduli that allow for even faster Chinese remaindering. In the analogous case of polynomials, highly efficient algorithms exist for multi-point evaluation and interpolation at points that form a geometric sequence. Geometric sequences of moduli do not work in our case, since they violate the mutual coprimality requirement and they grow too fast anyway. Arithmetic progressions are more promising, although only algorithms with a better constant factor are known in the polynomial case, and the elements of such sequences generically admit small prime factors in common that have to be treated with care. Another natural idea is to choose products *\*m>> of the form -1>, where is highly composite. Each divisor of naturally induces a divisor > of , where > denotes the -th cyclotomic polynomial. For instance, the number 60>-1> is divisible by |)>=2-1>, |)>=2+1>, |)>=2+2+1> and |)>=2-2+1>. Now euclidean division by |)>,\|)>,\|)>> and |)>> is easy since these numbers admit extremely sparse and regular binary representations. Furthermore, the numbers |)>=m*m> and |)>=m*m> can both be factored into products of two integers of less than bits.
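The splitting of -1> into cyclotomic values can be verified with a few lines of Python, using the Möbius formula Phi_d(x) = prod over e dividing d of (x^(d/e) - 1)^mu(e); in particular, 2^60 - 1 is the product of Phi_d(2) over all divisors d of 60:

```python
from fractions import Fraction

def mobius(n):
    """Möbius function by trial division."""
    mu, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0       # square factor
            mu = -mu
        p += 1
    return -mu if n > 1 else mu

def cyclotomic_at_2(d):
    """Phi_d(2) via Phi_d(x) = prod_{e | d} (x**(d//e) - 1)**mobius(e)."""
    val = Fraction(1)
    for e in range(1, d + 1):
        if d % e == 0:
            val *= Fraction(2 ** (d // e) - 1) ** mobius(e)
    return int(val)
```

For instance, `cyclotomic_at_2(12)` yields 13, the value 2^4 - 2^2 + 1 mentioned in the text.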
Taking =2-1> and =2+1>, it should therefore be possible to design a reasonably efficient Chinese remaindering scheme for *m*m*m*m*m> on a 64-bit architecture. There are several downsides to this approach. Most importantly, the largest prime divisor > of -1> grows quite quickly with , even if is very smooth. For instance, the four largest for which \2> are and the four largest for which \2> are . By construction, the number -1> also admits many divisors if is smooth, so several moduli of this type are generally not coprime. Both obstructions together imply that we can only construct Chinese remaindering schemes with a small number of moduli in this way. For instance, on a 64-bit architecture, one of the best schemes would use and twelve moduli of an average size of bits. In comparison, there are many -gentle and -gentle moduli that are products of prime numbers of approximately bits (or more), and combining a few of them leads to efficient Chinese remaindering schemes for twelve (or more) prime moduli of approximately bits. <\bibliography|bib|plain|all.bib> <\bib-list|10>
D. Bernstein. Removing redundancy in high precision Newton iteration. Available from , 2000.
D. Bernstein. Scaled remainder trees. Available from , 2004.
A. Borodin and R. T. Moenck. Fast modular transforms. , 8:366–386, 1974.
A. Bostan, G. Lecerf, and É. Schost. Tellegen's principle into practice. In , pages 37–44. ACM Press, 2003.
A. Bostan and É. Schost. Polynomial evaluation and interpolation on special sets of points. , 21(4):420–446, August 2005. Festschrift for the 70th Birthday of Arnold Schönhage.
D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. , 28:693–701, 1991.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. , 19:297–301, 1965.
C. M. Fiduccia. Polynomial evaluation via the division algorithm: the fast Fourier transform revisited. In A. L. Rosenberg, editor, , pages 88–93, 1972.
M. Fürer. Faster integer multiplication. , 39(3):979–1005, 2009.
J. von zur Gathen and J. Gerhard. . Cambridge University Press, New York, NY, USA, 3rd edition, 2013.
T. Granlund et al. GMP, the GNU multiple precision arithmetic library. , 1991.
G. Hanrot, M. Quercia, and P. Zimmermann. The middle product algorithm I. Speeding up the division and square root of power series. Accepted for publication in AAECC, 2002.
D. Harvey and J. van der Hoeven. On the complexity of integer matrix multiplication. Technical report, HAL, 2014. , accepted for publication in JSC.
D. Harvey, J. van der Hoeven, and G. Lecerf. Faster polynomial multiplication over finite fields. Technical report, ArXiv, 2014. , accepted for publication in JACM.
D. Harvey, J. van der Hoeven, and G. Lecerf. Even faster integer multiplication. , 36:1–30, 2016.
J. van der Hoeven. Newton's method and FFT trading. , 45(8):857–878, 2010.
J. van der Hoeven and G. Lecerf. Faster FFTs in medium precision. In , pages 75–82, June 2015.
J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002. .
J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. , 43(1):5:1–5:37, 2016.
R. T. Moenck and A. Borodin. Fast modular transforms via division. In , pages 90–96, Univ. Maryland, College Park, Md., 1972.
C. H. Papadimitriou. . Addison-Wesley, 1994.
The PARI Group, Bordeaux. , 2012. Software available from .
J. M. Pollard. The fast Fourier transform in a finite field. , 25(114):365–374, 1971.
A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. , 7:281–292, 1971.
Contents

1. Introduction
2. Preliminaries
  2.1. Integer multiplication
  2.2. Euclidean division of integers
  2.3. Approximate products modulo one
  2.4. Machine arithmetic
3. Asymptotically fast Chinese remaindering
  3.1. The Chinese remainder theorem
  3.2. Naive multi-modular reduction and reconstruction
  3.3. Scaled remainders
  3.4. Remainder trees
  3.5. Specification of the arities of nodes in the remainder tree
  3.6. Complexity analysis of multi-modular reconstruction
  3.7. Complexity analysis of multi-modular reduction
  3.8. Base conversion
  3.9. Polynomial analogues
4. Gentle moduli
  4.1. The base algorithms revisited for special moduli
  4.2. The gentle modulus hunt
  4.3. The sieving procedure
  4.4. Influence of the parameters s, w and w′
  4.5. Application to matrix multiplication
  4.6. Implementation issues
  4.7. Alternative moduli
Bibliography