Implementing the tangent Graeffe root finding method

Joris van der Hoeven
CNRS, École polytechnique, Institut Polytechnique de Paris
Laboratoire d'informatique de l'École polytechnique (LIX, UMR 7161)
1, rue Honoré d'Estienne d'Orves, Bâtiment Alan Turing, CS35003
91120 Palaiseau, France

Michael Monagan
Department of Mathematics, Simon Fraser University
8888 University Drive
Burnaby, British Columbia V5A 1S6, Canada

This paper is part of a project that has received funding from the French "Agence de l'innovation de défense".

Abstract. Let $p$ be a prime of the form $\sigma 2^k + 1$ with $\sigma$ small and let $\mathbb{F}_p$ denote the finite field with $p$ elements. Let $P(z)$ be a polynomial of degree $d$ in $\mathbb{F}_p[z]$ with $d$ distinct roots in $\mathbb{F}_p$. For $p = 5 \cdot 2^{55} + 1$ we can compute the roots of such polynomials of degree $10^9$. We believe we are the first to factor such polynomials of size one billion. We used a multi-core computer with two 10 core Intel Xeon E5 2680 v2 CPUs and 128 gigabytes of RAM. The factorization takes just under 4,000 seconds on 10 cores and uses 121 gigabytes of RAM. We used the tangent Graeffe root finding algorithm, which is a factor of $O(\log d)$ faster than the Cantor–Zassenhaus algorithm. We implemented the tangent Graeffe algorithm in C using our own library of 64 bit integer FFT based in-place polynomial algorithms, then parallelized the FFT and the main steps using Cilk C. In this article we discuss the steps of the tangent Graeffe algorithm, the sub-algorithms that we used, how we parallelized them, and how we organized the memory so we could factor a polynomial of degree $10^9$. We give both a theoretical and practical comparison of the tangent Graeffe algorithm with the Cantor–Zassenhaus algorithm for root finding. We improve the complexity of the tangent Graeffe algorithm by a factor of 2. We present a new in-place product tree multiplication algorithm that is fully parallelizable. We present some timings comparing our software with Magma's polynomial factorization command.

Polynomial root finding over smooth finite fields is a key ingredient for algorithms for sparse polynomial interpolation that are based on geometric sequences. This application was also one of our main motivations for the present work.

1 Introduction

Consider a polynomial function $f: \mathbb{K}^n \to \mathbb{K}$ over a field $\mathbb{K}$ given through a black box capable of evaluating $f$ at points in $\mathbb{K}^n$. The problem of sparse interpolation is to recover the representation of $f \in \mathbb{K}[x_1, \ldots, x_n]$ in its usual form, as a linear combination

(1)  $f = \sum_{1 \le i \le t} c_i\, x^{e_i}$

of monomials $x^{e_i} = x_1^{e_{i,1}} \cdots x_n^{e_{i,n}}$. One popular approach to sparse interpolation is to evaluate $f$ at points in a geometric progression. This approach goes back to work of Prony in the eighteenth century and became well known after Ben-Or and Tiwari's seminal paper. It has widely been used in computer algebra, both in theory and in practice; see Roche's survey for a nice overview.

More precisely, if a bound $T$ for the number of terms $t$ is known, then we first evaluate $f$ at $2T$ pairwise distinct points $\alpha_0, \alpha_1, \ldots, \alpha_{2T-1}$, where $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{K}^n$ and $\alpha_k = (\alpha_1^k, \ldots, \alpha_n^k)$ for all $k \in \mathbb{N}$. The generating function of the evaluations at $\alpha_k$ satisfies the identity

$\sum_{k \ge 0} f(\alpha_k)\, z^k = \sum_{1 \le i \le t} \sum_{k \ge 0} c_i \beta_i^k z^k = \sum_{1 \le i \le t} \frac{c_i}{1 - \beta_i z} = \frac{N(z)}{\Lambda(z)},$

where $\beta_i = \alpha^{e_i} = \alpha_1^{e_{i,1}} \cdots \alpha_n^{e_{i,n}}$, $\Lambda = (1 - \beta_1 z) \cdots (1 - \beta_t z)$, and $N \in \mathbb{K}[z]$ is of degree $< t$. The rational function $N/\Lambda$ can be recovered from $f(\alpha_0), f(\alpha_1), \ldots, f(\alpha_{2T-1})$ using fast Padé approximation. For well chosen points $\alpha$, it is often possible to recover the exponents $e_i$ from the values $\beta_i \in \mathbb{K}$.
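For a concrete illustration, consider the following toy example of our own (the numbers are ours, not taken from the paper): let $n = 1$, $f(x) = 3x^7 + 5x^2$ and $\alpha = 2$, so that $t = 2$, $\beta_1 = 2^7 = 128$ and $\beta_2 = 2^2 = 4$. Then

\[ \sum_{k \ge 0} f(2^k)\, z^k \;=\; \frac{3}{1 - 128 z} + \frac{5}{1 - 4 z} \;=\; \frac{8 - 652 z}{(1 - 128 z)(1 - 4 z)}, \]

and the first four values $f(1) = 8$, $f(2) = 404$, $f(4) = 49232$, $f(8) = 6291776$ suffice to determine $N/\Lambda$ by Padé approximation. Factoring $\Lambda$ yields $\beta_1 = 128$ and $\beta_2 = 4$, whence the exponents $7 = \log_2 128$ and $2 = \log_2 4$; the coefficients $3$ and $5$ then follow from a transposed Vandermonde solve.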
If the exponents $e_i$ are known, then the coefficients $c_i$ can also be recovered using a transposed version of fast multipoint interpolation. This leaves us with the question of how to compute the roots $\beta_i$ of $\Lambda$ in an efficient way. For practical applications in computer algebra, we usually have $\mathbb{K} = \mathbb{Q}$, in which case it is most efficient to use a multi-modular strategy. This means that we rather work with coefficients in a finite field $\mathbb{K} = \mathbb{F}_p$, where $p$ is a prime number that we are free to choose. It is well known that polynomial arithmetic over $\mathbb{F}_p$ can be implemented most efficiently using FFTs when the order $p - 1$ of the multiplicative group is smooth. In practice, this prompts us to choose $p$ of the form $\sigma 2^k + 1$ for some small $\sigma$ and such that $p$ fits into a machine word.

The traditional way to compute roots of polynomials over finite fields is Cantor and Zassenhaus' method. In earlier work by Grenet, van der Hoeven and Lecerf, alternative algorithms were proposed for our case of interest, when $p - 1$ is smooth. The fastest algorithm was based on the tangent Graeffe transform and it gains a factor $O(\log d)$ with respect to Cantor–Zassenhaus' method. The aim of the present paper is to report on a parallel implementation of this new algorithm and on a few improvements that allow for a further constant speed-up. A preliminary version of this paper was published at ICMS 2020. The new contributions in this paper include the theoretical contributions in section 3, the implementation details in section 4 and the (parallel) timing results in section 5.

In section 2, we start by recalling generalities about the Graeffe transform and the heuristic root finding method based on the tangent Graeffe transform. In section 3, we present the main new theoretical improvements, which all rely on optimizations in the FFT-model for fast polynomial arithmetic. Our contributions are threefold:

- In the FFT-model, one backward transform out of four can be saved for Graeffe transforms of order two (see section 3.2).
- When composing a large number of Graeffe transforms of order two, "FFT caching" can be used to gain another factor of 3/2 (see section 3.3).
- All optimizations still apply in the TFT model, which can save a further factor between one and two, depending on the number of roots (see section 3.5).

In section 3.4 we also indicate how to generalize our methods to Graeffe transforms of general orders. In section 3.7 we determine how much faster the tangent Graeffe algorithm is than the Cantor–Zassenhaus algorithm. To do this, we determine the constant factors in the complexities of both algorithms, under the assumption that arithmetic in $\mathbb{F}_p$ is done using FFT-based algorithms.

We first implemented a sequential version of the tangent Graeffe method in C, with the optimizations from sections 3.2 and 3.3. Section 4 is devoted to a more elaborate parallel implementation in Cilk C. We detail how we parallelized the different steps of the algorithm, and how we organized the memory so we could factor a polynomial of degree $10^9$ over $\mathbb{F}_p$, for $p = 5 \cdot 2^{55} + 1$. We believe we are the first to factor such large polynomials. Our code is available online.

In the last section, we give timings. We used a multi-core computer with two 10 core Intel Xeon E5 2680 v2 CPUs and 128 gigabytes of RAM. The above factorization of a polynomial of degree $10^9$ then takes just under 4,000 seconds on 10 cores and uses 121 gigabytes of RAM.

Notations. Let $\mathbb{K}$ be an effective field. Throughout the paper, time complexities count arithmetic operations in the field $\mathbb{K}$ and space complexities are for elements of $\mathbb{K}$. We use $M(d)$ to denote the time for multiplying two polynomials in $\mathbb{K}[z]$ of degree $< d$.
We make the customary assumption that $M(n)/n$ is a non-decreasing function that tends to infinity. We will chiefly be interested in the case when $\mathbb{K} = \mathbb{F}_p$, where $p$ is a prime of the form $\sigma 2^k + 1$ with $\sigma$ small. We call primes of this form smooth Fourier primes. Some examples that we use are $p = 7 \cdot 2^{26} + 1$ (a 28.8 bit prime), $p = 5 \cdot 2^{55} + 1$ (a 57.3 bit prime) and $p = 3 \cdot 29 \cdot 2^{56} + 1$ (a 62.4 bit prime). For $\mathbb{K} = \mathbb{F}_p$ with $p = \sigma 2^k + 1$ of this form and $2^k \ge 2d$, we have $M(d) = O(d \log d)$, by using FFT-multiplication.

Let $P \in \mathbb{F}_p[z]$ be a polynomial of degree $d$ which has $d$ distinct roots in $\mathbb{F}_p$. The tangent Graeffe algorithm computes the $d$ roots of $P$. The cost of the algorithm depends on a parameter $s = \sigma 2^j$ with $0 \le j \le k$. The parameter $s$ determines what proportion of the roots are found in each iteration of the algorithm. The space complexity of the algorithm is $O(s)$ and the average time complexity is

$O\bigl(M(d) \log(p/s) + s \log s + M(d) \log d\bigr).$

Theoretically, a somewhat larger choice of $s$ (growing slightly faster than $d$) yields the best time complexity. However, because we want to factor polynomials with very large $d$, our implementation chooses a smaller $s$ in the interval $[2d, 4d)$ to save space. For $s = O(d)$ the time complexity is

$O\bigl(M(d) \log(p/s) + s \log s + M(d) \log d\bigr) = O\bigl(M(d) \log p\bigr).$

2 Root finding using the tangent Graeffe transform

In this section, we recall the tangent Graeffe algorithm of Grenet, van der Hoeven and Lecerf. In the next section, we will analyze its complexity in the FFT-model and present several improvements.

2.1 Graeffe transforms

Let $\mathbb{K}$ be a general field. The traditional Graeffe transform of a monic polynomial $P \in \mathbb{K}[z]$ of degree $d$ is the unique polynomial $G(P) \in \mathbb{K}[z]$ of degree $d$ such that

(2)  $G(P)(z^2) = (-1)^d P(z) P(-z).$

If $P$ is monic, then so is $G(P)$. If $P$ splits over $\mathbb{K}$ into linear factors, $P = (z - \alpha_1) \cdots (z - \alpha_d)$, then one has

$G(P) = (z - \alpha_1^2) \cdots (z - \alpha_d^2).$

More generally, given $r \ge 2$, we define the Graeffe transform of order $r$ to be the polynomial $G_r(P) \in \mathbb{K}[z]$ of degree $d$ given by

$G_r(P)(u) = \mathrm{Res}_z\bigl(P(z),\, z^r - u\bigr).$

If $P = (z - \alpha_1) \cdots (z - \alpha_d)$, then

$G_r(P) = (z - \alpha_1^r) \cdots (z - \alpha_d^r).$

If $\omega$ is a primitive $r$-th root of unity in $\mathbb{K}$, then we also have

(3)  $G_r(P)(z^r) = (-1)^{(r-1)d}\, P(z)\, P(\omega z) \cdots P(\omega^{r-1} z).$

If $r, s \ge 2$, then we finally have

(4)  $G_{rs}(P) = G_r(G_s(P)) = G_s(G_r(P)).$

2.2 Root finding using tangent Graeffe transforms

Let $\epsilon$ be a formal indeterminate with $\epsilon^2 = 0$. Elements in $\mathbb{K}[\epsilon]/(\epsilon^2)$ are called tangent numbers. They are of the form $a + b\epsilon$ with $a, b \in \mathbb{K}$, and basic arithmetic operations go as follows:

$(a + b\epsilon) \pm (c + d\epsilon) = (a \pm c) + (b \pm d)\epsilon$
$(a + b\epsilon)(c + d\epsilon) = ac + (ad + bc)\epsilon.$

Now let $P \in \mathbb{K}[z]$ be a monic polynomial of degree $d$ that splits over $\mathbb{K}$:

$P = (z - \alpha_1) \cdots (z - \alpha_d),$

where $\alpha_1, \ldots, \alpha_d \in \mathbb{K}$ are pairwise distinct. Then the tangent deformation $\tilde P(z) := P(z + \epsilon)$ satisfies

$\tilde P = P + P' \epsilon = (z - (\alpha_1 - \epsilon)) \cdots (z - (\alpha_d - \epsilon)).$

The definitions from the previous subsection readily extend to coefficients in $\mathbb{K}[\epsilon]/(\epsilon^2)$ instead of $\mathbb{K}$. Given $r \ge 2$, we call $G_r(\tilde P)$ the tangent Graeffe transform of $P$ of order $r$. We have

$G_r(\tilde P) = (z - (\alpha_1 - \epsilon)^r) \cdots (z - (\alpha_d - \epsilon)^r),$

where

$(\alpha_k - \epsilon)^r = \alpha_k^r - r \alpha_k^{r-1}\epsilon, \qquad k = 1, \ldots, d.$

Now assume that we have an efficient way to determine the roots $\alpha_1^r, \ldots, \alpha_d^r$ of $G_r(P)$. For some polynomial $T \in \mathbb{K}[z]$, we may decompose

$G_r(\tilde P) = G_r(P) + T\epsilon.$

For any root $\alpha_k$ of $P$, we then have

$G_r(\tilde P)\bigl(\alpha_k^r - r\alpha_k^{r-1}\epsilon\bigr) = G_r(P)(\alpha_k^r) + \bigl(T(\alpha_k^r) - G_r(P)'(\alpha_k^r)\, r \alpha_k^{r-1}\bigr)\epsilon = 0.$

Whenever $\alpha_k^r$ happens to be a single root of $G_r(P)$, it follows that

$r \alpha_k^{r-1} = \frac{T(\alpha_k^r)}{G_r(P)'(\alpha_k^r)}.$

If $\alpha_k \ne 0$, this finally allows us to recover $\alpha_k$ as

$\alpha_k = \frac{r\, \alpha_k^r}{r\, \alpha_k^{r-1}} = r\, \alpha_k^r\, \frac{G_r(P)'(\alpha_k^r)}{T(\alpha_k^r)}.$

2.3 Heuristic root finding over smooth finite fields

Assume now that $\mathbb{K} = \mathbb{F}_p$ is a finite field, where $p$ is a prime number of the form $p = \sigma 2^k + 1$ for some small $\sigma$. Assume also that $g \in \mathbb{F}_p$ is a primitive element of order $p - 1$ for the multiplicative group of $\mathbb{F}_p$. Let $P = (z - \alpha_1) \cdots (z - \alpha_d) \in \mathbb{F}_p[z]$ be as in the previous subsection. The tangent Graeffe method can be used to efficiently compute those roots $\alpha_k$ of $P$ for which $\alpha_k^r$ is a single root of $G_r(P)$.
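Over $\mathbb{F}_p$, the tangent number arithmetic used above reduces to a handful of modular operations. The following minimal C sketch is ours, not the paper's library code; section 4 uses the faster Möller–Granlund method for modular multiplication instead of the plain __int128 remainder below.

#define LONG long long int

/* plain modular arithmetic for p < 2^63 (illustrative substitute) */
static LONG addmod64( LONG a, LONG b, LONG p ) { LONG s = a+b; return s >= p ? s-p : s; }
static LONG mulmod64( LONG a, LONG b, LONG p ) { return (LONG)((unsigned __int128)a*b % p); }

typedef struct { LONG a, b; } tangent;   /* represents a + b*eps with eps^2 = 0 */

tangent tadd( tangent x, tangent y, LONG p ) {
   tangent z = { addmod64(x.a,y.a,p), addmod64(x.b,y.b,p) };
   return z;
}

tangent tmul( tangent x, tangent y, LONG p ) {
   /* (a1 + b1 eps)(a2 + b2 eps) = a1 a2 + (a1 b2 + b1 a2) eps */
   tangent z = { mulmod64(x.a,y.a,p),
                 addmod64(mulmod64(x.a,y.b,p), mulmod64(x.b,y.a,p), p) };
   return z;
}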
In order to guarantee that there are a sufficient number of such roots, we first replace $P(z)$ by $P(z + \tau)$ for a random shift $\tau \in \mathbb{F}_p$, and use the following heuristic:

(H) For any subset $\{\alpha_1, \ldots, \alpha_d\} \subseteq \mathbb{F}_p$ of cardinality $d$ and any $r \le (p-1)/(4d)$, there exist at least $p/2$ elements $\tau \in \mathbb{F}_p$ such that $\{(\alpha_1 - \tau)^r, \ldots, (\alpha_d - \tau)^r\}$ contains at least $3d/4$ elements.

For a random shift $\tau \in \mathbb{F}_p$ and any $r \le (p-1)/(4d)$, the assumption ensures with probability at least $1/2$ that $G_r(P(z + \tau))$ has at least $d/2$ single roots.

Now take $r$ to be the largest power of two such that $r \le (p-1)/(4d)$ and let $s = (p-1)/r$. By construction, note that $4d \le s < 8d$. The roots $\alpha_1^r, \ldots, \alpha_d^r$ of $G_r(P)$ are all $s$-th roots of unity, i.e. they lie in the set $\{1, \omega, \ldots, \omega^{s-1}\}$, where $\omega = g^r$. We may thus determine them by evaluating $G_r(P)$ at $\omega^i$ for $i = 0, \ldots, s-1$. Since $s = O(d)$, this can be done efficiently using a discrete Fourier transform. Combined with the tangent Graeffe method from the previous subsection, this leads to the following probabilistic algorithm for root finding:

Algorithm 1
Input: $P \in \mathbb{F}_p[z]$ of degree $d$ with $d$ distinct roots in $\mathbb{F}_p$, where $p = \sigma 2^k + 1$.
Output: the set $\{\alpha_1, \ldots, \alpha_d\}$ of roots of $P$.

1. If $d = 0$, then return $\emptyset$.
2. Let $r = 2^N$ be largest such that $r \le (p-1)/(4d)$ and let $s = (p-1)/r$.
3. Pick $\tau \in \mathbb{F}_p$ at random.
4. Compute $\tilde P := P(z + \tau + \epsilon) = P(z+\tau) + P'(z+\tau)\epsilon \in (\mathbb{F}_p[\epsilon]/(\epsilon^2))[z]$.
5. For $i = 1, \ldots, N$, set $\tilde P := G(\tilde P) \in (\mathbb{F}_p[\epsilon]/(\epsilon^2))[z]$.  [cost $O(M(d)\log(p/s))$]
6. Let $\omega = g^r \in \mathbb{F}_p$ be of order $s$ and write $\tilde P = A + B\epsilon$.
7. Compute the derivative $A'$ of $A$.
8. Compute $A(\omega^i)$, $A'(\omega^i)$ and $B(\omega^i)$ for $i = 0, \ldots, s-1$.  [cost $O(s \log s)$]
9. If $P(\tau) = 0$, then set $S := \{\tau\}$, else set $S := \emptyset$. For each $\beta \in \{1, \omega, \ldots, \omega^{s-1}\}$ with $A(\beta) = 0$ and $A'(\beta) \ne 0$, set $S := S \cup \{r \beta A'(\beta)/B(\beta) + \tau\}$.
10. Compute $Q := \prod_{\alpha \in S}(z - \alpha)$.  [cost $O(M(d) \log d)$]
11. Compute $R := P/Q$.  [cost $O(M(d))$]
12. Recursively determine the set of roots $S'$ of $R$ and return $S \cup S'$.

Remark 1. To compute $G(\tilde P) = G(A + B\epsilon)$ we may use

$G(A + B\epsilon)(z^2) = (-1)^d\bigl(A(z)A(-z) + (A(z)B(-z) + A(-z)B(z))\epsilon\bigr),$

which requires three polynomial multiplications in $\mathbb{F}_p[z]$ of degree $d$. In total, step 5 therefore performs $3N = O(\log p)$ such multiplications. We discuss how to perform step 5 efficiently in the FFT model in section 3.3.

Remark 2. For practical implementations, one may vary the threshold $r \le (p-1)/(4d)$ for $r$ and the resulting threshold $s \ge 4d$ for $s$. For larger values of $s$, the computations of the DFTs in step 8 get more expensive, but the proportion of single roots goes up, so more roots are determined at each iteration. From an asymptotic complexity perspective, it would be best to take $s$ somewhat larger than $4d$. In practice, we actually preferred to take the threshold $s \ge 2d$, because the constant factor of our implementation of step 8 (based on Bluestein's algorithm) is significant with respect to our highly optimized implementation of the tangent Graeffe method. A second reason we prefer $s$ of size $O(d)$ is that the total space used by the algorithm is linear in $s$.

Remark 3. For the application to sparse interpolation, it is possible to further speed up step 8 for the top-level iteration, which is the most expensive step. More precisely, for a polynomial with $t$ terms, the idea is to take $\tau = 0$ and $\omega$ of order $c\,t$ instead of $s$, for some small constant $c$ (say $2 \le c \le 3$). This reduces $s$ (and the cost of the top-level iteration) accordingly. For the recursive calls, we still need to work with a primitive root of unity $\omega$ of order $s$ and random shifts.

3 Computing Graeffe transforms

3.1 Reminders about discrete Fourier transforms

Assume that $n \in \mathbb{N}$ is invertible in $\mathbb{K}$ and let $\omega \in \mathbb{K}$ be a primitive $n$-th root of unity. Consider a polynomial $A = a_0 + a_1 z + \cdots + a_{n-1} z^{n-1} \in \mathbb{K}[z]$. Then the discrete Fourier transform (DFT) of order $n$ of the sequence $(a_i)_{0 \le i < n}$ is defined by

$\mathrm{DFT}_\omega\bigl((a_i)_{0 \le i < n}\bigr) := (\hat a_k)_{0 \le k < n}, \qquad \hat a_k := A(\omega^k).$

We will write $F(n)$ for the cost of one discrete Fourier transform in terms of the number of operations in $\mathbb{K}$, and we assume that $F(n) = O(n \log n)$. For any $\ell \in \{0, \ldots, n-1\}$, we have

(5)  $\bigl(\mathrm{DFT}_{\omega^{-1}}(\mathrm{DFT}_\omega((a_i)))\bigr)_\ell = \sum_{0 \le k < n} \hat a_k\, \omega^{-k\ell} = \sum_{0 \le i < n} a_i \sum_{0 \le k < n} \omega^{k(i-\ell)} = n\, a_\ell.$

If $n$ is invertible in $\mathbb{K}$, then it follows that $\mathrm{DFT}_\omega^{-1} = n^{-1}\,\mathrm{DFT}_{\omega^{-1}}$.
The costs of direct and inverse transforms therefore coincide up to a factor $O(n)$. If $n = n_1 n_2$ is composite, then writing $i = i_1 + n_1 i_2$ and $k = k_2 + n_2 k_1$ with $0 \le i_1, k_1 < n_1$ and $0 \le i_2, k_2 < n_2$, we have

(6)  $\hat a_{k_2 + n_2 k_1} = \sum_{i_1, i_2} a_{i_1 + n_1 i_2}\, \omega^{(i_1 + n_1 i_2)(k_2 + n_2 k_1)} = \sum_{0 \le i_1 < n_1} \Bigl[\omega^{i_1 k_2}\, \mathrm{DFT}_{\omega^{n_1}}\bigl((a_{i_1 + n_1 i_2})_{i_2}\bigr)_{k_2}\Bigr]\,(\omega^{n_2})^{i_1 k_1}.$

This shows that a DFT of length $n$ reduces to $n_1$ transforms of length $n_2$ plus $n_2$ transforms of length $n_1$ plus $n$ multiplications in $\mathbb{K}$:

$F(n_1 n_2) \le n_1 F(n_2) + n_2 F(n_1) + O(n).$

In particular, if $\sigma = O(1)$, then $F(\sigma 2^j) \lesssim \sigma F(2^j)$ up to linear overhead. It is sometimes convenient to apply DFTs directly to polynomials as well; for this reason, we also define $\mathrm{DFT}_\omega(A) := (A(\omega^k))_{0 \le k < n}$. Given two polynomials $A, B \in \mathbb{K}[z]$ with $\deg(AB) < n$, we may then compute the product $AB$ using

$AB = \mathrm{DFT}_\omega^{-1}\bigl(\mathrm{DFT}_\omega(A)\,\mathrm{DFT}_\omega(B)\bigr).$

In particular, we obtain $M(n) \le 3 F(2n) + O(n) \sim 6 F(n)$, where we recall that $M(n)$ stands for the cost of multiplying two polynomials of degree $< n$.

Remark 4. In Algorithm 1, we note that step 8 comes down to the computation of three DFTs of length $s$. Since $s$ divides $p - 1$, this length is of the form $s = \sigma 2^j$ for some $j \le k$. In view of (6), we may therefore reduce step 8 to $\sigma$ DFTs of length $2^j$ plus $2^j$ DFTs of length $\sigma$. If $\sigma$ is very small, then we may use a naive implementation for the DFTs of length $\sigma$. In general, one may use Bluestein's algorithm to reduce the computation of a DFT of length $\sigma$ to the computation of a product in $\mathbb{K}[z]/(z^\sigma - 1)$, which can in turn be computed using FFT-multiplication and three DFTs of length a larger power of two.

3.2 Graeffe transforms of order two

Let $\mathbb{K}$ be a field with a primitive $(2n)$-th root of unity $\omega$. Let $P \in \mathbb{K}[z]$ be a polynomial of degree $d \le n$. Then the relation (2) yields

$G(P)\bigl((\omega^2)^k\bigr) = (-1)^d\, P(\omega^k)\, P(\omega^{k+n}), \qquad k = 0, \ldots, n-1,$

since $\omega^n = -1$. As $\omega^2$ is a primitive $n$-th root of unity, the $n$ values on the left determine $G(P)$, so $\mathrm{DFT}_{\omega^2}(G(P))$ can be obtained from $\mathrm{DFT}_\omega(P)$ using $n$ pointwise multiplications. This leads to the following algorithm for the computation of $G(P)$:

Algorithm 2
Input: a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P \le n$ and a primitive $(2n)$-th root of unity $\omega \in \mathbb{K}$.
Output: $G(P)$.
1. Compute $(\hat P_k)_{0 \le k < 2n} := \mathrm{DFT}_\omega(P)$.
2. For $k = 0, \ldots, n-1$, compute $b_k := (-1)^d\, \hat P_k\, \hat P_{k+n}$.
3. Return $\mathrm{DFT}_{\omega^2}^{-1}\bigl((b_k)_{0 \le k < n}\bigr)$.

Proposition 1. Let $\omega \in \mathbb{K}$ be a primitive $(2n)$-th root of unity in $\mathbb{K}$ and assume that $n$ is invertible in $\mathbb{K}$. Given a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P \le n$, we can compute $G(P)$ in time

$GT_2(n) \sim 3 F(n).$

Proof. We have already explained the correctness of Algorithm 2. Step 1 requires one forward DFT of length $2n$ and cost $F(2n) = 2F(n) + O(n)$. Step 2 can be done in linear time $O(n)$. Step 3 requires one inverse DFT of length $n$ and cost $F(n) + O(n)$. The total cost of Algorithm 2 is therefore $3F(n) + O(n) \sim 3F(n)$.

Remark 5. In terms of the complexity of multiplication, we obtain $GT_2(n) \sim \frac12 M(n)$. This gives a $4/3$ improvement over the previously best known bound $\sim \frac23 M(n)$ that was used in earlier work. Note that the best known algorithm for computing squares of polynomials of degree $< n$ takes time $\sim \frac23 M(n)$. It would be interesting to know whether squares can also be computed in time $\sim \frac12 M(n)$.
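In code, Algorithm 2 is only a few lines on top of an FFT library. The sketch below is ours, not the paper's code: it assumes a forward transform fft(A,len,w,p) that overwrites A[k] with A(w^k) in natural order for a primitive len-th root w, and an inverse ifft with the analogous signature; mulmod64/addmod64 are as in the earlier sketch.

/* Graeffe transform of order two, following Algorithm 2 (a sketch under the
   stated assumptions). P has degree d <= n; Phat has room for 2n entries. */
void graeffe2( LONG *P, LONG d, LONG n, LONG *Phat, LONG w, LONG p ) {
   LONG k, b;
   for( k=0; k<=d; k++ ) Phat[k] = P[k];          /* copy and zero-pad */
   for( k=d+1; k<2*n; k++ ) Phat[k] = 0;
   fft(Phat,2*n,w,p);                             /* Phat[k] = P(w^k) */
   for( k=0; k<n; k++ ) {                         /* pointwise products */
      b = mulmod64(Phat[k],Phat[k+n],p);          /* P(w^k) P(-w^k) */
      Phat[k] = (d & 1) ? p - b : b;              /* times (-1)^d */
   }
   ifft(Phat,n,mulmod64(w,w,p),p);                /* inverse DFT at root w^2 */
   for( k=0; k<=d; k++ ) P[k] = Phat[k];          /* G(P) has degree d */
}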
3.3 Graeffe transforms of power of two orders

In view of (4), Graeffe transforms of power of two orders $2^m$ can be computed using

$G_{2^m}(P) = (\underbrace{G \circ \cdots \circ G}_{m})(P).$

Now assume that we computed the first Graeffe transform $G(P)$ using Algorithm 2 and that we wish to apply a second Graeffe transform to the result. The values

$\mathrm{DFT}_{\omega^2}(G(P))_k = G(P)\bigl((\omega^2)^k\bigr) = b_k$

are already known for $k = 0, \ldots, n-1$: they are exactly the pointwise products formed in step 2. We can use this to accelerate step 1 of the second application of Algorithm 2. Indeed, in view of (6) for $n_1 = 2$ and $n_2 = n$, only the values at the odd powers of $\omega$ remain to be computed:

(7)  $\mathrm{DFT}_\omega(G(P))_{2k+1} = \mathrm{DFT}_{\omega^2}\bigl((G(P)_i\, \omega^i)_{0 \le i < n}\bigr)_k, \qquad k = 0, \ldots, n-1.$

In order to exploit this idea in a recursive fashion, it is useful to modify Algorithm 2 so as to include $\mathrm{DFT}_{\omega^2}(P)$ in the input and $\mathrm{DFT}_{\omega^2}(G(P))$ in the output. This leads to the following algorithm:

Algorithm 3
Input: a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P \le n$, a primitive $(2n)$-th root of unity $\omega \in \mathbb{K}$, and $(a_k)_{0 \le k < n} = \mathrm{DFT}_{\omega^2}(P)$.
Output: $G(P)$ and $\mathrm{DFT}_{\omega^2}(G(P))$.
1. Set $(\hat P_{2k})_{0 \le k < n} := (a_k)_{0 \le k < n}$.
2. Set $(\hat P_{2k+1})_{0 \le k < n} := \mathrm{DFT}_{\omega^2}\bigl((P_i \omega^i)_{0 \le i < n}\bigr)$.
3. For $k = 0, \ldots, n-1$, compute $b_k := (-1)^d\, \hat P_k\, \hat P_{k+n}$.
4. Return $\mathrm{DFT}_{\omega^2}^{-1}\bigl((b_k)_{0 \le k < n}\bigr)$ and $(b_k)_{0 \le k < n}$.

Proposition 2. Let $\omega \in \mathbb{K}$ be a primitive $(2n)$-th root of unity in $\mathbb{K}$ and assume that $n$ is invertible in $\mathbb{K}$. Given a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P \le n$ and $m \ge 1$, we can compute $G_{2^m}(P)$ in time

$GT_{2^m}(n) \sim (2m+1)\, F(n) \sim \tfrac{2m+1}{6}\, M(n).$

Proof. It suffices to compute $\mathrm{DFT}_{\omega^2}(P)$ and then to apply Algorithm 3 recursively, $m$ times. Every application of Algorithm 3 takes $2F(n) + O(n)$ operations in $\mathbb{K}$, whence the claimed complexity bound.

Remark 6. In earlier work, Graeffe transforms of order $2^m$ were directly computed using the formula (2), using $\sim 4m F(n)$ operations in $\mathbb{K}$. The new algorithm is twice as fast for large $m$.

3.4 Graeffe transforms of arbitrary smooth orders

The algorithms from subsections 3.2 and 3.3 readily generalize to Graeffe transforms of order $r^m$ for arbitrary $r \ge 2$, provided that we have an $(rn)$-th root of unity $\omega \in \mathbb{K}$. For convenience of the reader, we specify the generalization of Algorithm 3 below, together with the resulting complexity bounds.

Algorithm 4
Input: a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P \le n$, an order $r \ge 2$, a primitive $(rn)$-th root of unity $\omega \in \mathbb{K}$, and $(a_k)_{0 \le k < n} = \mathrm{DFT}_{\omega^r}(P)$.
Output: $G_r(P)$ and $\mathrm{DFT}_{\omega^r}(G_r(P))$.
1. Set $(\hat P_{rk})_{0 \le k < n} := (a_k)_{0 \le k < n}$.
2. For $j = 1, \ldots, r-1$, set $(\hat P_{rk+j})_{0 \le k < n} := \mathrm{DFT}_{\omega^r}\bigl((P_i \omega^{ji})_{0 \le i < n}\bigr)$.
3. For $k = 0, \ldots, n-1$, compute $b_k := (-1)^{(r-1)d}\, \hat P_k\, \hat P_{k+n} \cdots \hat P_{k+(r-1)n}$.
4. Return $\mathrm{DFT}_{\omega^r}^{-1}\bigl((b_k)_{0 \le k < n}\bigr)$ and $(b_k)_{0 \le k < n}$.

Proposition 3. Let $\omega \in \mathbb{K}$ be a primitive $(rn)$-th root of unity in $\mathbb{K}$, where $r \ge 2$ and $n$ are invertible in $\mathbb{K}$. Given a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P \le n$ and $m \ge 1$, we can compute $G_{r^m}(P)$ in time

$GT_{r^m}(n) \sim r m\, F(n) \sim \tfrac{r m}{6}\, M(n).$

Proof. Straightforward generalization of Proposition 2.

Corollary 1. Let $\omega \in \mathbb{K}$ be a primitive $(r_1^{m_1} \cdots r_q^{m_q}\, n)$-th root of unity in $\mathbb{K}$, where $r_1 \ge 2, \ldots, r_q \ge 2$ are invertible in $\mathbb{K}$. Given a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P \le n$ and $m_1, \ldots, m_q \in \mathbb{N}$, we can compute $G_{r_1^{m_1} \cdots r_q^{m_q}}(P)$ in time

$GT_{r_1^{m_1} \cdots r_q^{m_q}}(n) \sim \bigl(r_1 m_1 + \cdots + r_q m_q + O(1)\bigr)\, F(n).$

Proof. Direct consequence of (4).

Remark 7. In our application to root finding, we are interested in the efficient computation of Graeffe transforms of high order $r^m$. In terms of the size $m \log r$ of the order, it is instructive to observe that the "average cost"

$\frac{GT_{r^m}(n)}{m \log_2 r} \sim \frac{r}{\log_2 r}\, F(n)$

is minimal for $r = 3$. This suggests that it might be interesting to use Graeffe transforms of order three whenever possible. In the application of Algorithm 1, this would lead us to take primes of the form $p = \sigma 2^a 3^b + 1$, with $\sigma$ small and $2^a$ close to $s$. This still allows us to use radix 2 FFTs, while at the same time benefitting from radix 3 Graeffe transforms.

3.5 Truncated Fourier transforms

If $\mathbb{K} = \mathbb{F}_p$ is a fixed finite field, then DFTs are most efficient for sizes that divide $p - 1$. For our root finding application, it is often convenient to take $p = \sigma 2^k + 1$, in which case the transform length should be a power of two or, say for $\sigma = 3$, three times a power of two. The truncated Fourier transform (TFT) was developed for the multiplication of polynomials such that the degree of the product does not have a nice size of this type. It turns out that we may also use it for the efficient computation of Graeffe transforms of polynomials of arbitrary degrees. Moreover, the optimizations from the previous subsections still apply.

Let us briefly describe how the truncated Fourier transform can be used for the computation of Graeffe transforms of power of two orders. With the notations from subsections 3.2 and 3.3, we assume that $n = 2^\ell$ is a power of two as well and that we wish to compute the Graeffe transform of a polynomial $P$ of degree $< t$ with $t \le n$.
Let $\mathrm{rev}(k)$ denote the reversal of a binary number $k \in \{0, \ldots, 2n-1\}$ of $\log_2(2n)$ bits. For instance, for $2n = 64$, $\mathrm{rev}(12) = 12$ and $\mathrm{rev}(5) = 40$. Then the truncated Fourier transform of $P$ at order $t \le 2n$ is defined by

$\mathrm{TFT}_{\omega,t}(P) := \bigl(P(\omega^{\mathrm{rev}(0)}), P(\omega^{\mathrm{rev}(1)}), \ldots, P(\omega^{\mathrm{rev}(t-1)})\bigr).$

It has been shown by van der Hoeven that $\mathrm{TFT}_{\omega,t}(P)$ and $\mathrm{TFT}_{\omega,t}^{-1}$ (for $\deg P < t$) can both be computed in time $\approx \frac{t}{2n} F(2n)$. More generally, for direct transforms, one may compute

$\mathrm{TFT}_{\omega,[T_1,T_1+T)}(P) := \bigl(P(\omega^{\mathrm{rev}(T_1)}), P(\omega^{\mathrm{rev}(T_1+1)}), \ldots, P(\omega^{\mathrm{rev}(T_1+T-1)})\bigr)$

in time $\approx \frac{T}{2n} F(2n)$, whenever $0 \le T_1$ and $T_1 + T \le 2n$. For generalizations to arbitrary radices, we refer to Larrieu's work.

For $k < t$, we note that $\mathrm{rev}(2k+1) = \mathrm{rev}(2k) + n$, whence

$P(\omega^{\mathrm{rev}(2k+1)}) = P(\omega^{\mathrm{rev}(2k)+n}) = P(-\omega^{\mathrm{rev}(2k)}).$

Since moreover $\mathrm{rev}(2k)$ coincides with the $\log_2 n$ bit reversal of $k$, this provides us with the required counterpart of the pairing of Algorithm 2 for retrieving $\mathrm{TFT}_{\omega^2,t}(G(P))$ efficiently from $\mathrm{TFT}_{\omega,2t}(P)$. The caching relation (7) also has a natural counterpart, which allows the values of $G(P)$ needed for the next Graeffe step to be recycled in the same way. This leads to the following refinement of Algorithm 3:

Algorithm 5
Input: a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P < t \le n = 2^\ell$, a primitive $(2n)$-th root of unity $\omega \in \mathbb{K}$, and $(a_k)_{0 \le k < t} = \mathrm{TFT}_{\omega^2,t}(P)$.
Output: $G(P)$ and $\mathrm{TFT}_{\omega^2,t}(G(P))$.
1. Set $(\hat P_{2k})_{0 \le k < t} := (a_k)_{0 \le k < t}$.
2. Set $(\hat P_{2k+1})_{0 \le k < t} := \mathrm{TFT}_{\omega,[t,2t)}(P)$.
3. For $k = 0, \ldots, t-1$, compute $b_k := (-1)^d\, \hat P_{2k}\, \hat P_{2k+1}$.
4. Return $\mathrm{TFT}_{\omega^2,t}^{-1}\bigl((b_k)_{0 \le k < t}\bigr)$ and $(b_k)_{0 \le k < t}$.

Proposition 4. Let $\omega \in \mathbb{K}$ be a primitive $(2n)$-th root of unity in $\mathbb{K}$, where $n = 2^\ell$, and assume that $n$ is invertible in $\mathbb{K}$. Given a monic polynomial $P \in \mathbb{K}[z]$ with $\deg P < t \le n$ and $m \ge 1$, we can compute $G_{2^m}(P)$ in time

$GT_{2^m}(t) \sim (2m+1)\,\frac{t}{2n}\, F(2n).$

Proof. Straightforward adaptation of the proof of Proposition 2, while using the above complexity bounds for truncated Fourier transforms.

3.6 Taylor shifts

In step 4 of Algorithm 1, we still need an algorithm to compute the Taylor shift $P(z + \tau)$. If the characteristic of $\mathbb{K}$ exceeds $d$, then it is (not so) well known, going back to Aho, Steiglitz and Ullman, that this can be done in time $M(d) + O(d)$, using the following reduction to a single polynomial multiplication of degree $d$:

Algorithm 6
Input: $P \in \mathbb{K}[z]$ of degree $d < \mathrm{char}\,\mathbb{K}$ and $\tau \in \mathbb{K}$.
Output: $P(z + \tau)$.
1. $L := 0!\,P_0 + 1!\,P_1 z + \cdots + d!\,P_d z^d$
2. $\tilde L := z^d L(1/z)$
3. $E := 1 + \tau z + \frac{\tau^2}{2!} z^2 + \cdots + \frac{\tau^d}{d!} z^d$
4. $\tilde H := \tilde L\, E \ \mathrm{rem}\ z^{d+1}$
5. $H := z^d \tilde H(1/z)$
6. Return $\frac{H_0}{0!} + \frac{H_1}{1!} z + \cdots + \frac{H_d}{d!} z^d$

It is interesting to observe that Taylor shifts can still be computed in time $O(M(d) \log d)$ in small characteristic, as follows:

Algorithm 7
Input: $P \in \mathbb{K}[z]$ of degree $d$, where $p = \mathrm{char}\,\mathbb{K} \ne 0$, and $\tau \in \mathbb{K}$.
Output: $P(z + \tau)$.
1. Define $z_i = z^{p^i}$ for $i = 0, \ldots, k$, where $k = \lfloor \log d / \log p \rfloor$.
2. Rewrite $P$ as $P(z_0, z_1, \ldots, z_k) \in \mathbb{K}[z_0, \ldots, z_k]$ with $\deg_{z_i} P < p$ for $i = 0, \ldots, k-1$.
3. For $i = 0, \ldots, k$, replace $P := P(z_0, \ldots, z_{i-1}, z_i + \tau^{p^i}, z_{i+1}, \ldots, z_k)$.
4. Return $P(z, z^p, \ldots, z^{p^k})$.

(The algorithm rests on the identity $(z + \tau)^{p^i} = z^{p^i} + \tau^{p^i}$ in characteristic $p$.)

3.7 Comparison with Cantor–Zassenhaus' algorithm

We give a theoretical comparison of Algorithm 1 with the Cantor–Zassenhaus algorithm, where both algorithms have been optimized in the "FFT model". For this comparison, it is convenient to replace the "worst case" heuristic (H) by a more empirical assumption. More precisely, if we take $s \ge \lambda d$ for some $\lambda \ge 1$, then the expected proportion of single roots is $e^{-1/\lambda}$. This expected proportion is indeed observed in practice: see Table 3. In Algorithm 1, we took $\lambda = 4$.

Proposition 5. Consider Algorithm 1 with the modification that we take $s \ge \lambda d$ instead of $s \ge 4d$, for some fixed $\lambda \ge 1$. Then the expected cost of Algorithm 1 is bounded by

(8)  $\Bigl(\tfrac{2}{3}\, e^{1/\lambda} \log_2\tfrac{p}{\lambda d} + \tfrac12 \log_2 d + O(1)\Bigr)\, M(d).$

Proof. We first analyze the cost of step 10. Let $T(k)$ be the cost of the polynomial multiplications in the product tree algorithm, where $k$ is the size of $S$. Exercise 10.3 of von zur Gathen and Gerhard shows that $T(k) \le \frac12 M(k)\log_2 k + O(M(k))$. The recurrence for $T$ is $T(k) = 2T(k/2) + M(k/2)$ for $k > 1$ and $T(1) = 0$. Solving this recurrence while using the assumption that $M(k)/k$ is non-decreasing yields $T(k) = \frac12 M(k)\log_2 k + O(M(k))$. Now let $k_1, \ldots, k_\ell$ be the successive sizes of $S$ for the recursive calls of the algorithm, so that $k_1 + \cdots + k_\ell = d$. Thus the cost of all executions of step 10 is $T(k_1) + \cdots + T(k_\ell)$. Using again that $M(k)/k$ is non-decreasing, we have $T(k_1) + \cdots + T(k_\ell) \le \frac12 M(d)\log_2 d + O(M(d))$.

Let us next analyze the cost of the other steps up to step 11 inclusive, but without the recursive calls in step 12. Set $N = \log_2\frac{p-1}{s}$.
- Computing $P(z + \tau + \epsilon)$ in step 4 costs $O(M(d))$, using Algorithm 6.
- Step 7 takes $O(d)$ to compute $A'$.
- By Proposition 2, adapted to tangent numbers and truncated transforms, step 5 costs $\sim 4N F(d) + O(Nd)$, which is equivalent to $\frac23 N M(d)$.
- Step 8 is $O(s \log s) = O(M(d))$ for fixed $\lambda$, using Bluestein's algorithm and (6).
- The division in step 11 is $O(M(d))$ using fast division.

Altogether, the cost of these steps is $\bigl(\frac23 \log_2\frac{p}{\lambda d} + O(1)\bigr) M(d)$.

We now need to account for the cost of the recursive calls in step 12. For $s \ge \lambda d$, the tangent Graeffe algorithm will, under our hypothesis, on average obtain at least a proportion $e^{-1/\lambda}$ of the roots in each recursive call, leaving slightly less than a proportion $\kappa := 1 - e^{-1/\lambda}$ of them to be found. Since $\sum_{i \ge 0} \kappa^i = \frac{1}{1-\kappa} = e^{1/\lambda}$ and $M(d)\log d$ is non-decreasing, the total cost of Algorithm 1 is therefore as claimed.

Remark 8. In the asymptotic region where $\log d = o(\log p)$, the bound (8) reduces to $\bigl(\frac23 e^{1/\lambda} + o(1)\bigr)\log_2(p)\, M(d)$. Since we may pick $\lambda$ as large as we want, the complexity of Algorithm 1 is then bounded by $\bigl(\frac23 + \epsilon\bigr)\log_2(p)\, M(d)$ for any $\epsilon > 0$.

Assume from now on that we are in the asymptotic region where $\log d = o(\log p)$. Then Remark 8 shows that the cost of the tangent Graeffe method is $\bigl(\frac23 + \epsilon\bigr)\log_2(p)\, M(d)$ for any $\epsilon > 0$. The bottleneck of Cantor–Zassenhaus' algorithm in this region is modular exponentiation, which accounts for a total cost of $O(M(d)\log p \log d)$. We wish to determine the constant factor in order to give an accurate comparison between the two algorithms.

Given a polynomial $P$ with $\deg P = d$, one modular exponentiation does $\lceil\log_2 p\rceil$ modular squarings

$R := S^2 \ \mathrm{rem}\ P$

in a loop. However, in the first few iterations, the degree of $S^2$ is less than $d$, so no division is needed. The number of steps with divisions is $\log_2(p/d) + O(1)$. Using fast division, the remainder can be computed with two multiplications of size $d$, assuming the required inverse of the reciprocal of $P$ is precomputed. In the FFT model, one forward transform of $S^2$ may be saved and the forward transform of the inverse of $P$ may be cached. Each squaring step with division then costs $6 F(2d) + O(d) = 2 M(d) + O(d)$, since $\deg S^2 \le 2d - 2$. The cost of the top-level modular exponentiation is therefore equivalent to $2 M(d)\log_2(p/d)$. Using a similar recursion as for $T(k)$ in the proof of Proposition 5, the total cost of all modular exponentiations in Cantor–Zassenhaus is equivalent to $2 M(d)\log_2(p/d)\log_2 d$ (strictly speaking, this is really an upper bound, but we indeed obtain an equivalence for common values of $d$). Altogether, this means that the tangent Graeffe algorithm is about $3 \log_2 t$ times faster than Cantor–Zassenhaus, writing $t = d$ for the degree as in the sparse interpolation application.

Remark 9. In practice, we often have $\log p = O(\log d)$, and the complexity of Cantor–Zassenhaus then also involves another $O(M(d)\log d)$ term due to gcd computations. The corresponding constant factor is fairly high, which makes the tangent Graeffe algorithm even more favorable in this asymptotic region.
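To make the bottleneck analyzed above concrete, here is an illustrative C sketch of the repeated modular squaring loop; it is ours, not the authors' code nor Magma's. The helpers polsqr64s (FFT-based squaring) and polrem64s (fast division using a precomputed inverse of the reciprocal of P) are assumptions, not shown.

/* Cantor-Zassenhaus kernel: compute S := S^(2^m) rem P by m modular
   squarings, each costing ~2 M(d) in the FFT model with cached transforms. */
void modsqr_loop( LONG *S, LONG d, LONG *P, LONG *Pinv, LONG m, LONG p ) {
   LONG i;
   for( i=0; i<m; i++ ) {
      polsqr64s(S,d-1,p);         /* S = S^2, degree at most 2d-2 */
      polrem64s(S,P,Pinv,d,p);    /* S = S rem P, degree back below d */
   }
}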
4 Implementing the tangent Graeffe algorithm

For our implementation, we have been pursuing three goals. We first wanted to reimplement the tangent Graeffe algorithm from scratch using FFT-based polynomial arithmetic and compare it with a good implementation of the Cantor–Zassenhaus algorithm. Second, we wanted to factor a polynomial of degree $10^9$. For this, it is important to consider space efficiency. Third, we wanted to parallelize the tangent Graeffe algorithm for a multi-core computer. Unlike the Cantor–Zassenhaus algorithm, the tangent Graeffe algorithm is gcd-free, which makes it easier to parallelize. We first tried parallelizing only the FFTs that appear in steps 4, 5, 8, 10 and 11 of Algorithm 1 to see if this alone is sufficient to obtain good parallel speedup. It isn't.

Our main practical challenge is, however, space, not time. For a 64 bit prime $p$ and an input polynomial of degree $10^9$, the input polynomial and output roots alone need 16 gigabytes of storage. Our first implementation exceeded the virtual memory of our machine, which has 128 gigabytes of RAM plus 240 gigabytes of SSD swap space. To reduce space we use in-place algorithms. In-place algorithms do not allocate new space (memory). They work in the input and output space, and we allow them to be given additional working space. To parallelize a recursive in-place algorithm, we must partition the input, output and the temporary memory (if any) for recursive parallel calls. Our design is such that the space used does not increase with more cores. We refer the reader to Giorgi, Grenet and Roche for examples of in-place serial algorithms in computer algebra and a bibliography.

4.1 Parallelizing the FFTs

Number theoretic FFTs over $\mathbb{F}_p$ were first introduced by Pollard. Let us first explain how we parallelized such FFTs using Cilk C. The strategy is classical, but it is convenient to detail it here, since we use the same strategy for parallelizing the other steps in Algorithm 1. Cilk was designed to support the parallelization of recursive divide and conquer algorithms, like the FFT, for multi-core computers. To parallelize recursive algorithms using Cilk, we first modify the code so that the memory locations that are updated in the recursive calls do not overlap.

Figure 1 shows our sequential C code for the two radix 2 FFTs that we use. In the vernacular, fft1 is known as the "decimation-in-time" FFT and fft2 as the "decimation-in-frequency" FFT. We refer the reader to Chu and George for a discussion of the two FFTs. We note that Geddes, Czapor and Labahn present only one of them in their book on computer algebra, whereas von zur Gathen and Gerhard present only the other.

[Figure 1. C code for two FFTs (fft1 and fft2) over $\mathbb{F}_p$.]

In the code in Figure 1, LONG is a macro for long long int, a 64 bit signed integer. The C functions addmod, submod, mulmod implement $+$, $-$, $\times$ in $\mathbb{F}_p$, respectively, for $p < 2^{63}$. For multiplication in $\mathbb{F}_p$ we use Roman Pearce's implementation of the method of Möller and Granlund. The input array $A$ of size $n$ is the main input to and output from the FFT. The input $W$ is an array of size $n$ containing powers of $\omega$, a primitive $n$-th root of unity in $\mathbb{F}_p$. We precompute

$W = \bigl[1, \omega, \omega^2, \ldots, \omega^{n/2-1},\ 1, \omega^2, \omega^4, \ldots, \omega^{n/2-2},\ 1, \omega^4, \ldots, \omega^{n/2-4},\ \ldots,\ 1, 0\bigr].$

Precomputing $W$ saves asymptotically half of the multiplications in the FFT. For size $n$, both fft1 and fft2 do $\frac n2 \log_2 n$ multiplications. The duplication of the powers of $\omega$ means that all recursive calls in fft1 and fft2 access $W$ sequentially.

Codes like these, where the recursive calls update separate parts of the array, are easy to parallelize using Cilk. To parallelize fft1 in Cilk C we first make a new subroutine fft1cilk which executes the two recursive calls in parallel. In Figure 2 the two spawn directives do this. For small $n$ we do not want to start two new processes because of the cost of process management overhead. Thus for $n \le 2^{15}$, fft1cilk calls fft1, which runs sequentially.
cilk void block1( LONG n2, LONG *A, LONG b, LONG *W, LONG p ) {
   LONG i,s,t;
   for( i=0; i<b; i++ ) {
      s = A[i];
      t = mulmod(W[i],A[n2+i],p);
      A[   i] = addmod(s,t,p);
      A[n2+i] = submod(s,t,p);
   }
   return;
}

#define B 65536
#define FFTCUTOFF 32768

cilk void fft1cilk( LONG *A, LONG n, LONG *W, LONG p ) {
   LONG n2,q,r,i;
   if( n<=FFTCUTOFF ) { fft1(A,n,W,p); return; }
   n2 = n/2; q = n2/B; r = n2-q*B;
   spawn fft1cilk( A,    n2, W+n2, p );
   spawn fft1cilk( A+n2, n2, W+n2, p );
   sync; // wait for the two fft1cilk calls to finish
   for( i=0; i<q; i++ ) spawn block1( n2, A+i*B, B, W+i*B, p );
   if( r>0 ) spawn block1( n2, A+q*B, r, W+q*B, p );
   sync; // wait for all blocks to complete
   return;
}

Figure 2. Cilk C code for parallelizing fft1.

To obtain good parallel speedup we also need to parallelize the butterfly loop in fft1, because it does $n/2$ multiplications out of a total of $\frac n2 \log_2 n$ multiplications. We do this by executing large contiguous blocks of the loop in parallel. We could use Cilk's parallel for loop construct and let Cilk determine what block size to use, but this does not allow us to tune the block size for best performance. Instead we choose a blocksize $B$ and parallelize the loop explicitly, as shown in Figure 2. To get best performance, we reduced $B$ until the Cilk process overhead takes more than 1%, say, of the total time for the FFT. This yields good parallel speedup for many cores when $n$ is large. An alternative way to parallelize an FFT is to use a recursive FFT for creating parallel tasks for large $n$, as we have, but use a non-recursive FFT for smaller $n$. These and other strategies for parallelizing the FFT for multi-core computers are discussed by Franchetti et al.

4.2 Step 4: the Taylor shift

Steps 1, 2, 3, 5, and 6 of Algorithm 6 require $O(d)$ time. Steps 3 and 6 involve inverses in $\mathbb{F}_p$ which, because they require many divisions, are relatively very expensive. All the inverses can be replaced with multiplications as follows. After computing $d!$ in step 1, we compute the single inverse finv $= (d!)^{-1}$. For steps 3 and 6, we run the loop backwards and multiply by $i$ to get the next inverse, as follows:

for( i=d; i>=1; i-- ) {
   Pi[i] = mulmod(Pi[i],finv,p);
   finv = mulmod(finv,i,p);
}

We parallelized only the polynomial multiplication $\tilde L \cdot E$ in step 4. To multiply two polynomials of degree at most $d$, our FFT multiplication needs three arrays of size $n$, where $n = 2^k$ satisfies $2d < n \le 4d$. The data in Table 4 shows that the best parallel speedup we obtained for this step is 4.12. A factor of 4.12 on 10 cores is not very good, and it is due to the fact that we did not parallelize steps 1, 3 and 6, which each involve $d$ multiplications in $\mathbb{F}_p$. We did not do this because computing $P(z + \tau + \epsilon)$ is not a bottleneck of Algorithm 1. Of course, it is possible to parallelize steps 1, 3 and 6. For blocks of size $B$ we would need to compute the factorials $[B!, (2B)!, \ldots, d!]$ in parallel. We could do this by first computing the partial products of the blocks in parallel and then forming their prefix products. Now we can execute step 1 in blocks in parallel. Next we would compute the inverses of $[B!, (2B)!, \ldots, d!]$ so we can execute steps 3 and 6 in parallel. The structure of the whole shift routine is sketched below.
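The following serial sketch of Algorithm 6 is our own illustration, not the authors' routine; invmod, array64s and polmul64s (full product, as in Figure 3) are assumed from the surrounding library, and we assume $d < p - 1$.

/* Taylor shift: overwrite P[0..d] with the coefficients of P(z+tau). */
void taylorshift64s( LONG *P, LONG d, LONG tau, LONG p ) {
   LONG i, f, t;
   LONG *L = array64s(d+1), *E = array64s(d+1);
   LONG *H = array64s(2*d+1), *ifact = array64s(d+1);
   for( f=1, i=0; i<=d; i++ ) {          // L~ is the reversal of i!*P_i
      L[d-i] = mulmod(P[i],f,p);
      f = mulmod(f,i+1,p);               // after the loop, f = (d+1)!
   }
   f = mulmod(invmod(f,p),d+1,p);        // one inversion: f = 1/d!
   for( i=d; i>=1; i-- ) {               // inverse factorials, no divisions
      ifact[i] = f;
      f = mulmod(f,i,p);                 // f = 1/(i-1)!
   }
   ifact[0] = 1;
   for( t=1, i=0; i<=d; i++ ) {          // E_i = tau^i / i!
      E[i] = mulmod(t,ifact[i],p);
      t = mulmod(t,tau,p);
   }
   polmul64s(L,E,H,d,d,p);               // H~ = L~ * E (only low part used)
   for( i=0; i<=d; i++ )                 // P(z+tau)_j = H~_{d-j} / j!
      P[i] = mulmod(H[d-i],ifact[i],p);
   free(L); free(E); free(H); free(ifact);
}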
4.3 Step 5: the tangent Graeffe loop

Step 5 is the main computation in Algorithm 1. It has complexity $O(N M(d)) = O(M(d)\log(p/s))$. We successfully parallelized the FFTs and all $O(n)$ loops in blocks, except for the following data movement:

for( i=0; i<n/2; i++ ) A[i] = A[2*i];     // compress
for( i=0; i<n/2; i++ ) A[n/2+i] = A[i];   // duplicate

We tried parallelizing the above code by moving the even entries of $A$ into a temporary array $T$, in blocks, in parallel, then copying them back to $A$, twice, in blocks, in parallel. On 6 cores, the parallel code is slower than the serial code, so we have left this pure data movement not parallelized.

4.4 Step 8: parallelizing the evaluations

Each of the three evaluations $A(\omega^i)$, $A'(\omega^i)$ and $B(\omega^i)$ for $0 \le i < s$ in step 8 of Algorithm 1 can be done using the Bluestein transform in $O(M(s))$ time. Although step 8 does not dominate the time complexity, it needs the most space. We need at least $3s$ units of storage to store the values of $A(\omega^i)$, $A'(\omega^i)$ and $B(\omega^i)$. For $p = 5 \cdot 2^{55} + 1$ and $d = 10^9$, the requirement $4d \le s < 8d$ implies $s = 5 \cdot 2^{30}$, so storing these three arrays already takes well over 100 gigabytes. We chose to save half of this by choosing $2d \le s < 4d$ instead. This increases the time a little (see Table 3), because we obtain a smaller portion of the roots at each recursive call of Algorithm 1.

The Bluestein transform does one multiplication of two polynomials of degree $s$ plus $O(s)$ work. For a multiplication of degree $s$ we need an FFT of size $n = 2^k \ge 2s$; with the necessary temporary arrays this amounts to 192 gigabytes in our setting. We do not evaluate $A$, $A'$ and $B$ in parallel, as that would triple the 192 gigabytes to 576 gigabytes.

For $p = \sigma 2^k + 1$ with $\sigma$ small, because $s = \sigma 2^j$ for some $j$, we can save time and space by applying (6). We do $\sigma$ DFTs of size $2^j$, in blocks, in parallel, then multiplications by powers of $\omega$, in blocks, in parallel, followed by the remaining transforms of size $\sigma$, which we also do in parallel. This reduces the time for an evaluation by a factor between 6 and 12, since we replace 3 FFTs of size $n$, where $2s \le n < 4s$, by $\sigma$ FFTs of size $2^j$, which together are equivalent to about one FFT of size $s$. We need an additional $s$ units of storage for the FFT inputs of size $2^j$ and additional storage for the required array of root powers. Thus applying (6) reduces the temporary storage needed for step 8: for $p = 5 \cdot 2^{55} + 1$ and $d = 10^9$, this is about 24 gigabytes instead of 192.

We can now state the total storage needed by our implementation of Algorithm 1. We need to store the input $P(z)$ of degree $d$ and an array of size $d$ for the output roots $S$. We need space for three temporary polynomials of degree $d$. Including the memory for step 8, our implementation uses 121 gigabytes in total for $p = 5 \cdot 2^{55} + 1$ and $d = 10^9$.

4.5 Step 10: parallelizing the product tree multiplication algorithm

In step 10 we need to expand the polynomial $Q := \prod_{\alpha \in S}(z - \alpha)$. Let $n = |S|$ and suppose $S$ is stored in an array of size $n$ and $Q$ in an array of size $n+1$. We can compute $Q$ in $O(M(n)\log n)$ using fast multiplication and divide and conquer. The obvious way to do this is to split the roots in $S$ into two equal size groups, of size $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$, then recursively compute the two subproducts

$a = \prod_{i \le n/2}(z - S_i)$ and $b = \prod_{i > n/2}(z - S_i),$

and finally multiply $a$ by $b$ using fast multiplication for large $n$. In the literature this approach is called the product tree multiplication algorithm; see von zur Gathen and Gerhard, Algorithm 10.3. In Appendix A, the function mult implements the product tree algorithm in Magma. We use it to create the input polynomial $P = \prod_{\alpha \in S}(z - \alpha)$ to get Magma timings. However, mult is inefficient because all recursive calls allocate new memory for the inputs and products in the product tree. For large products we use our FFT based polynomial multiplication, which needs a temporary array $T$ of size $3N$ units of storage, where $N$ is the smallest power of two with $N \ge n$.
We do not want to allocate space for $T$ in each recursive call. Fortunately, $N$ is a power of 2, so we can divide $T$ into two halves and execute the two recursive calls in parallel using the two halves of $T$. But the size of $a$ plus the size of $b$ is $n + 2$: they don't fit in $Q$. The remedy is to note that $a$ and $b$ are both monic, so we do not need to store their leading coefficients.

Figure 3 presents C code for an in-place product tree algorithm which uses the input space $S$ of size $n$, the output space $Q$ of size $n+1$ and a temporary array $T$ of size $3N$ for all polynomial multiplications in the product tree. Our solution splits $S$ asymmetrically into a "large" set of size $m$, where $m$ is the power of two with $n/2 \le m < n$, and a "small" set of size $d = n - m$. For example, for $n = 768$ we use $m = 512$ and $d = 256$. To multiply $a \cdot b$, we first recursively compute $\bar a := a - x^m$ in $Q[0..m-1]$ and $\bar b := b - x^d$ in $Q[m..n-1]$. We copy the contents of $Q$ to the input array $S$ and we compute the product $\bar a \cdot \bar b$ in $Q$. For small $d$, we use an in-place multiplication with quadratic complexity. Then we add $x^d \bar a$ and $x^m \bar b$ to $Q$, so that $Q$ contains $a \cdot b - x^n$. Thus we have the following result.

Proposition 6. Let $S \in \mathbb{F}_p^n$ be of size $n$, let $Q$ be an array of size $n+1$ and $T$ an array of size $3N$, where $N$ is the smallest power of two with $N \ge n$. Assuming the FFT multiplication works within $T$, Algorithm treeproduct computes $Q = \prod_{1 \le i \le n}(z - S_i)$ in $O(M(n)\log n)$ arithmetic operations in $\mathbb{F}_p$, using only the memory of $Q$ and $T$.

In Figure 3, because the memory for $S$, $Q$ and $T$ in the two recursive calls is separated, we may execute the two recursive calls to treemul in parallel. This also holds recursively, so we can keep executing recursive calls in parallel, which is ideal for Cilk. In our parallel implementation of treemul we parallelize the two recursive calls if $n \ge 2^{16} = 65536$. We also parallelize the copy from $Q$ to $S$ and the FFTs (both the subalgorithms and their calls inside FFTmul64s). We do not parallelize the two polynomial additions.

#include <stdlib.h>
#define LONG long long int

// Input polynomials a and b mod p of degree da and db respectively.
// Output array c is space for the sum a + b mod p (product a b mod p).
LONG poladd64s( LONG *a, LONG *b, LONG *c, LONG da, LONG db, LONG p );
LONG polmul64s( LONG *a, LONG *b, LONG *c, LONG da, LONG db, LONG p );
LONG FFTmul64s( LONG *a, LONG *b, LONG *c, LONG da, LONG db, LONG *T, LONG p );

void treemul( LONG *S, LONG n, LONG *Q, LONG *T, LONG p ) {
   LONG i,m,d;
   if( n==1 ) { Q[0] = submodp(0,S[0],p); return; }
   for( m=1; 2*m<n; m=2*m );
   d = n-m;
   treemul(S,m,Q,T,p);          // compute a(x)-x^m in Q[0..m-1] using S[0..m-1] and T[0..3m-1]
   treemul(S+m,d,Q+m,T+3*m,p);  // and b(x)-x^d in Q[m..n-1] using S[m..n-1] and T[3m..]
   for( i=0; i<n; i++ ) S[i] = Q[i];   // S = [a0,a1,...,am-1,b0,b1,...,bd-1]
   if( d<10 ) polmul64s(S,S+m,Q,m-1,d-1,p);   // in-place classical multiplication
   else FFTmul64s(S,S+m,Q,m-1,d-1,T,p);       // FFT multiplication using T[0..6m-1]
   Q[n-1] = 0;
   poladd64s(Q+d,S,Q+d,m-1,m-1,p);      // add x^d (a-x^m) to Q[d..n-1]
   poladd64s(Q+m,S+m,Q+m,d-1,d-1,p);    // add x^m (b-x^d) to Q[m..n-1]
   // Q[n] = 1; is okay for this sequential code but not for a parallel code
   return;
}

LONG *array64s(LONG n); // return an array of size n

void treeproduct( LONG *S, LONG n, LONG *Q, LONG p ) {
   // S is an array of size n and Q of size n+1
   LONG N,*T;
   for( N=1; N<n; N=2*N );
   T = array64s(3*N);
   treemul(S,n,Q,T,p);
   free(T);
   Q[n] = 1;
   return;
}

Figure 3. C code for computing $Q = \prod_{\alpha \in S}(z - \alpha)$ in step 10.

Remark 10. By dividing the input asymmetrically so that the FFTs are fully utilized, we gain up to a factor of 2 in speed when $d$ is much smaller than $m$. Thus we get the benefit of using a truncated FFT for the polynomial multiplications without actually using one.
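As a small usage illustration of Figure 3 (our example, compiled together with the routines above), treeproduct writes the expanded polynomial into Q while destroying S:

#include <stdio.h>

/* expand (z-1)(z-2)(z-3) mod p = 7*2^26+1 */
int main(void) {
   LONG p = 7*67108864 + 1;     /* a smooth Fourier prime */
   LONG S[3] = {1, 2, 3};
   LONG Q[4];
   treeproduct(S, 3, Q, p);
   /* Q now holds z^3 - 6 z^2 + 11 z - 6, that is
      Q[0] = p-6, Q[1] = 11, Q[2] = p-6, Q[3] = 1 */
   printf("%lld %lld %lld %lld\n", Q[0], Q[1], Q[2], Q[3]);
   return 0;
}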
Remark 11. If we insert the statement Q[n] = 1; before the return statement in treemul, then the sequential code will still work. However, since both recursive calls then update overlapping parts of $Q$, the memory is no longer separated, so the two recursive calls cannot be executed in parallel.

5 Timing Results

We want to compare our new tangent Graeffe implementation with an implementation of the Cantor–Zassenhaus algorithm which uses fast polynomial arithmetic. One of the best implementations that we know of is Magma's Factorization command. Magma uses an implementation of Shoup's work by Allan Steel. For $P \in \mathbb{F}_p[z]$ of degree $d$ with $d$ distinct roots in $\mathbb{F}_p$, Magma uses the Cantor–Zassenhaus algorithm, which has time complexity $O(M(d)\log d \log p)$.

The timings in Tables 1 and 2 were obtained on an 8 core Intel Xeon E5-2660 CPU running at 2.2 GHz base and 3.0 GHz turbo. The input polynomials $P(z)$ of degree $d$ in Tables 1 and 2 were created with $d$ random distinct roots in $\mathbb{F}_p$. Table 1 is for the 28.8 bit prime $p = 7 \cdot 2^{26} + 1$ and Table 2 is for the 62.4 bit prime $p = 3 \cdot 29 \cdot 2^{56} + 1$. The timings in column NewTG are for our C implementation of Algorithm 1 with the parameter $s$ chosen in $[2d, 4d)$. The timings in column Magma are for the Factorization command of Magma version V2.25-5. Magma code for creating $P(z)$ and factoring it is given in Appendix A. We note that older versions of Magma have a quadratic sub-algorithm; this problem was fixed by Steel for version V2.25-5. We also note that Magma has faster arithmetic for smaller primes.

The columns in Tables 1 and 2 labeled Step5, Step8, Step10 and Div report the time spent in steps 5, 8, 10 and 11, respectively, in the top level call of Algorithm 1. They show that initially step 8, which costs $O(s \log s)$, dominates the cost of steps 5 and 10, which cost $O(M(d)\log(p/s))$ and $O(M(d)\log d)$ respectively. As $d$ increases, so does $s$, and the number of iterations $N$ of step 5 in Algorithm 1 drops; ultimately, the computation of the product $Q$ in step 10 dominates.

[Table 1. Timings in CPU seconds for $p = 7 \cdot 2^{26} + 1$ using $s \in [2d, 4d)$.]

[Table 2. Timings in CPU seconds for $p = 3 \cdot 29 \cdot 2^{56} + 1$ using $s \in [2d, 4d)$.]

The column labeled %roots reports the percentage of the roots found in the first tangent Graeffe application. Column N is the value of $N$ in step 5. Column FFT reports the time for one of the FFTs done in step 5. For example, in Table 2, for the row with $N = 41$, 13.9 seconds was spent in step 5; the code did $4 \cdot 41 + 2 = 166$ FFTs, which took $166 \times 0.0747 = 12.4$ seconds.

The timings in Table 1 show that the tangent Graeffe method is 100 times faster than Magma's implementation of Cantor–Zassenhaus at the largest degree shown. For the larger prime in Table 2, the tangent Graeffe method is 166 times faster at the largest degree shown.
Table 3 shows the effect of different choices for the parameter $s$ in Algorithm 1. If the images of the roots of $P(z+\tau)$ under the Graeffe transform $G_r$ are uniformly distributed among the $s$-th roots of unity, then the proportion of them that remain distinct is $e^{-d/s}$. Thus doubling $s$ increases the proportion of roots found at each recursive level of Algorithm 1, which saves time, but it also doubles the cost of the evaluations in step 8. Recall that the theoretically optimal performance is obtained by taking $s \approx \lambda_{\mathrm{opt}}\, d$, where $\lambda_{\mathrm{opt}} = 3.730$ for $d = 100000$ and $\lambda_{\mathrm{opt}} = 4.480$ for $d = 400000$.

Column $e^{-d/s}$ gives the expected proportion of roots that will be found. Column %roots is the actual percentage of roots found by Algorithm 1. Columns Step5, Step8 and Total are the time in step 5, step 8, and the total time. The data in Table 3 suggests choosing $4d \le s \le 8d$ to minimize time. For $4d \le s \le 8d$ the algorithm is expected to obtain between $e^{-1/4} = 0.779$ and $e^{-1/8} = 0.882$ of the roots. Our actual choice of $2d \le s < 4d$ increases the time by about 10% but saves a factor of 2 in space.

[Table 3. Tangent Graeffe timings in CPU seconds for $p = 5 \cdot 2^{55} + 1$ for various $s$, for $d = 100000$ and $d = 400000$.]

Table 4 shows the results for our parallel implementation, including timings for $P(z)$ of degree $10^9$. The timings in Table 4 were obtained on a server with a 10 core Intel E5 2680 v2 CPU with 128 gigabytes of RAM, running at 3.0 GHz base and 3.6 GHz turbo. The input polynomial $P(z)$ is created as before by choosing $d$ random distinct values from $\mathbb{F}_p$, with $p = 5 \cdot 2^{55} + 1$.

[Table 4. Real timings (in seconds) for $p = 5 \cdot 2^{55} + 1$ using $s \in [2d, 4d)$.]

The timings for computing the input polynomial $P(z) = \prod_{\alpha \in S}(z - \alpha)$, using our in-place parallel product tree multiplication from section 4.5, are reported in their own column. The total time for computing the roots of $P(z)$ is in column Total. Roughly speaking, our software takes 10 times longer to compute the roots of $P(z)$ (less for larger $d$) than it does to create $P(z)$! Timings in columns Step4, Step5, Step8, Step10 and Div are the times for those steps in the top-level call of Algorithm 1. Speedups are given for each computation. The reader can see that the parallel speedup is not as good for Step4 and the division Div. This is because we have only parallelized the FFTs in them; none of the steps of linear cost are parallelized.

6 Conclusion

The motivation for our work is the problem of sparse polynomial interpolation, where one seeks to recover the coefficients $c_i$ and monomials $x^{e_i}$ of a polynomial $f \in \mathbb{Z}[x_1, \ldots, x_n]$ with $t$ terms from values of $f$. The popular approach of Prony and Ben-Or–Tiwari needs only $2t$ values of $f$, but it needs to factor a polynomial $\Lambda(z) \in \mathbb{F}_p[z]$ of degree $t$ which has $t$ distinct roots in $\mathbb{F}_p$. Using the Cantor–Zassenhaus (CZ) algorithm, computing the roots of $\Lambda$ takes $O(M(t)\log t \log p)$ time. Because this is the most expensive step in Ben-Or/Tiwari sparse interpolation, and because CZ does gcd computations which are difficult to parallelize, research in sparse interpolation has tried different approaches that do not require root finding.
If we choose the prime $p$ of the form $\sigma 2^k + 1$ with $\sigma$ small, the new tangent Graeffe (TG) algorithm factors $\Lambda$ in $O(M(t)\log p)$ time. This is $\Theta(\log t)$ faster than CZ, but the constants actually matter in practice. In this work we improved the main step of TG by a factor of 2, and we showed that for large $p$, TG is faster than CZ by a factor of about $3\log t$. Our new C implementation of TG is over 100 times faster than Magma's C implementation of CZ on the tests we made. So TG really is a lot faster than CZ.

Another goal was to see if we could parallelize TG for multi-core computers. We found that it was not sufficient to only parallelize the underlying FFT. We also had to parallelize many sub-algorithms to get good parallel speedup. Here we contributed a new parallel in-place product tree algorithm. We were also successful in reducing the space needed, so that we could compute the roots of $P(z)$ of degree one billion on a 10 core computer with 128 gigabytes of RAM in about one hour.

It should be possible to reduce the time further by using vectorization. The sequential implementation of the FFTs could also be further improved using SIMD techniques and Harvey's approach for number theoretic FFTs.

Appendix A. Magma code

p := 3*29*2^56+1;
p := 7*2^26+1;
Fp := FiniteField(p);
R<x> := PolynomialRing(Fp);

mult := function( L )
   n := #L;
   if n eq 1 then return x-L[1];
   else
      m := n div 2;
      f := $$( L[1..m] ) * $$( L[m+1..n] );
      return f;
   end if;
end function;

d := 2^12-1;
S := { Random(Fp) : i in [1..d] };
while #S lt d do S := S join { Random(Fp) : i in [1..d-#S] }; end while;
#S;
L := [ x : x in S ];
time f := mult( L );
time g := Factorization(f);

Bibliography

A. V. Aho, K. Steiglitz and J. D. Ullman. Evaluating polynomials at fixed sets of points. SIAM J. Comput., 4:533–539, 1975.

M. Ben-Or and P. Tiwari. A deterministic algorithm for sparse multivariate polynomial interpolation. In Proc. STOC '88, pages 301–309, New York, NY, USA, 1988. ACM Press.

D. J. Bernstein. Fast multiplication and its applications. In Algorithmic Number Theory, pages 325–384. Mathematical Sciences Research Institute Publications, Cambridge University Press, United Kingdom, 2008.

L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.

A. Bostan, G. Lecerf and É. Schost. Tellegen's principle into practice. In Proc. ISSAC '03, pages 37–44. ACM Press, 2003.

R. P. Brent, F. G. Gustavson and D. Y. Y. Yun. Fast solution of Toeplitz systems of equations and computation of Padé approximants. J. Algorithms, 1(3):259–295, 1980.

J. Canny, E. Kaltofen and Y. Lakshman. Solving systems of non-linear polynomial equations faster. In Proc. ISSAC '89, pages 121–128. ACM Press, 1989.

D. G. Cantor and H. Zassenhaus. A new algorithm for factoring polynomials over finite fields. Math. Comp., 36(154):587–592, 1981.

E. Chu and A. George. Inside the FFT Black Box. Computational Mathematics Series. CRC Press, 2000.

S. Czapor, K. Geddes and G. Labahn. Algorithms for Computer Algebra. Kluwer Academic Publishers, 1992.

A. Díaz and E. Kaltofen. FoxBox: a system for manipulating symbolic objects in black box representation. In Proc. ISSAC '98, pages 30–37. ACM Press, 1998.

P. Fortin, A. Fleury, F. Lemaire and M. Monagan. High performance SIMD modular arithmetic for polynomial evaluation. Preprint, 2020.

F. Franchetti, M. Püschel, Y. Voronenko, S. Chellappa and J. Moura. Discrete Fourier transform on multicores. IEEE Signal Processing Magazine, 26(6):90–102, 2009.

T. S. Freeman, G. M. Imirzian, E. Kaltofen and Y. Lakshman. DAGWOOD: a system for manipulating polynomials given by straight-line programs. ACM Trans. Math. Software, 14:218–240, 1988.

M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on "Program Generation, Optimization, and Platform Adaptation".
M. Frigo, C. E. Leiserson and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proc. PLDI '98, pages 212–223. ACM, 1998.

J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013.

P. Giorgi, B. Grenet and D. S. Roche. Fast in-place algorithms for polynomial operations: division, evaluation, interpolation. Technical report, arXiv, 2020.

B. Grenet, J. van der Hoeven and G. Lecerf. Randomized root finding over finite fields using tangent Graeffe transforms. In Proc. ISSAC '15, pages 197–204, New York, NY, USA, 2015. ACM.

B. Grenet, J. van der Hoeven and G. Lecerf. Deterministic root finding over finite fields using Graeffe transforms. Applicable Algebra in Engineering, Communication and Computing, 27(3):237–257, 2016.

D. Harvey. Faster arithmetic for number-theoretic transforms. J. Symbolic Comput., 60:113–119, 2014.

J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC '04, pages 290–296. Univ. of Cantabria, Santander, Spain, July 4–7, 2004.

J. van der Hoeven. Probably faster multiplication of sparse polynomials. Technical report, HAL, 2020.

J. van der Hoeven and G. Lecerf. On the bit-complexity of sparse polynomial and series multiplication. J. Symbolic Comput., 50:227–254, 2013.

J. van der Hoeven and G. Lecerf. Sparse polynomial interpolation in practice. ACM Commun. Comput. Algebra, 48(3/4):187–191, 2015.

J. van der Hoeven, G. Lecerf and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Software, 43(1), Article 5, 2016.

J. van der Hoeven and M. Monagan. Implementing the tangent Graeffe root finding method. In A. M. Bigatti, J. Carette, J. H. Davenport, M. Joswig and T. de Wolff, editors, Mathematical Software, ICMS 2020, pages 482–492, Cham, 2020. Springer International Publishing.

J. van der Hoeven et al. GNU TeXmacs. https://www.texmacs.org, 1998.

J. Hu and M. B. Monagan. A fast parallel sparse polynomial GCD algorithm. In S. A. Abramov, E. V. Zima and X.-S. Gao, editors, Proc. ISSAC '16, pages 271–278. ACM, 2016.

M. A. Huang and A. J. Rao. Interpolation of sparse multivariate polynomials over large finite fields with applications. In Proc. SODA '96, pages 508–517, Philadelphia, PA, USA, 1996. Society for Industrial and Applied Mathematics.

M. Javadi and M. Monagan. Parallel sparse polynomial interpolation over finite fields. In Proc. PASCO '10, pages 160–168. ACM Press, 2010.

E. Kaltofen. Computing with polynomials given by straight-line programs I: greatest common divisors. In Proc. STOC '85, pages 131–142. ACM Press, 1985.

E. Kaltofen, Y. N. Lakshman and J.-M. Wiley. Modular rational sparse multivariate polynomial interpolation. In Proc. ISSAC '90, pages 135–139, New York, NY, USA, 1990. ACM Press.

E. Kaltofen and B. M. Trager. Computing with polynomials given by black boxes for their evaluations: greatest common divisors, factorization, separation of numerators and denominators. J. Symbolic Comput., 9(3):301–320, 1990.

E. Kaltofen and L. Yagati. Improved sparse multivariate polynomial interpolation algorithms. In Proc. ISSAC '88, pages 467–474. Springer Verlag, 1988.

R. Larrieu. The truncated Fourier transform for mixed radices. In Proc. ISSAC '17, pages 261–268, New York, NY, USA, 2017. ACM.

R. Moenck. Fast computation of GCDs. In Proc. STOC '73, pages 142–171, New York, 1973. ACM Press.

N. Möller and T. Granlund. Improved division by invariant integers. IEEE Transactions on Computers, 60:165–175, 2011.

H. Murao and T. Fujise. Modular algorithm for sparse multivariate polynomial interpolation and its parallel implementation. J. Symbolic Comput., 21:377–396, 1996.

J. M. Pollard. The fast Fourier transform in a finite field. Math. Comp., 25(114):365–374, 1971.

R. Prony. Essai expérimental et analytique sur les lois de la dilatabilité des fluides élastiques et sur celles de la force expansive de la vapeur de l'eau et de la vapeur de l'alkool, à différentes températures. J. de l'École Polytechnique, 1(cahier 22):24–76, 1795.

M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005.

D. S. Roche. What can (and can't) we do with sparse polynomials? In Proc. ISSAC '18, pages 25–30, New York, NY, USA, 2018. ACM.
V. Shoup. A new polynomial factorization algorithm and its implementation. J. Symbolic Comput., 20(4):363–397, 1995.

A. Steel. Private communication.