Implementing the tangent Graefferoot finding method

Consider a polynomial function

over a field

given through a black box capable of evaluating

at points in

. The problem of sparse interpolation is to recover the representation of

in its usual form, as a linear combination

of monomials

. One popular approach to sparse interpolation is to evaluate

at points in a geometric progression. This approach goes back to work of Prony in the eighteen's century [29] and became well known after Ben-Or and Tiwari's seminal paper [2]. It has widely been used in computer algebra, both in theory [6, 19, 21–24, 28] and in practice [8, 9, 15, 16, 18, 20]; see [30] for a nice survey.

More precisely, if a bound

for the number of terms

is known, then we first evaluate

pairwise distinct points

, where

and

for all

. The generating function of the evaluations at

satisfies the identity

where

and

is of degree

. The rational function

can be recovered from

using fast Padé approximation [5, 27]. For well chosen points

, it is often possible to recover the exponents

from the values

. If the exponents

are known, then the coefficients

can also be recovered using a transposed version of fast multipoint interpolation [4, 6]. This leaves us with the question how to compute the roots

in an efficient way.

For practical applications in computer algebra, we usually have

, in which case it is most efficient to use a multi-modular strategy. This means that we rather work with coefficients in a finite field

, where

is a prime number that we are free to choose. It is well known that polynomial arithmetic over

can be implemented most efficiently using FFTs when the order

of the multiplicative group is smooth. In practice, this prompts us to choose

of the form

for some small

and such that

fits into a machine word.

The traditional way to compute roots of polynomials over finite fields is using Cantor and Zassenhaus' method [7]. In [12, 13], alternative algorithms were proposed for our case of interest when

is smooth. The fastest algorithm was based on the tangent Graeffe transform and it gains a factor

with respect to Cantor–Zassenhaus' method. The aim of the present paper is to report on a parallel implementation of this new algorithm and on a few improvements that allow for a further constant speed-up.

In section 2, we start by recalling generalities about the Graeffe transform and the heuristic root finding method based on the tangent Graeffe transform from [12]. In section 3, we present the main new theoretical improvements, which all rely on optimizations in the FFT-model for fast polynomial arithmetic. Our contributions are threefold:

We also indicate how to generalize our methods to Graeffe transforms of general orders in section 3.4.

Section 4 is devoted to our new implementation. We first implemented a sequential version of the tangent Graeffe method in C, with the optimizations from sections 3.2 and 3.3. We also started experimenting with a parallel implementation in Cilk C. Our sequential implementation confirms the major performance improvement with respect to Cantor–Zassenhaus' algorithm. We also observed the gain of a new factor of two when using the new optimizations. So far, we have achieved a parallel speed-up by a factor of 4.6 on an 8-core machine. Our implementation is freely available at:

2.Root finding using the tangent Graeffe transform

2.1.Graeffe transforms

The traditional Graeffe transform of a monic polynomial

of degree

is the unique monic polynomial

of degree

such that

More generally, given

, we define the Graeffe transform of order to be the unique monic polynomial

of degree

such that

2.2.Root finding using tangent Graeffe transforms

Let

be a formal indeterminate with

. Elements in

are called tangent numbers. They are of the form

with

and basic arithmetic operations go as follows:

The definitions from the previous subsection readily extend to coefficients in

instead of

. Given

, we call

the tangent Graeffe transform of

of order

. We have

Now assume that we have an efficient way to determine the roots

. For some polynomial

, we may decompose

2.3.Heuristic root finding over smooth finite fields

Assume now that

is a finite field, where

is a prime number of the form

for some small

. Assume also that

be a primitive element of order

for the multiplicative group of

Let

be as in the previous subsection. The tangent Graeffe method can be used to efficiently compute those

for which

is a single root of

. In order to guarantee that there are a sufficient number of such roots, we first replace

for a random shift

, and use the following heuristic:

For a random shift

and any

, the assumption ensures with probability at least

that

has at least

single roots.

Now take

to be the largest power of two such that

and let

. By construction, note that

. The roots

are all

-th roots of unity in the set

. We may thus determine them by evaluating

for

. Since

, this can be done efficiently using a discrete Fourier transform. Combined with the tangent Graeffe method from the previous subsection, this leads to the following probabilistic algorithm for root finding:

Algorithm 1

Input:	of degree and only order one factors,
Output:	the set of roots of

If then return
Let be a random shift

Let be largest such that and let
Compute
Compute
For , set
Write and compute , , and for
If , then set , else set
Let be of order

For any with and ,

let
Compute
Recursively determine the set of roots of
Return

Remark 1. To compute

we may use

, which requires three polynomial multiplications in

of degree

. In total, step 5 therefore performs

such multiplications. We discuss how to perform step 5 efficiently in the FFT model in section 3.

Remark 2. For practical implementations, one may vary the threshold

for

and the resulting threshold

for

. For larger values of

, the computations of the DFTs in step 6 get more expensive, but the proportion of single roots goes up, so more roots are determined at each iteration. From an asymptotic complexity perspective, it would be best to take

. In practice, we actually preferred to take the lower threshold

, because the constant factor of our implementation of step 6 (based on Bluestein's algorithm [3]) is significant with respect to our highly optimized implementation of the tangent Graeffe method. A second reason we prefer

of size

instead of

is that the total space used by the algorithm is linear in

. In the future, it would be interesting to further speed up step 6 by investing more time in the implementation of high performance DFTs of general orders

Remark 3. For the application to sparse interpolation, it is possible to further speed up step 5 for the top-level iteration, which is the most expensive step. More precisely, for a polynomial with

terms, the idea is to take

and

of order

instead of

for some constant

with

. This reduces

(and the cost of the top-level iteration) by a factor of

. For the recursive calls, we still need to work with a primitive root of unity

of order

and random shifts.

3.Computing Graeffe transforms

3.1.Reminders about discrete Fourier transforms

Assume that

is invertible in

and let

be a primitive

-th root of unity. Consider a polynomial

. Then the discrete Fourier transform (DFT) of order

of the sequence

is defined by

We will write

for the cost of one discrete Fourier transform in terms of the number of operations in

and assume that

. For any

, we have

is invertible in

, then it follows that

. The costs of direct and inverse transforms therefore coincide up to a factor

This shows that a DFT of length

reduces to

transforms of length

plus

transforms of length

plus

multiplications in

It is sometimes convenient to apply DFTs directly to polynomials as well; for this reason, we also define

. Given two polynomials

with

, we may then compute the product

using

In particular, if

denotes the cost of multiplying two polynomials of degree

, then we obtain

Remark 4. In Algorithm 1, we note that step 6 comes down to the computation of three DFTs of length

. Since

is a power of two, this length is of the form

for some

. In view of (6), we may therefore reduce step 6 to

DFTs of length

plus

DFTs of length

. If

is very small, then we may use a naive implementation for DFTs of length

. In general, one may use Bluestein's algorithm [3] to reduce the computation of a DFT of length

into the computation of a product in

, which can in turn be computed using FFT-multiplication and three DFTs of length a larger power of two.

3.2.Graeffe transforms of order two

Let

be a field with a primitive

-th root of unity

. Let

be a polynomial of degree

. Then the relation (2) yields

can be obtained from

using

transpositions of elements in

. Concerning the inverse transform, we also note that

Algorithm 2

Input:	with and a primitive -th root of unity
Output:

Compute
For , set
For , compute
Return

Proof. We have already explained the correctness of Algorithm 2. Step 1 requires one forward DFT of length

and cost

. Steps 2 and 3 can be done in linear time

. Step 4 requires one inverse DFT of length

and cost

. The total cost of Algorithm 2 is therefore

Remark 6. In terms of the complexity of multiplication, we obtain

This gives a

improvement over the previously best known bound

that was used in [12]. Note that the best known algorithm for computing squares of polynomials of degree

. It would be interesting to know whether squares can also be computed in time

3.3.Graeffe transforms of power of two orders

Now assume that we computed the first Graeffe transform

using Algorithm 2 and that we wish to apply a second Graeffe transform to the result. Then we note that

is already known for

. We can use this to accelerate step 1 of the second application of Algorithm 2. Indeed, in view of (6) for

and

, we have

for

. In order to exploit this idea in a recursive fashion, it is useful to modify Algorithm 2 so as to include

in the input and

in the output. This leads to the following algorithm:

Algorithm 3

Input:	with , a primitive -th root of unity ,
	and
Output:	and

Set
Set
For , set
For , compute
Return and

Proof. It suffices to compute

and then to apply Algorithm 3 recursively,

times. Every application of Algorithm 3 now takes

operations in

, whence the claimed complexity bound.

Remark 8. In [12], Graeffe transforms of order

were directly computed using the formula (9), using

operations in

. The new algorithm is twice as fast for large

3.4.Graeffe transforms of arbitrary smooth orders

The algorithms from subsections 3.2 and 3.3 readily generalize to Graeffe transforms of order

for arbitrary

, provided that we have an

-th root of unity

. For convenience of the reader, we specified the generalization of Algorithm 3 below, together with the resulting complexity bounds.

Algorithm 4

Input:	with , , a primitive -th root of unity ,
	and
Output:	and

Set
For , set
For and , set
For , compute
Return and

Remark 11. In our application to root finding, we are interested in the efficient computation of Graeffe transforms of high order

. In terms of the size

, it is instructive to observe that the “average cost”

is minimal for

. Whenever radix-3 DFTs can be implemented as efficiently as radix-2 DFTs, then it follows that it might be interesting to use Graeffe transforms of order three.

3.5.Truncated Fourier transforms

is a fixed finite field, then DFTs are most efficient for sizes

that divide

. For our root finding application, it is often convenient to take

, in which case

should be a power of two or three times a power of two. The truncated Fourier transform was developed for the multiplication of polynomials such that the degree of the product does not have a nice size

of this type. It turns out that we may also use it for the efficient computation of Graeffe transforms of polynomials of arbitrary degrees. Moreover, the optimizations from the previous subsections still apply.

Let us briefly describe how the truncated Fourier transform can be used for the computation of Graeffe transforms of power of two orders. With the notations from subsections 3.2 and 3.3, we assume that

is a power of two as well and that we wish to compute the Graeffe transform of a polynomial

of degree

with

. Let

denote the reversal of a binary number

bits. For instance,

and

. Then the truncated Fourier of

at order

is defined by

It has been shown in [14] that

and

can both be computed in time

. For generalizations to arbitrary radices, we refer to [25].

for

. This provides us with the required counterpart of (8) for retrieving

efficiently from

. The relation (10) also has a natural counterpart:

Algorithm 5

Input:	with ,
	a primitive -th root of unity , and
Output:	and

Set
Set
For , set
For , compute
Return and

Proof. Straightforward adaptation of the proof of Proposition 7, while using [14].

3.6.Taylor shifts

In step 3 of Algorithm 1, we still need an algorithm to compute the Taylor shift

. If the characteristic of

exceeds

, then it is (not so) well known [1, Lemma 3] that this can be reduced to a single polynomial multiplication of degree

using the following algorithm:

Algorithm 6

Input:	of degree and
Output:

Return

It is interesting to observe that Taylor shifts can still be computed in time

in small characteristic, as follows:

Algorithm 7

Input:	of degree and
Output:

Define for where
Rewrite with for
For , replace
Return

4.Implementation and benchmarks

We have implemented the tangent Graeffe root finding algorithm (Algorithm 1) in C with the optimizations presented in section 3. Our C implementation supports primes of size up to 63 bits. In what follows all complexities count arithmetic operations in

In Tables 1 and 2, the input polynomial

of degree

is constructed by choosing

distinct values

for

at random and creating

. We will use

, a smooth 63 bit prime. For this prime

One goal we have is to determine how much faster the Tangent Graeffe (TG) root finding algorithm is in practice when compared with the Cantor-Zassenhaus (CZ) algorithm which is implemented in many computer algebra systems. In Table 1 we present timings comparing our sequential implementation of the TG algorithm with Magma's implementation of the CZ algorithm.

The timings in Table 1 are sequential timings obtained on a a Linux server with an Intel Xeon E5-2660 CPU with 8 cores. In Table 1 the time in column “first” is for the first application of the TG algorithm (steps 1 – 9 of Algorithm 1 ) showing that it obtains about 69% of the roots. The time in column “total” is the total time for the algorithm. Columns step 5 , step 6 , and step 9 report the time spent in steps 5 , 6 , and 9 in Algorithm 1 and do not count time in the recursive call in step 10 .

The Magma column timing is for Magma's Factorization command. The timings for Magma version V2.25-3 suggest that Magma's CZ implementation involves a subalgorithm with quadratic asymptotic complexity. Indeed it turns out that the author of the code implemented all of the sub-quadratic polynomial arithmetic correctly, as demonstrated by the second set of timings for Magma in column V2.25-5, but inserted the

linear factors found into a list using linear insertion! Allan Steel of the Magma group identified and fixed the offending subroutine for Magma version V2.25-5. The timings show that TG is faster than CZ by a factor of 76.6 (=8.43/0.11) to 146.3 (=2809/19.2).

4.1.Implementation notes

To implement the Taylor shift

in step 3, we used the

method from [1, Lemma 3]. The better known method presented in [11] is

. For step 5 we use Algorithm 3 as presented. It has complexity

. To evaluate

and

in step 6 in

we used the Bluestein transformation [3]. In step 9 to compute the product

, for

roots, we used the

tree multiplication algorithm [11]. The division in step 10 is done in

with the fast division.

The sequential timings in Tables 1 and 2 show that steps 5, 6 and 9 account for about 91% of the total time. We parallelized these three steps as follows. For step 5, the two forward and two inverse FFTs are done in parallel. We also parallelized our radix 2 FFT by parallelizing recursive calls for size

and the main loop in blocks of size

as done in [26]. For step 6 there are three applications of Bluestein to compute

and

. We parallelized these (thereby doubling the overall space used by our implementation). The main computation in the Bluestein transformation is a polynomial multiplication of two polynomials of degree

. The two forward FFTs are done in parallel and the FFTs themselves are parallelized as for step 5. For the product in step 9 we parallelize the two recursive calls in the tree multiplication for large sizes and again, the FFTs are parallelized as for step 5.

To improve parallel speedup we also parallelized the polynomial multiplication in step 3 and the computation of the roots in step 8. Although step 8 is

, it is relatively expensive because of two inverse computations in

. Because we have not parallelized about 5% of the computation the maximum parallel speedup we can obtain is a factor of

. The best overall parallel speedup we obtained is a factor of 4.6=1465/307.7 for

In Cilk, for each recursive C subroutine we wish to parallelize, one first needs to make a parallel version of that routine. For large recursive calls, the parallel version does them in parallel. For smaller recursive calls, the parallel version needs to call the sequential version of the code to avoid the Cilk process management overhead. The cutoff for the size of the input for which we need to use the sequential code has to be determined by experiment.

Bibliography

[1]: A. V. Aho, K. Steiglitz, and J. D. Ullman. Evaluating polynomials on a fixed set of points. SIAM Journ. of Comp., 4:533–539, 1975.
[2]: M. Ben-Or and P. Tiwari. A deterministic algorithm for sparse multivariate polynomial interpolation. In STOC '88: Proceedings of the twentieth annual ACM symposium on Theory of computing, pages 301–309. ACM Press, 1988.
[3]: Leo I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970.
[4]: A. Bostan, G. Lecerf, and É. Schost. Tellegen's principle into practice. In ISSAC '03: Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, pages 37–44. ACM Press, 2003.
[5]: R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun. Fast solution of Toeplitz systems of equations and computation of Padé approximants. J. Algorithms, 1(3):259–295, 1980.
[6]: J. Canny, E. Kaltofen, and Y. Lakshman. Solving systems of non-linear polynomial equations faster. In Proceedings of the ACM-SIGSAM 1989 International Symposium on Symbolic and Algebraic Computation, pages 121–128. ACM Press, 1989.
[7]: D. G. Cantor and H. Zassenhaus. A new algorithm for factoring polynomials over finite fields. Math. Comp., 36(154):587–592, 1981.
[8]: A. Díaz and E. Kaltofen. FOXFOX: a system for manipulating symbolic objects in black box representation. In ISSAC '98: Proceedings of the 1998 International Symposium on Symbolic and Algebraic Computation, pages 30–37. ACM Press, 1998.
[9]: T. S. Freeman, G. M. Imirzian, E. Kaltofen, and Y. Lakshman. DAGWOOD: a system for manipulating polynomials given by straight-line programs. ACM Trans. Math. Software, 14:218–240, 1988.
[10]: M. Frigo, C.E. Leisorson, and R.K. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of PLDI 1998, pages 212–223. ACM, 1998.
[11]: J. von zur Gathen and J. Gerhard. Modern computer algebra. Cambridge University Press, New York, 3rd edition, 2013.
[12]: B. Grenet, J. van der Hoeven, and G. Lecerf. Randomized root finding over finite fields using tangent Graeffe transforms. In Proc. ISSAC '15, pages 197–204. New York, NY, USA, 2015. ACM.
[13]: B. Grenet, J. van der Hoeven, and G. Lecerf. Deterministic root finding over finite fields using Graeffe transforms. AAECC, 27(3):237–257, 2016.
[14]: J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290–296. Univ. of Cantabria, Santander, Spain, July 4–7 2004.
[15]: J. van der Hoeven and G. Lecerf. On the bit-complexity of sparse polynomial and series multiplication. J. Symbolic Comput., 50:227–254, 2013.
[16]: J. van der Hoeven and G. Lecerf. Sparse polynomial interpolation in practice. ACM Commun. Comput. Algebra, 48(3/4):187–191, 2015.
[17]: J. van der Hoeven et al. GNU TeXmacs. http://www.texmacs.org, 1998.
[18]: J. Hu and M. B. Monagan. A fast parallel sparse polynomial GCD algorithm. In S. A. Abramov, E. V. Zima, and X.-S. Gao, editors, Proc. ISSAC '16, pages 271–278. ACM, 2016.
[19]: M. A. Huang and A. J. Rao. Interpolation of sparse multivariate polynomials over large finite fields with applications. In SODA '96: Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms, pages 508–517. Philadelphia, PA, USA, 1996. Society for Industrial and Applied Mathematics.
[20]: M. Javadi and M. Monagan. Parallel sparse polynomial interpolation over finite fields. In M. Moreno Maza and J.-L. Roch, editors, PASCO '10: Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, pages 160–168. ACM Press, 2010.
[21]: E. Kaltofen. Computing with polynomials given by straight-line programs I: greatest common divisors. In STOC '85: Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, pages 131–142. ACM Press, 1985.
[22]: E. Kaltofen, Y. N. Lakshman, and J.-M. Wiley. Modular rational sparse multivariate polynomial interpolation. In ISSAC '90: Proceedings of the International Symposium on Symbolic and Algebraic Computation, pages 135–139. New York, NY, USA, 1990. ACM Press.
[23]: E. Kaltofen and B. M. Trager. Computing with polynomials given by black boxes for their evaluations: greatest common divisors, factorization, separation of numerators and denominators. J. Symbolic Comput., 9(3):301–320, 1990.
[24]: E. Kaltofen and L. Yagati. Improved sparse multivariate polynomial interpolation algorithms. In ISSAC '88: Proceedings of the International Symposium on Symbolic and Algebraic Computation, pages 467–474. Springer Verlag, 1988.
[25]: R. Larrieu. The truncated Fourier transform for mixed radices. In Proc. ISSAC '17, pages 261–268. New York, NY, USA, 2017. ACM.
[26]: M. Law and M. Monagan. A parallel implementation for polynomial multiplication modulo a prime. In Proc. of PASCO 2015, pages 78–86. ACM, 2015.
[27]: R. Moenck. Fast computation of GCDs. In Proc. of the 5th ACM Annual Symposium on Theory of Computing, pages 142–171. New York, 1973. ACM Press.
[28]: H. Murao and T. Fujise. Modular algorithm for sparse multivariate polynomial interpolation and its parallel implementation. JSC, 21:377–396, 1996.
[29]: R. Prony. Essai expérimental et analytique sur les lois de la dilatabilité des fluides élastiques et sur celles de la force expansive de la vapeur de l'eau et de la vapeur de l'alkool, à différentes températures. J. de l'École Polytechnique Floréal et Plairial, an III, 1(cahier 22):24–76, 1795.
[30]: D. S. Roche. What can (and can't) we do with sparse polynomials? In C. Arreche, editor, ISSAC '18: Proceedings of the 2018 ACM International Symposium on Symbolic and Algebraic Computation, pages 25–30. ACM Press, 2018.
[31]: V. Shoup. A new polynomial factorization and its implementation. J. Symbolic Computation, 20(4):363–397, 1995.

1.Introduction

2.Root finding using the tangent Graeffe transform

2.1.Graeffe transforms

2.2.Root finding using tangent Graeffe transforms

2.3.Heuristic root finding over smooth finite fields

3.Computing Graeffe transforms

3.1.Reminders about discrete Fourier transforms

3.2.Graeffe transforms of order two

3.3.Graeffe transforms of power of two orders

3.4.Graeffe transforms of arbitrary smooth orders

3.5.Truncated Fourier transforms

3.6.Taylor shifts

4.Implementation and benchmarks

4.1.Implementation notes

Bibliography