
In previous work, we have introduced the technique of relaxed power series computations. With this technique, it is possible to solve implicit equations almost as quickly as performing the operations which occur in the implicit equation. Here "almost as quickly" means that we have to pay a logarithmic overhead. In this paper, we will show how to reduce this logarithmic factor in the case when the constant ring has sufficiently many 2^p-th roots of unity.

Let C be an effective ring and consider two power series f and g in C[[z]]. In this paper, we will be concerned with the efficient computation of the first n coefficients of the product h = f g.
If the first n coefficients of f and g are known beforehand, then we may use any fast multiplication algorithm for polynomials in order to achieve this goal, such as divide and conquer multiplication [KO63, Knu97], which has time complexity O(n^(log 3/log 2)), or FFT multiplication [CT65, SS71, CK91, vdH02a], which has time complexity O(n log n log log n).
For certain computations, and most importantly the resolution of implicit equations, it is interesting to use so-called "relaxed algorithms", which output the i-th coefficient of h = f g as soon as the first i + 1 coefficients of f and g are known, for each i < n. This allows for instance the computation of the exponential g = exp f of a series f with f_0 = 0 using the formula
(1)  g = 1 + ∫ f' g.
More precisely, this formula shows that the computation of g = exp f reduces to one differentiation, one relaxed product and one relaxed integration. Differentiation and relaxed integration being linear in time, it follows that n terms of exp f can be computed in time O(R(n)), where R(n) denotes the time complexity of relaxed multiplication at order n. In [vdH97, vdH02a], we proved the following theorem:
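To make the mechanism concrete, here is a minimal Python sketch (our own illustration, not the fast algorithm of this paper), which computes exp f via a naive O(n^2) relaxed product: the coefficient of order i - 1 of f' g only involves g_0, ..., g_{i-1}, so it becomes available exactly when the relaxed integration needs it. The function name and the use of rational coefficients are ours.

```python
from fractions import Fraction

def exp_series(f, n):
    """First n coefficients of g = exp(f), where f[0] == 0, via the
    relation g = 1 + integral(f' * g).  The product f' * g is computed
    in a relaxed (on-line) fashion, naively in O(n^2): coefficient i-1
    of f' * g only involves g[0], ..., g[i-1], which are already known."""
    assert f[0] == 0
    g = [Fraction(1)]                      # g_0 = exp(0) = 1
    for i in range(1, n):
        # coefficient i-1 of f' * g, with (f')_k = (k+1) f_{k+1}
        c = sum((k + 1) * f[k + 1] * g[i - 1 - k] for k in range(i))
        g.append(c / i)                    # relaxed integration
    return g

f = [Fraction(0), Fraction(1)] + [Fraction(0)] * 8     # f = z
print([str(c) for c in exp_series(f, 5)])  # ['1', '1', '1/2', '1/6', '1/24']
```

The point is that each new coefficient of the product is requested only after all the coefficients of g it depends on have been output, so the implicit equation resolves itself coefficient by coefficient.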
Theorem 1. Let M(n) denote the time complexity of zealous multiplication at order n. Then relaxed multiplication at order n can be carried out in time R(n) = O(M(n) log n) and space complexity O(n).
In this paper, we will improve the time complexity bound in this theorem in the case when C admits 2^p-th roots of unity for any p. In section 2, we first reduce this problem to the case of "semirelaxed multiplication", when one of the arguments is fixed and the other one relaxed. More precisely, let f and g be power series, such that g is known up to order n. Then a semirelaxed multiplication algorithm computes the product f g up to order n and outputs its i-th coefficient as soon as f_0, ..., f_i are known, for all i < n. In section 3, we show that the log n overhead in theorem 1 can be reduced to O(sqrt(log n)). In section 4, the technique of section 3 is further improved so as to yield an O(e^(2 sqrt(log 2 log log n))) overhead.
In the sequel, we will use the following notations from [vdH02a]: we denote by C[[z]]_n the set of power series truncated at order n, like f = f_0 + f_1 z + ... + f_{n-1} z^{n-1}. Given f in C[[z]] and i <= j, we will denote f_{i..j} = f_i z^i + ... + f_{j-1} z^{j-1}.
Remark
An interesting question is whether even better time complexities can be obtained (in analogy with FFT multiplication). However, we have not managed so far to reduce the cost of relaxed multiplication to O(M(n)) or O(M(n) log log n). Nevertheless, it should be noticed that the function e^(2 sqrt(log 2 log log n)) grows very slowly; in practice, it behaves very much like a constant (see section 5).
In [vdH97, vdH02a], we have stated several fast algorithms for relaxed multiplication. Let us briefly recall some of the main concepts and ideas. For details, we refer to [vdH02a]. Throughout this section, f and g are two power series in C[[z]].
We call
(2)  f_{0..n} g_{0..n}
the full product of f and g at order n, and
(3)  (f g)_{0..n}
the truncated product of f and g at order n.
We will denote by M(n), R(n) and S(n) the time complexities of full zealous, relaxed and semirelaxed multiplication at order n, where it is understood that the ring operations in C can be performed in time O(1). We notice that full zealous multiplication is equivalent to polynomial multiplication. Hence, classical fast multiplication algorithms can be applied in this case [KO63, Too63, Coo66, CT65, SS71, CK91, vdH02a].
The main idea behind efficient algorithms for relaxed multiplication is to anticipate future computations. More precisely, the computation of a full product (2) can be represented by an n x n square with entries f_i g_j, 0 <= i, j < n. As soon as f_0, ..., f_i and g_0, ..., g_i are known, it becomes possible to compute the contributions of all products f_{i'} g_{j'} with i' <= i and j' <= i, even though the contributions to the coefficients beyond order i are not yet needed. The next idea is to subdivide the square into smaller squares, in such a way that the contribution of each small square to the result can be computed using a zealous algorithm. Now the contribution of such a small square {i1, ..., i2 - 1} x {j1, ..., j2 - 1} is of the form f_{i1..i2} g_{j1..j2}. Therefore, the requirement max(i2, j2) <= i1 + j1 + 1 suffices to ensure that the resulting algorithm will be relaxed. In the left hand image of figure 1, we have shown the subdivision from the main algorithm of [vdH97, vdH02a], which has time complexity O(M(n) log n).
There is an alternative interpretation of the left hand image in figure 1: when interpreting the big square as a multiplication f g at order 2n, we may regard it as the sum
f g = f_{0..n} g_{0..n} + f_{n..2n} g_{0..n} + f_{0..n} g_{n..2n} + f_{n..2n} g_{n..2n}
of four multiplications. Now f_{0..n} g_{0..n} is a relaxed multiplication at order n, but f_{n..2n} g_{0..n} is even semirelaxed, since g_0, ..., g_{n-1} are already known by the time that we need f_n. Similarly, f_{0..n} g_{n..2n} corresponds to a semirelaxed product and f_{n..2n} g_{n..2n} to a zealous product. This shows that
R(2n) <= R(n) + 2 S(n) + M(n) + O(n).
Similarly, we have
S(2n) <= 2 S(n) + M(n) + O(n),
as illustrated in the right hand image of figure 1. Under suitable regularity hypotheses for M(n) and S(n), the above relations imply:
Theorem. (a) S(n) = O(M(n) log n). (b) R(n) = O(S(n)).
A consequence of part (b) of the theorem is that it suffices to design fast algorithms for semirelaxed multiplication in order to obtain fast algorithms for relaxed multiplication. This fact may be reinterpreted by observing that the fast relaxed multiplication algorithm actually applies Newton's method in a hidden way. Indeed, since Brent and Kung [BK78], it is well known that Newton's method can also be used in the context of formal power series in order to solve differential or functional equations. One step of Newton's method at order 2n involves the recursive application of the method at order n and the resolution of a linear equation at order n. The resolution of the linear equation corresponds to the computation of the two semirelaxed products.
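The four-quadrant decomposition can be checked directly in code. The sketch below (our own illustration, with schoolbook products standing in for the relaxed, semirelaxed and zealous algorithms) verifies that the four sub-products reassemble into the full product:

```python
def mul(a, b):
    """Zealous (schoolbook) product of two coefficient lists."""
    h = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            h[i + j] += ai * bj
    return h

def add(a, b):
    """Add coefficient lists of possibly different lengths."""
    m = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(m)]

def shift(a, k):
    """Multiply a coefficient list by z^k."""
    return [0] * k + a

n = 4
f = [1, 2, 3, 4, 5, 6, 7, 8]    # series at order 2n
g = [8, 7, 6, 5, 4, 3, 2, 1]
flo, fhi = f[:n], f[n:]
glo, ghi = g[:n], g[n:]

quadrants = add(add(mul(flo, glo),                 # relaxed quadrant
                    shift(mul(fhi, glo), n)),      # semirelaxed quadrant
                add(shift(mul(flo, ghi), n),       # semirelaxed quadrant
                    shift(mul(fhi, ghi), 2 * n)))  # zealous quadrant
assert quadrants == mul(f, g)
```

In an actual implementation, only the low-low quadrant needs a genuinely relaxed algorithm; the two mixed quadrants may wait until a whole half of one argument is known, and the high-high quadrant until both halves are.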
Assume from now on that C admits an n-th root of unity omega_n for every power of two n. Given an element f in C[[z]]_n, let FFT(f) in C^n denote its Fourier transform
FFT(f)_i = f(omega_n^i)  (0 <= i < n),
and let FFT^(-1) be the inverse mapping of FFT. It is well known that both FFT and FFT^(-1) can be computed in time O(n log n). Furthermore, if f, g in C[[z]]_n are such that f g in C[[z]]_n, then
FFT(f g) = FFT(f) FFT(g),
where the product in C^n is the componentwise (scalar) multiplication.
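For concreteness, here is a small self-contained illustration of this property over the complex numbers (a direct O(n^2) DFT stands in for a fast FFT; the function names are ours):

```python
import cmath

def dft(a, inverse=False):
    """Direct O(n^2) discrete Fourier transform over the complex
    numbers, using the n-th root of unity exp(2*pi*i/n).  A real
    implementation would use an O(n log n) FFT instead."""
    n = len(a)
    s = 1 if inverse else -1
    out = [sum(a[j] * cmath.exp(s * 2j * cmath.pi * j * k / n)
               for j in range(n)) for k in range(n)]
    return [x / n for x in out] if inverse else out

# Multiply f = 1 + 2z and g = 3 + 4z: pad to length 4 so that the
# product (degree 2) still fits, transform, multiply pointwise,
# and transform back.
f = [1, 2, 0, 0]
g = [3, 4, 0, 0]
h = dft([x * y for x, y in zip(dft(f), dft(g))], inverse=True)
print([round(c.real) for c in h])   # [3, 10, 8, 0], i.e. 3 + 10z + 8z^2
```

The zero-padding ensures that the product of the two inputs does not wrap around, which is exactly the condition f g in C[[z]]_n under which the pointwise product of the transforms corresponds to the polynomial product.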
Now consider a decomposition n = n1 n2 with n1, n2 >= 1. Then a truncated power series f in C[[z]]_n can be rewritten as a series
f = f^(0) + f^(1) Z + ... + f^(n1 - 1) Z^(n1 - 1)
in Z = z^(n2), where f^(i) = f_{i n2 .. (i + 1) n2} / z^(i n2) in C[[z]]_{n2}. This series may again be reinterpreted as a series F in C[[z]]_{2 n2}[[Z]]_{n1}, and we have
f = phi(F),
where phi is the mapping which substitutes z^(n2) for Z. Also, the FFT transform may be extended to a coefficientwise mapping
FFT: C[[z]]_{2 n2}[[Z]]_k -> (C^(2 n2))[[Z]]_k
for each k, and similarly for its inverse FFT^(-1). Now the formula
f g = phi(FFT^(-1)(FFT(F) FFT(G)))
yields a way to compute f g by reusing the Fourier transforms of the "bunches of coefficients" FFT(f^(i)) and FFT(g^(j)) many times.
In the context of a semirelaxed multiplication with fixed argument g, the above scheme almost reduces the computation of an order n product with coefficients in C to the computation of an order n1 product with coefficients in C^(2 n2). The only problem which remains is that we can only compute FFT(f^(i)) when f_{i n2}, ..., f_{(i + 1) n2 - 1} are all known. Consequently, the products f^(i) g^(0) should be computed apart, using a traditional semirelaxed multiplication. In other words, we have reduced the computation of a semirelaxed product at order n with coefficients in C to the computation of n1 semirelaxed products at order n2 with coefficients in C, one semirelaxed product at order n1 with coefficients in C^(2 n2) and O(n1) FFT transforms of length 2 n2. This has been illustrated in figure 2.
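The zealous core of this scheme, transforming each block exactly once and reusing the transforms for all pairwise block products, can be sketched as follows (our own simplified illustration: it omits the semirelaxed treatment of the diagonal described above, and a direct O(n^2) DFT stands in for a fast FFT):

```python
import cmath

def dft(a, inverse=False):
    # direct O(n^2) DFT over the complex numbers; stands in for a fast FFT
    n = len(a)
    s = 1 if inverse else -1
    out = [sum(a[j] * cmath.exp(s * 2j * cmath.pi * j * k / n)
               for j in range(n)) for k in range(n)]
    return [x / n for x in out] if inverse else out

def blocked_mul(f, g, n1, n2):
    """Full product of f and g (each of length n = n1 * n2) that
    transforms each length-n2 block only once (zero-padded to 2*n2)
    and reuses the transforms for all n1 * n1 pairwise block products."""
    F = [dft(f[i*n2:(i+1)*n2] + [0]*n2) for i in range(n1)]  # one FFT per block
    G = [dft(g[j*n2:(j+1)*n2] + [0]*n2) for j in range(n1)]
    h = [0.0] * (2 * n1 * n2)
    for i in range(n1):
        for j in range(n1):
            # pointwise product in C^(2*n2): the block transforms are reused
            block = dft([a * b for a, b in zip(F[i], G[j])], inverse=True)
            for k, c in enumerate(block):    # substitute z^(n2) for Z
                h[(i + j) * n2 + k] += c.real
    return [round(c) for c in h]

print(blocked_mul([1, 2, 3, 4], [5, 6, 7, 8], 2, 2))  # [5, 16, 34, 60, 61, 52, 32, 0]
```

Each of the n1 blocks of f and of g is transformed once, while the double loop reuses these transforms n1 times each; in the semirelaxed version, the transforms of the blocks of the fixed argument g can even be precomputed once and for all.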
In order to obtain an efficient algorithm, we may choose n1 ≈ n2 ≈ sqrt(n):
Proof. In view of section 2, it suffices to consider the case of a semirelaxed product. Let T(n) denote the time complexity of the above method. Then we observe that T(n) is bounded by the cost n1 T(n2) of the semirelaxed products over C, plus the cost of the semirelaxed product over C^(2 n2), plus the cost of the FFT transforms. Taking n1 ≈ n2 ≈ sqrt(n), we obtain a recursive bound for T(n) in terms of T(sqrt(n)), from which we deduce the stated time complexity bound. Similarly, the space complexity satisfies a recursive bound with an O(n) contribution at each level. Setting n1 ≈ n2 ≈ sqrt(n), it follows that the space usage is dominated by the top level of the recursion. Consequently, the announced space bound follows as well, which completes the proof.
More generally, if n = n1 n2 n3 n4, say, then we may reduce the computation of a semirelaxed product with coefficients in C to the computation of
n2 n3 n4 semirelaxed products over C at order n1;
FFT transforms of length 2 n1;
n3 n4 semirelaxed products over C^(2 n1) at order n2;
FFT transforms of length 2 n2;
n4 semirelaxed products over (C^(2 n1))^(2 n2) at order n3;
FFT transforms of length 2 n3;
one semirelaxed product over ((C^(2 n1))^(2 n2))^(2 n3) at order n4.
This computation is illustrated in figure 3. From the complexity point of view, it leads to the following theorem:
Proof. In view of theorem 10(b), it suffices to consider the case of a semirelaxed product. Denoting by T(n) the time complexity of the above method, we have
(4) 
Let
Taking in (4), it follows for any that
(5) 
Applying this relation p times, we obtain
(6) 
For a fixed such that is an integer, we obtain
(7) 
The minimum of this bound is reached when its derivative w.r.t. p cancels. This happens for
Plugging this value into (7), we obtain
Substitution of finally gives the desired estimate
(8) 
To be completely precise, we notice that we really proved (7) and (8) only for values of p and n of a special form. Of course, we may always replace p and n by larger values which do have this form. Since these replacements only introduce additional constant factors in the complexity bounds, the bound (8) holds for general n.
As to the space complexity, we have
Let
Taking , it follows for any that
for some fixed constant. Applying this bound p times, we obtain
For , this bound simplifies to
Taking and as above, it follows that
Substitution of finally gives us the desired estimate
for the space complexity. For similar reasons as above, the bound holds for general n.
We implemented the algorithm from section 3 in the Mmxlib library.
Moreover, our implementation uses a truncated version of relaxed multiplication [vdH02a, Section 4.4.2]. In particular, the use of naive multiplication on the FFT-ed blocks allows us to gain a constant factor at the top level. For small values of n, we also replaced FFT transforms by "Karatsuba transforms": given a polynomial P of degree < 2^k, we may form a polynomial P~ in k variables x_1, ..., x_k with coefficients P~_{i_1, ..., i_k} = P_{i_1 + 2 i_2 + ... + 2^(k-1) i_k} for i_1, ..., i_k in {0, 1}. Then the Karatsuba transform of P is the vector of size 3^k obtained by replacing, in each variable x_l, the pair of coefficients (a, b) of a + b x_l by the triple (a, a + b, b).
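The k = 1 case of the Karatsuba transform is easy to make explicit: a block a + b x is transformed into the triple (a, a + b, b), transforms are multiplied pointwise, and an inverse transform recovers the product. A small sketch (names and framing are ours):

```python
def kt(p):
    """One-level Karatsuba transform of p = a + b*x: the three values
    (a, a+b, b).  For degree < 2^k one applies this to each of the k
    variables in turn, giving a vector of size 3^k."""
    a, b = p
    return (a, a + b, b)

def kt_inverse(v):
    """Recover the coefficients of the (degree < 3) product from the
    pointwise product of two transforms."""
    p0, p1, p2 = v                   # p0 = ac, p1 = (a+b)(c+d), p2 = bd
    return (p0, p1 - p0 - p2, p2)    # middle coefficient is ad + bc

f = (3, 4)    # f = 3 + 4x
g = (5, 6)    # g = 5 + 6x
prod = kt_inverse(tuple(x * y for x, y in zip(kt(f), kt(g))))
print(prod)   # (15, 38, 24): f * g = 15 + 38x + 24x^2
```

Applying this to each of the k variables in turn maps a vector of 2^k coefficients to one of 3^k values, which is exactly the multiplication count of Karatsuba's algorithm.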
We have tested both (truncated) relaxed and semirelaxed multiplication for different types of coefficients on an Intel Xeon processor. The results of our benchmarks can be found in tables 1 and 2 below. Our benchmarks start at the order where FFT multiplication becomes useful. Notice that working with intermediate orders does not give us any significant advantage, because the top-level product on FFT-ed blocks is naive. In table 1, the choice of n1 as a function of n has been optimized for complex double coefficients. No particular optimization effort was made for the coefficient types in table 2, and it might be possible to gain somewhat on our timings.




Remark
On the negative side, these theoretically fast algorithms have bad space complexities and they are difficult to implement. In order to obtain good timings, it seems to be necessary to use dedicated code generation at different (ranges of) orders n, which can be done using the Mathemagix system.
We have shown how to improve the complexity of relaxed multiplication in the case when the coefficient ring admits sufficiently many 2^p-th roots of unity. The improvement is based on reusing FFT transforms of pieces of the multiplicands at different levels of the underlying binary splitting algorithm. The new approach has proved to be efficient in practice (see tables 1 and 2).
For further study, it would be interesting to investigate the price of artificially adding 2^p-th roots of unity, as in Schönhage and Strassen's algorithm. In practice, we notice that it is often possible, and better, to "cut the coefficients into pieces" and to replace them by polynomials over the complexified doubles. However, this approach requires more implementation effort.
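The idea of cutting coefficients into pieces can be illustrated with integer coefficients: each integer is replaced by a polynomial in a small base, the polynomials are multiplied, and the base is substituted back. In the sketch below (our own illustration) the limb polynomials are multiplied by schoolbook convolution; in practice one would multiply them with an FFT over the complexified doubles, as suggested above.

```python
def to_limbs(x, base, count):
    """Cut a non-negative integer into 'count' digits in the given base."""
    limbs = []
    for _ in range(count):
        x, r = divmod(x, base)
        limbs.append(r)
    return limbs

def from_limbs(limbs, base):
    """Evaluate the limb polynomial at the base (carry propagation)."""
    x = 0
    for l in reversed(limbs):
        x = x * base + l
    return x

def mul_via_limbs(a, b, base=1 << 16):
    """Multiply integers < base^4 by multiplying their limb polynomials.
    Schoolbook convolution stands in for an FFT over complex doubles."""
    la = to_limbs(a, base, 4)
    lb = to_limbs(b, base, 4)
    conv = [0] * 7                     # product polynomial, degree < 7
    for i, x in enumerate(la):
        for j, y in enumerate(lb):
            conv[i + j] += x * y
    return from_limbs(conv, base)

a, b = 123456789123456789, 987654321987654321
assert mul_via_limbs(a, b) == a * b
```

With floating point transforms, the base must be chosen small enough for the convolution sums to stay within the exact range of a double, which is the extra implementation effort alluded to above.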
A. Bostan, F. Chyzak, F. Ollivier, B. Salvy, É. Schost, and A. Sedoglavic. Fast computation of power series solutions of systems of differential equations. Preprint, April 2006. Submitted, 13 pages.
R.P. Brent and H.T. Kung. Fast algorithms for manipulating formal power series. Journal of the ACM, 25:581–595, 1978.
D.G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.
S.A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.
J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Computat., 19:297–301, 1965.
Guillaume Hanrot, Michel Quercia, and Paul Zimmermann. The middle product algorithm, I. Speeding up the division and square root of power series. AAECC, 14(6):415–438, 2004.
Guillaume Hanrot and Paul Zimmermann. A long note on Mulders' short product. Research Report 4654, INRIA, December 2002. Available from http://www.loria.fr/~hanrot/Papers/mulders.ps.
D.E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, 3rd edition, 1997.
A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
Alexandre Sedoglavic. Méthodes seminumériques en algèbre différentielle ; applications à l'étude des propriétés structurelles de systèmes différentiels algébriques en automatique. PhD thesis, École polytechnique, 2001.
A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281–292, 1971.
A.L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.
J. van der Hoeven. Lazy multiplication of formal power series. In W. W. Küchlin, editor, Proc. ISSAC '97, pages 17–20, Maui, Hawaii, July 1997.
J. van der Hoeven. Relax, but don't be too lazy. JSC, 34:479–542, 2002.
J. van der Hoeven et al. Mmxlib: the standard library for Mathemagix, 2002. http://www.mathemagix.org/mml.html.
J. van der Hoeven. New algorithms for relaxed multiplication. Technical Report 2003-44, Université Paris-Sud, Orsay, France, 2003.
J. van der Hoeven. Relaxed multiplication using the middle product. In Manuel Bronstein, editor, Proc. ISSAC '03, pages 143–147, Philadelphia, USA, August 2003.
J. van der Hoeven. Newton's method and FFT trading. Technical Report 2006-17, Univ. Paris-Sud, 2006. Submitted to JSC.
J. van der Hoeven. On effective analytic continuation. Technical Report 2006-15, Univ. Paris-Sud, 2006.