New algorithms for relaxed multiplication

Joris van der Hoeven
Dépt. de Mathématiques (Bât. 425)
CNRS, Université Paris-Sud
91405 Orsay Cedex, France

Abstract. In previous work, we have introduced the technique of relaxed power series computations. With this technique, it is possible to solve implicit equations almost as quickly as doing the operations which occur in the implicit equation. Here "almost as quickly" means that we need to pay a logarithmic overhead. In this paper, we will show how to reduce this logarithmic factor in the case when the constant ring has sufficiently many $2^p$-th roots of unity.

1. Introduction

Let $\mathbb{C}$ be an effective ring and consider two power series $f = f_0 + f_1 z + \cdots$ and $g = g_0 + g_1 z + \cdots$ in $\mathbb{C}[[z]]$. In this paper we will be concerned with the efficient computation of the first $n$ coefficients of the product $h = f g = h_0 + h_1 z + \cdots$. If the first $n$ coefficients of $f$ and $g$ are known beforehand, then we may use any fast multiplication algorithm for polynomials in order to achieve this goal, such as divide and conquer multiplication [KO63, Knu97], which has a time complexity $O(n^{\log 3/\log 2})$, or FFT multiplication [CT65, SS71, CK91], which has a time complexity $O(n \log n \log\log n)$.

For certain computations, and most importantly the resolution of implicit equations, it is interesting to use so called "relaxed algorithms" [vdH97, vdH02], which output the first $n$ coefficients of $h$ as soon as the first $n$ coefficients of $f$ and $g$ are known, for each $n$. This allows for instance the computation of the exponential $g = \exp f$ of a series $f$ with $f_0 = 0$ using the formula

  $g' = f'\, g$.  (1)

More precisely, this formula shows that the computation of $g$ reduces to one differentiation, one relaxed product and one relaxed integration. Differentiation and relaxed integration being linear in time, it follows that $n$ terms of $g$ can be computed in time $O(R(n))$, where $R(n)$ denotes the time complexity of relaxed multiplication.
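To make the relaxed point of view concrete, here is a minimal Python sketch (our illustration, not code from the paper's implementation) which computes $n$ terms of $g = \exp f$ through formula (1), using a naive quadratic relaxed product for the right hand side:

```python
def relaxed_exp(f, n):
    """Compute g_0, ..., g_{n-1} of g = exp(f), assuming f[0] == 0.
    Extracting the coefficient of z^i in g' = f' g gives
        (i+1) g_{i+1} = sum_{j=0..i} (j+1) f_{j+1} g_{i-j},
    which only involves g_0, ..., g_i: the product f' g is consumed
    in a relaxed fashion, so g can be produced term by term.  The
    inner sum is a naive O(n^2) relaxed multiplication; the point of
    the paper is to replace it by an almost linear-time one."""
    assert f[0] == 0 and len(f) >= n
    g = [1]                                  # g_0 = exp(0) = 1
    for i in range(n - 1):
        s = sum((j + 1) * f[j + 1] * g[i - j] for j in range(i + 1))
        g.append(s / (i + 1))
    return g

# Example: f = z gives the coefficients 1/i! of exp(z):
# relaxed_exp([0, 1, 0, 0, 0], 5) == [1, 1.0, 0.5, 1/6, 1/24]
```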
In [vdH02], we proved the following theorem:

Theorem 1. There exists a relaxed multiplication algorithm of time complexity

  $R(n) = O(M(n) \log n)$

and space complexity $O(n)$.

In this paper, we will improve the time complexity bound in this theorem in the case when $\mathbb{C}$ admits $2^p$-th roots of unity for any $p \in \mathbb{N}$. In section 2, we first reduce this problem to the case of "semi-relaxed multiplication", when one of the arguments is fixed and the other one relaxed. More precisely, let $f$ and $g$ be power series, such that $g$ is known up to order $n$. Then a semi-relaxed multiplication algorithm computes the product $f g$ up to order $n$ and outputs $(f g)_i$ as soon as $f_0, \ldots, f_i$ are known, for all $i < n$. In section 3, we show that the overhead in theorem 1 can be reduced to $O(\log^{\log 3/\log 2 - 1} n)$. In section 4, the technique of section 3 is further improved, so as to yield an $O(e^{2\sqrt{\log 2\,\log\log n}})$ overhead.

In the sequel, we will use the following notations from [vdH02]: we denote by $\mathbb{C}[[z]]_n \cong \mathbb{C}[z]_n \subseteq \mathbb{C}[[z]]$ the set of truncated power series at order $n$, like $f_{;n} = f_0 + \cdots + f_{n-1} z^{n-1}$. Given $f \in \mathbb{C}[[z]]$ and $0 \le i \le j \le n$, we will denote $f_{i;j} = f_i + \cdots + f_{j-1} z^{j-i-1} \in \mathbb{C}[[z]]_{j-i}$.

Remark 1. A preprint [vdH03a] of the present paper was published a few years ago. The current version includes a new section with implementation details, benchmarks and a few notes on how to apply similar ideas in the Karatsuba and Toom-Cook models. Another algorithm for semi-relaxed multiplication, based on the middle product [HQZ04], was published in the meantime [vdH03b].

Remark 2. The exotic form $O(e^{2\sqrt{\log 2\,\log\log n}})$ of the new complexity for relaxed multiplication might surprise the reader. It should be noticed that the time complexity of Toom-Cook's algorithm for polynomial multiplication has a similar form $O(n\, e^{2\sqrt{\log 2\,\log n}})$ [Too63, Coo66, Knu97]. Indeed, whereas our algorithm from section 3 has a Karatsuba-like flavour, the algorithm from section 4 uses a generalized subdivision which is similar to the one used by Toom and Cook. An interesting question is whether even better time complexities can be obtained (in analogy with FFT-multiplication). However, we have not managed so far to reduce the cost of relaxed multiplication to $O(M(n) \log\log n)$ or $O(M(n))$. Nevertheless, it should be noticed that the function $e^{2\sqrt{\log 2\,\log\log n}}$ grows very slowly; in practice, it very much behaves like a constant (see section 5).

Remark 3. The reader may wonder whether further improvements in the complexity of relaxed multiplication are really useful, since the algorithms from [vdH97, vdH02] are already optimal up to a factor $O(\log n)$. In fact, we expect fast algorithms for formal power series to be one of the building bricks for effective analysis [vdH06b]. Therefore, even small improvements in the complexity of relaxed multiplication should lead to global speed-ups for this kind of software.

2. Full and semi-relaxed multiplication

In [vdH97, vdH02], we have stated several fast algorithms for relaxed multiplication. Let us briefly recall some of the main concepts and ideas. For details, we refer to [vdH02]. Throughout this section, $f$ and $g$ are two power series in $\mathbb{C}[[z]]$.

Definition 1. We call

  $P = f_{;n}\, g_{;n}$  (2)

the full product of $f$ and $g$ at order $n$.

Definition 2. We call

  $P = \sum_{i<n} (f g)_i\, z^i$  (3)

the truncated product of $f$ and $g$ at order $n$.

Definition 3. A full (resp. truncated) zealous multiplication algorithm of $f$ and $g$ at order $n$ takes $f_0, \ldots, f_{n-1}$ and $g_0, \ldots, g_{n-1}$ on input and computes $P$ as in (2) (resp. (3)).

Definition 4. A full (resp. truncated) relaxed multiplication algorithm of $f$ and $g$ at order $n$ successively takes the pairs $(f_0, g_0), \ldots, (f_{n-1}, g_{n-1})$ on input and successively computes $P_0, \ldots, P_{2n-2}$ (resp. $P_0, \ldots, P_{n-1}$). Here it is understood that $P_i$ is output as soon as $(f_0, g_0), \ldots, (f_i, g_i)$ are known.

Definition 5. A full (resp. truncated) semi-relaxed multiplication algorithm of $f$ and $g$ takes $g_0, \ldots, g_{n-1}$ and the successive values $f_0, \ldots, f_{n-1}$ on input and successively computes $P_0, \ldots, P_{2n-2}$ (resp. $P_0, \ldots, P_{n-1}$). Here it is understood that $P_i$ is output as soon as $f_0, \ldots, f_i$ are known.

We will denote by $M(n)$, $R(n)$ and $Q(n)$ the time complexities of full zealous, relaxed and semi-relaxed multiplication at order $n$, where it is understood that the ring operations in $\mathbb{C}$ can be performed in time $O(1)$. We notice that full zealous multiplication is equivalent to polynomial multiplication. Hence, classical fast multiplication algorithms can be applied in this case [KO63, Too63, Coo66, CT65, SS71, CK91].
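As an illustration of definition 5, the following naive Python class (ours, for exposition) implements a truncated semi-relaxed multiplication; it runs in time $O(n^2)$ and is exactly the quadratic baseline that the algorithms of sections 3 and 4 improve upon:

```python
class SemiRelaxedProduct:
    """Truncated semi-relaxed multiplication at order n = len(g)
    (definition 5): the argument g is fixed and fully known, the
    coefficients of f arrive one by one, and P_i = (f g)_i is output
    as soon as f_0, ..., f_i are known."""
    def __init__(self, g):
        self.g = g
        self.f = []
    def next(self, fi):
        self.f.append(fi)
        i = len(self.f) - 1
        return sum(self.f[j] * self.g[i - j] for j in range(i + 1))

# p = SemiRelaxedProduct([1, 2, 3, 4])
# [p.next(c) for c in [1, 1, 1, 1]] == [1, 3, 6, 10]
```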
The main idea behind efficient algorithms for relaxed multiplication is to anticipate on future computations. More precisely, the computation of a full product (2) can be represented by an $n \times n$ square with entries $f_i\, g_j$, $0 \le i, j < n$. As soon as $f_0, \ldots, f_i$ and $g_0, \ldots, g_i$ are known, it becomes possible to compute the contributions of the products $f_j\, g_k$ with $j, k \le i$ to $P$, even though the contributions of $f_j\, g_k$ with $j + k > i$ are not yet needed. The next idea is to subdivide the $n \times n$ square into smaller squares, in such a way that the contribution of each small square to $P$ can be computed using a zealous algorithm. Now the contribution of such a small square is of the form $f_{i;i'}\, g_{j;j'}\, z^{i+j}$. Therefore, the requirement $\max(i', j') \le i + j + 1$ suffices to ensure that the resulting algorithm will be relaxed. In the left hand image of figure 1, we have shown the subdivision from the main algorithm of [vdH02], which has time complexity $O(M(n) \log n)$.

Figure 1. Illustration of the facts that (1) a full relaxed $2n \times 2n$ multiplication reduces to one full relaxed $n \times n$ multiplication, two semi-relaxed $n \times n$ multiplications and one zealous $n \times n$ multiplication; (2) a semi-relaxed $2n \times 2n$ multiplication reduces to two semi-relaxed $n \times n$ multiplications and two zealous $n \times n$ multiplications.

There is an alternative interpretation of the left hand image in figure 1: when interpreting the big square as a $2n \times 2n$ multiplication

  $P = f_{;2n}\, g_{;2n}$,

we may regard it as the sum

  $P = P_{0,0} + P_{0,1}\, z^n + P_{1,0}\, z^n + P_{1,1}\, z^{2n}$

of four $n \times n$ multiplications

  $P_{0,0} = f_{;n}\, g_{;n}$
  $P_{0,1} = f_{;n}\, g_{n;2n}$
  $P_{1,0} = f_{n;2n}\, g_{;n}$
  $P_{1,1} = f_{n;2n}\, g_{n;2n}$.

Now $P_{0,0}$ is a relaxed multiplication at order $n$, but $P_{0,1}$ is even semi-relaxed, since $f_0, \ldots, f_{n-1}$ are already known by the time that we need $(P_{0,1})_0$. Similarly, $P_{1,0}$ corresponds to a semi-relaxed product and $P_{1,1}$ to a zealous product. This shows that

  $R(2n) \le R(n) + 2\, Q(n) + M(n)$.

Similarly, we have

  $Q(2n) \le 2\, Q(n) + 2\, M(n)$,

as illustrated in the right-hand image of figure 1. Under suitable regularity hypotheses for $M$ and $Q$, the above relations imply:

Theorem 2.
(a) If $Q(n)/n$ is increasing, then $R(n) = O(Q(n))$.
(b) If $M(n)/n$ is increasing, then $Q(n) = O(M(n) \log n)$.

A consequence of part (a) of the theorem is that it suffices to design fast algorithms for semi-relaxed multiplication in order to obtain fast algorithms for relaxed multiplication. This fact may be reinterpreted by observing that the fast relaxed multiplication algorithm [vdH97, vdH02] actually applies Newton's method in a hidden way. Indeed, since Brent and Kung [BK78], it is well known that Newton's method can also be used in the context of formal power series in order to solve differential or functional equations. One step of Newton's method at order $n$ involves the recursive application of the method at order $\lceil n/2 \rceil$ and the resolution of a linear equation at order $\lfloor n/2 \rfloor$. The resolution of the linear equation corresponds to the computation of the two semi-relaxed products.

3. A new algorithm for fast relaxed multiplication

Assume from now on that $\mathbb{C}$ admits an $n$-th root of unity $\omega_n$ for every power of two $n = 2^p$. Given an element $f \in \mathbb{C}[[z]]_n$, let $\mathrm{FFT}_{\omega_n}(f) \in \mathbb{C}^n$ denote its Fourier transform

  $\mathrm{FFT}_{\omega_n}(f) = (f(1), f(\omega_n), \ldots, f(\omega_n^{n-1}))$

and let $\mathrm{FFT}_{\omega_n}^{-1} : \mathbb{C}^n \to \mathbb{C}[[z]]_n$ be the inverse mapping of $\mathrm{FFT}_{\omega_n}$. It is well known that both $\mathrm{FFT}_{\omega_n}$ and $\mathrm{FFT}_{\omega_n}^{-1}$ can be computed in time $O(n \log n)$. Furthermore, if $f, g \in \mathbb{C}[[z]]_n$ are such that $f g \in \mathbb{C}[[z]]_n$, then

  $f g = \mathrm{FFT}_{\omega_n}^{-1}(\mathrm{FFT}_{\omega_n}(f) \cdot \mathrm{FFT}_{\omega_n}(g))$,

where the product in $\mathbb{C}^n$ is scalar multiplication $(a_0, \ldots, a_{n-1}) \cdot (b_0, \ldots, b_{n-1}) = (a_0 b_0, \ldots, a_{n-1} b_{n-1})$.

Now consider a decomposition $n = n_1 n_2$ with $n_1 = 2^{p_1}$ and $n_2 = 2^{p_2}$. Then a truncated power series $f \in \mathbb{C}[z]_n$ can be rewritten as a series

  $f_{;n_1} + f_{n_1;2n_1}\, y + \cdots + f_{(n_2-1) n_1; n_2 n_1}\, y^{n_2 - 1}$

in $\mathbb{C}[z]_{n_1}[y]_{n_2}$, where $y = z^{n_1}$. This series may again be reinterpreted as a series $\varphi(f) \in \mathbb{C}[z]_{2 n_1}[y]_{n_2}$, and we have

  $f g = \psi(\varphi(f)\, \varphi(g))$,

where $\psi : \mathbb{C}[z]_{2 n_1}[y] \to \mathbb{C}[[z]]$ is the mapping which substitutes $z^{n_1}$ for $y$. Also, the FFT-transform $\mathrm{FFT}_{\omega_{2 n_1}} : \mathbb{C}[z]_{2 n_1} \to \mathbb{C}^{2 n_1}$ may be extended to a mapping

  $\mathrm{FFT}_{\omega_{2 n_1}} : \mathbb{C}[z]_{2 n_1}[y]_k \to \mathbb{C}^{2 n_1}[y]_k$
  $c_0 + \cdots + c_{k-1}\, y^{k-1} \mapsto \mathrm{FFT}_{\omega_{2 n_1}}(c_0) + \cdots + \mathrm{FFT}_{\omega_{2 n_1}}(c_{k-1})\, y^{k-1}$

for each $k$, and similarly for its inverse $\mathrm{FFT}_{\omega_{2 n_1}}^{-1}$. Now the formula

  $f g = \psi(\mathrm{FFT}_{\omega_{2 n_1}}^{-1}(\mathrm{FFT}_{\omega_{2 n_1}}(\varphi(f)) \cdot \mathrm{FFT}_{\omega_{2 n_1}}(\varphi(g))))$

yields a way to compute $f g$ by reusing the Fourier transforms of the "bunches of coefficients" $f_{k n_1;(k+1) n_1}$ and $g_{l n_1;(l+1) n_1}$ many times.

In the context of a semi-relaxed multiplication with fixed argument $g$, the above scheme almost reduces the computation of an $n \times n$ product with coefficients in $\mathbb{C}$ to the computation of an $n_2 \times n_2$ product with coefficients in $\mathbb{C}^{2 n_1}$. The only problem which remains is that we can only compute $\mathrm{FFT}_{\omega_{2 n_1}}(f_{k n_1;(k+1) n_1})$ when $f_{k n_1}, \ldots, f_{(k+1) n_1 - 1}$ are all known. Consequently, the products $f_{k n_1;(k+1) n_1}\, g_{;n_1}$ should be computed apart, using a traditional semi-relaxed multiplication.
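Before doing the bookkeeping precisely, here is a toy Python transcription of this scheme (our illustration: numpy's complex-double FFT plays the role of $\mathrm{FFT}_{\omega_{2 n_1}}$, and the light-region products are done naively instead of by recursive semi-relaxed multiplications):

```python
import numpy as np

def semi_relaxed_product(f_stream, g, n1):
    """Toy version of the scheme of section 3: semi-relaxed product
    h = f g truncated at order n = len(g), with g fixed and the
    coefficients of f arriving one by one (n1 must divide n).

    The bunches g_{l*n1;(l+1)*n1} are FFT-ed once and for all; each
    bunch of f is FFT-ed once, as soon as it is complete, and reused
    against all bunches of g (the dark region of figure 2).  The
    diagonal products f_{k*n1;(k+1)*n1} * g_{;n1} (the light regions)
    are done naively here.  After processing f_i, the entries
    h[0], ..., h[i] are final, which is the semi-relaxed property."""
    n, n2 = len(g), len(g) // n1
    pad = np.zeros(n1)
    G = [np.fft.fft(np.concatenate([g[l * n1:(l + 1) * n1], pad]))
         for l in range(n2)]                  # transforms of g, reused
    h = np.zeros(n, dtype=complex)
    bunch = []                                # current bunch of f
    for i, fi in enumerate(f_stream):         # f_stream yields f_0, f_1, ...
        bunch.append(fi)
        for j in range(min(n1, n - i)):       # light region: f_i * g_{;n1}
            h[i + j] += fi * g[j]
        if len(bunch) == n1:                  # bunch k of f is complete:
            k = i // n1                       # transform it exactly once
            F = np.fft.fft(np.concatenate([bunch, pad]))
            bunch = []
            for l in range(1, n2 - k):        # dark region, offset (k+l)*n1
                block = np.fft.ifft(F * G[l])
                lo = (k + l) * n1
                m = min(2 * n1, n - lo)
                h[lo:lo + m] += block[:m]
        if i == n - 1:
            return h

# Correctness check against a plain convolution:
# f, g = np.random.randn(16), np.random.randn(16)
# np.allclose(semi_relaxed_product(iter(f), g, 4), np.convolve(f, g)[:16])
```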
In other words, we have reduced the computation of a semi-relaxed $n \times n$ product with coefficients in $\mathbb{C}$ to the computation of $n_2$ semi-relaxed $n_1 \times n_1$ products with coefficients in $\mathbb{C}$, one semi-relaxed $n_2 \times (n_2 - 1)$ product with coefficients in $\mathbb{C}^{2 n_1}$ and $3 n_2 - 3$ FFT-transforms of length $2 n_1$. This has been illustrated in figure 2.

Figure 2. New decomposition of a semi-relaxed $n \times n$ multiplication into $n/n_1$ semi-relaxed $n_1 \times n_1$ multiplications (the light regions) and one semi-relaxed $n_2 \times (n_2 - 1)$ multiplication (the dark region) with FFT-ed coefficients in $\mathbb{C}^{2 n_1}$.

In order to obtain an efficient algorithm, we may choose $n_1 = 2^{\lceil p/2 \rceil}$ and $n_2 = 2^{\lfloor p/2 \rfloor}$:

Theorem 3. Assume that $\mathbb{C}$ admits an $n$-th root of unity for each $n = 2^p$. Then there exists a relaxed multiplication algorithm of time complexity $O(n \log^{\log 3/\log 2} n)$ and space complexity $O(n \log n)$.

Proof. In view of section 2, it suffices to consider the case of a semi-relaxed product. Let $T(n)$ denote the time complexity of the above method. Then we observe that

  $T(n) \le n_2\, T(n_1) + 2 n_1\, T(n_2) + O(n_1 n_2 \log n_1)$
  $\phantom{T(n)} \le n_2\, T(n_1) + 2 n_1\, T(n_2) + O(n \log n)$.

Taking $n_1 = 2^{\lceil p/2 \rceil}$, $n_2 = 2^{\lfloor p/2 \rfloor}$ and $U(p) = T(2^p)/2^p$, we obtain

  $U(p) \le U(\lceil p/2 \rceil) + 2\, U(\lfloor p/2 \rfloor) + O(p)$,

from which we deduce that $U(p) = O(p^{\log 3/\log 2})$ and $T(n) = O(n \log^{\log 3/\log 2} n)$. Similarly, the space complexity satisfies the bound

  $S(n) \le S(n_1) + 2 n_1\, S(n_2) + O(n) \le (2 n_1 + 1)\, S(n_1) + O(n)$.

Setting $R(p) = S(2^p)/2^p$, it follows that

  $R(p) \le (2 + 2^{-\lfloor p/2 \rfloor})\, R(\lceil p/2 \rceil) + O(1)$.

Consequently, $R(p) = O(p)$ and $S(n) = O(n \log n)$. □

4. Further improvements of the algorithm

More generally, if $n = n_1 \cdots n_l$ with $n_1 = 2^{p_1}, \ldots, n_l = 2^{p_l}$, then we may reduce the computation of a semi-relaxed $n \times n$ product with coefficients in $\mathbb{C}$ to the computation of

- $n_2 \cdots n_l$ semi-relaxed $n_1 \times n_1$ products over $\mathbb{C}$ of the form $f_{k n_1;(k+1) n_1}\, g_{;n_1}$;
- $3\, n_2 \cdots n_l - 3$ FFT-transforms of length $2 n_1$;
- $n_3 \cdots n_l$ semi-relaxed $n_2 \times (n_2 - 1)$ products over $\mathbb{C}^{2 n_1}$;
- $3\, n_3 \cdots n_l - 3$ FFT-transforms of length $2 n_1 n_2$;
- $n_4 \cdots n_l$ semi-relaxed $n_3 \times (n_3 - 1)$ products over $\mathbb{C}^{2 n_1 n_2}$;
- $\cdots$
- $3\, n_l - 3$ FFT-transforms of length $2 n_1 \cdots n_{l-1}$;
- one semi-relaxed $n_l \times (n_l - 1)$ product over $\mathbb{C}^{2 n_1 \cdots n_{l-1}}$.

This computation is illustrated in figure 3. From the complexity point of view, it leads to the following theorem:

Figure 3. Generalized decomposition of a semi-relaxed $n \times n$ multiplication into $l = 3$ layers.

Theorem 4. Assume that $\mathbb{C}$ admits an $n$-th root of unity for each $n = 2^p$. Then there exists a relaxed multiplication algorithm of time complexity $O(n \log n\, e^{2\sqrt{\log 2\, \log\log n}})$ and space complexity $O(n\, e^{\sqrt{\log 2\, \log\log n}})$.

Proof. In view of theorem 2(a), it suffices to consider the case of a semi-relaxed product. Denoting by $T(n)$ the time complexity of the above method, we have

  $T(n) \le \frac{n}{n_1}\, T(n_1) + \frac{2 n}{n_2}\, T(n_2) + \cdots + \frac{2 n}{n_l}\, T(n_l) + O(l\, n \log n)$.  (4)

Let

  $U(p) = \frac{T(2^p)}{p\, 2^p}$.

Taking $n_1 = \cdots = n_l = 2^p$ in (4), it follows for any $l$ that

  $U(l p) \le 2\, U(p) + O(l)$.  (5)

Applying this relation $k$ times, we obtain

  $U(l^k) \le 2^k\, U(1) + O(2^k l) = O(2^k l)$.  (6)

For a fixed $l$ such that $k = \log p/\log l$ is an integer, we obtain

  $U(p) = O(2^{\log p/\log l}\, l)$.  (7)

The minimum of $2^{\log p/\log l}\, l$ is reached when its derivative cancels. This happens for

  $l = e^{\sqrt{\log 2\, \log p}}$.

Plugging this value into (7), we obtain

  $U(p) = O(e^{2\sqrt{\log 2\, \log p}})$.

Substitution of $p = \log n/\log 2$ finally gives the desired estimate

  $T(n) = O(n \log n\, e^{2\sqrt{\log 2\, \log\log n}})$.  (8)

In order to be painstakingly correct, we notice that we really proved (7) for $p$ of the form $l^{\lceil \log p/\log l \rceil}$ and (8) for $l$ of the form $\lceil e^{\sqrt{\log 2\, \log p}} \rceil$. Of course, we may always replace $p$ and $l$ by larger values which do have this form. Since these replacements only introduce additional constant factors in the complexity bounds, the bound (8) holds for general $n$.
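As a quick sanity check of the optimization step above (a routine verification which we add here; it is not part of the original argument), write $u = \log l$:

  $\log\bigl(2^{\log p/\log l}\, l\bigr) = \frac{\log 2\, \log p}{\log l} + \log l = \frac{\log 2\, \log p}{u} + u$.

The right hand side is minimal when $u^2 = \log 2\, \log p$, i.e. for $l = e^{\sqrt{\log 2\, \log p}}$; both terms are then equal to $\sqrt{\log 2\, \log p}$, so the minimal value of $2^{\log p/\log l}\, l$ is indeed $e^{2\sqrt{\log 2\, \log p}}$.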
As to the space complexity $S(n)$, we have

  $S(n) \le S(n_1) + 2 n_1\, S(n_2) + \cdots + 2 n_1 \cdots n_{l-1}\, S(n_l) + O(n)$.

Let

  $R(p) = \frac{S(2^p)}{2^p}$.

Taking $n_1 = \cdots = n_l = 2^p$, it follows for any $l$ that

  $R(l p) \le (2 + C/2^p)\, R(p) + O(1)$,

for some fixed constant $C$. Applying this bound $k$ times, we obtain

  $R(l^k) \le \left(2 + \frac{C}{2}\right) \left(2 + \frac{C}{2^l}\right) \cdots \left(2 + \frac{C}{2^{l^{k-1}}}\right) (R(1) + O(1))$.

For $l \ge 2$, the product $\prod_{j \ge 0} (1 + C\, 2^{-l^j - 1})$ converges, so this bound simplifies to

  $R(l^k) = O(2^k)$.

Taking $k$ and $l = e^{\sqrt{\log 2\, \log p}}$ as above, it follows that

  $R(p) = O(2^{\log p/\log l}) = O(e^{\sqrt{\log 2\, \log p}})$.

Substitution of $p = \log n/\log 2$ finally gives us the desired estimate

  $S(n) = O(n\, e^{\sqrt{\log 2\, \log\log n}})$

for the space complexity. For similar reasons as above, the bound holds for general $n$. □

5. Implementation details and benchmarks

We implemented the algorithm from section 3 in the Mmxlib library [vdH02b]. Instead of taking $n_1 \approx n_2 \approx \sqrt{n}$, we took $n_2$ small (with $n_2 \in \{4, 8, 16, 32\}$ over the FFT range we considered), and used a naive multiplication algorithm on the FFT-ed blocks. The reason behind this change is that $n$ needs to be reasonably large in order to profit from the better asymptotic complexity of relaxed multiplication. In practice, the optimal choice of $(n_1, n_2)$ is obtained by taking $n_2$ quite small. Moreover, our implementation uses a truncated version of relaxed multiplication [vdH02, section 4.4.2]. In particular, the use of naive multiplication on the FFT-ed blocks allows us to gain a factor of two at the top-level.

For small values of $n_1$, we also replaced FFT transforms by "Karatsuba transforms": given a polynomial $f = f_0 + \cdots + f_{2^k - 1} Z^{2^k - 1}$, we may form a polynomial $\tilde{f}(Z_1, \ldots, Z_k)$ in $k$ variables with coefficients $\tilde{f}_{i_1, \ldots, i_k} = f_{i_1 + 2 i_2 + \cdots + 2^{k-1} i_k}$ for $i_1, \ldots, i_k \in \{0, 1\}$. Then the Karatsuba transform of $f$ is the vector $(\tilde{f}(z_1, \ldots, z_k))_{z_1, \ldots, z_k \in \{0, 1, \infty\}}$ of size $3^k$, where evaluation at infinity picks the leading coefficient: $(a + b Z_i)(\infty) = b$.
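For concreteness, a possible recursive implementation of this transform (our sketch; the actual Mmxlib code is dedicated C++) splits off the variable $Z_1$, which carries the lowest bit of the index:

```python
def karatsuba_transform(f):
    """Karatsuba transform of a polynomial with len(f) = 2^k
    coefficients: the vector of the 3^k values of the associated
    k-variate polynomial at all points of {0, 1, oo}^k, where the
    'value' at oo of a + b*Z is b.  Writing f = a + b*Z1, with a
    (resp. b) the even (resp. odd) indexed coefficients, the values
    at Z1 = 0, 1, oo are a, a + b and b, transformed recursively."""
    if len(f) == 1:
        return [f[0]]
    a, b = f[0::2], f[1::2]
    s = [x + y for x, y in zip(a, b)]
    return (karatsuba_transform(a)      # Z1 = 0
            + karatsuba_transform(s)    # Z1 = 1
            + karatsuba_transform(b))   # Z1 = oo

# karatsuba_transform([f0, f1]) == [f0, f0 + f1, f1].  Multiplying two
# transforms pointwise gives the transform of the product (which has
# degree <= 2 in each variable, hence is determined by its values at
# the 3 points per variable); recovering its coefficients amounts to
# interpolation at {0, 1, oo} in each variable.
```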
We tested both (truncated) relaxed and semi-relaxed multiplication for different types of coefficients on an Intel Xeon based machine. The results of our benchmarks can be found in tables 1 and 2 below. Our benchmarks start at the order where FFT multiplication becomes useful. Notice that working with orders which are powers of two does not give us any significant advantage, because the top-level product on FFT-ed blocks is naive. In table 1, the choice of $n_1$ as a function of $n$ has been optimized for complex double coefficients. No particular optimization effort was made for the coefficient types in table 2, so it might be possible to gain a bit more on those timings.

Table 1. Timings in seconds for the computation of $n$ terms of the exponential of a given series using complex double coefficients. We computed the exponential using both a semi-relaxed and a relaxed product, corresponding to $Q(n)$ and $R(n)$. We also considered the ratios with the timings $M(n)$ for a full FFT-product of two polynomials of degree $< n$.

Table 2. Ratios $Q(n)/M(n)$ and $R(n)/M(n)$ for the computation of $n$ terms of the exponential of a given series using different types of coefficients. In the first two columns, we use $\mathbb{F}_p$ as our ground field, with $p = 3 \cdot 2^{30} + 1$. In the last two columns, we compute with $256$ bit complex floats from the Mpfr library.

Remark 4. It is instructive to compare the efficiencies of relaxed evaluation and Newton's method. For instance, the exponentiation algorithm from [BK78] has a time complexity $\sim 4\, M(n)$. Although this is better from an asymptotic point of view, the ratio $R(n)/M(n)$ rarely reaches $4$ in our tables. Consequently, relaxed algorithms are often better. A similar phenomenon was already observed in [vdH02, figures 4 and 5]. It would be interesting to pursue the comparisons in view of some recent advances concerning Newton's method [BCO+06]; see also [vdH06a, section 5.2.1].

Remark 5. Although the emphasis of this paper is on asymptotic complexity, the idea behind the new algorithms also applies in the Karatsuba and Toom-Cook models. In the latter case, we take $n_1$ small (typically $n_1 \in \{2, 3, 4\}$) and use evaluation (resp. interpolation) for polynomials of degree $n_1 - 1$ (resp. $2 n_1 - 2$) at $2 n_1 - 1$ points. From an asymptotic point of view, this yields $R(n) \sim M(n)$ for relaxed multiplication. Moreover, the approach naturally combines with the generalization of pair/odd decompositions [HZ02], which also yields an optimal bound for truncated multiplications. In fact, we notice that truncated pair/odd Karatsuba multiplication is "essentially relaxed" [vdH02]. On the negative side, these theoretically fast algorithms have bad space complexities and they are difficult to implement. In order to obtain good timings, it seems to be necessary to use dedicated code generation at different (ranges of) orders [Sed01], which can be done using the C++ template mechanism. The current implementation in Mmxlib does not achieve the theoretical time complexity by far, because the recursive function calls suffer from too much overhead.

6. Conclusion

We have shown how to improve the complexity of relaxed multiplication in the case when the coefficient ring admits sufficiently many $2^p$-th roots of unity. The improvement is based on reusing FFT-transforms of pieces of the multiplicands at different levels of the underlying binary splitting algorithm. The new approach has proved to be efficient in practice (see tables 1 and 2). For further studies, it would be interesting to study the price of artificially adding $2^p$-th roots of unity, as in Schönhage-Strassen's algorithm. In practice, we notice that it is often possible, and better, to "cut the coefficients into pieces" and to replace them by polynomials over the complexified doubles $\mathbb{C}$ or over $\mathbb{F}_p$ with $p = 3 \cdot 2^{30} + 1$. However, this approach requires more implementation effort.

Acknowledgement. We would like to thank the third referee for his detailed comments on the proof of theorem 4, which also resulted in slightly sharper bounds.
Bibliography

[BCO+06] A. Bostan, F. Chyzak, F. Ollivier, B. Salvy, É. Schost, and A. Sedoglavic. Fast computation of power series solutions of systems of differential equations. Preprint, April 2006. Submitted, 13 pages.

[BK78] R. P. Brent and H. T. Kung. Fast algorithms for manipulating formal power series. Journal of the ACM, 25:581–595, 1978.

[CK91] D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693–701, 1991.

[Coo66] S. A. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966.

[CT65] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301, 1965.

[HQZ04] G. Hanrot, M. Quercia, and P. Zimmermann. The middle product algorithm I. Speeding up the division and square root of power series. Applicable Algebra in Engineering, Communication and Computing, 14(6):415–438, 2004.

[HZ02] G. Hanrot and P. Zimmermann. A long note on Mulders' short product. Research Report 4654, INRIA, December 2002.

[Knu97] D. E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, 3rd edition, 1997.

[KO63] A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.

[Sed01] A. Sedoglavic. Méthodes seminumériques en algèbre différentielle ; applications à l'étude des propriétés structurelles de systèmes différentiels algébriques en automatique. PhD thesis, École polytechnique, 2001.

[SS71] A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.

[Too63] A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics Doklady, 4(2):714–716, 1963.

[vdH97] J. van der Hoeven. Lazy multiplication of formal power series. In W. W. Küchlin, editor, Proc. ISSAC '97, pages 17–20, Maui, Hawaii, July 1997.

[vdH02] J. van der Hoeven. Relax, but don't be too lazy. Journal of Symbolic Computation, 34:479–542, 2002.

[vdH02b] J. van der Hoeven et al. Mmxlib: the standard library for Mathemagix, 2002.

[vdH03a] J. van der Hoeven. New algorithms for relaxed multiplication. Technical Report 2003-44, Université Paris-Sud, Orsay, France, 2003.

[vdH03b] J. van der Hoeven. Relaxed multiplication using the middle product. In M. Bronstein, editor, Proc. ISSAC '03, pages 143–147, Philadelphia, USA, August 2003.

[vdH06a] J. van der Hoeven. Newton's method and FFT trading. Technical Report 2006-17, Université Paris-Sud, 2006. Submitted to the Journal of Symbolic Computation.

[vdH06b] J. van der Hoeven. On effective analytic continuation. Technical Report 2006-15, Université Paris-Sud, 2006.