<\body> ||<\author-address> de Mathématiques ( 425) CNRS, Université Paris-Sud 91405 Orsay Cedex France Email: >||>|-multiplication, multivariate polynomials, multivariate power series.>|> <\abstract> In a previous paper , we introduced a truncated version of the classical Fast Fourier Transform. When applied to polynomial multiplication, this algorithm has the nice property of eliminating the ``jumps'' in the complexity at powers of two. When applied to the multiplication of multivariate polynomials or truncated multivariate power series, a non-trivial asymptotic factor was gained with respect to the best previously known algorithms. In the present note, we correct two errors which slipped into the previous paper and we give a new application to the multiplication of polynomials with real coefficients. We also give some further hints on how to implement the TFT in practice. >>>>>>>>>> Let \1/2> be an effective ring of constants (i.e. the usual arithmetic operations >, > and >> can be carried out by algorithm). If > has a primitive -th root of unity with >, then the product of two polynomials \> with n> can be computed in time > using the Fast Fourier Transform or . If > does not admit a primitive -th root of unity, then one needs an additional overhead of > in order to carry out the multiplication, by artificially adding new roots of unity . Besides the fact that the asymptotic complexity of the involves a large constant factor, another classical drawback is that the complexity function admits important jumps at each power of two. These jumps can be reduced by using |)>>-th roots of unity for small . They can also be smoothed by decomposing |)>\|)>>-multiplications as n>-, \>- and |)>\\>-multiplications. However, these tricks are not very elegant, cumbersome to implement, and they do not allow one to completely eliminate the jump problem. The jump phenomenon becomes even more important for -dimensional FFTs, since the complexity is multiplied by > whenever the degree traverses a power of two. In , the author introduced a new kind of ``Truncated Fourier Transform'' (TFT), which allows for the fast evaluation of a polynomial \> in any number of well-chosen roots of unity. This algorithm coincides with the usual if is a power of two, but it behaves smoothly for intermediate values. Moreover, the inverse TFT can be carried out with the same complexity and the approach generalizes to higher dimensions. Unfortunately, two errors slipped into the final version of : in the multivariate TFT, we forgot certain crossings. As a consequence, the complexity bounds for multiplying polynomials (and power series) in variables of total degree > n> only hold when >. Moreover, the inverse TFT does not generalize to arbitrary ``unions of intervals''. The present note has several purposes: correcting the above mistakes in , providing further details on how to implement the FFT and TFT in a sufficiently generic way (and suitable for univariate as well as multivariate computation) and presenting a new extension to the case of TFTs with real coefficients. Furthermore, a generic implementation of the FFT and TFT is in progress as part of the standard C++ library of . We will present a few experimental results, together with suggestions for future improvements. More details will be provided in a forthcoming paper and we refer to for further references. Let > be an effective ring of constants, > with \> and \\> a primitive -th root of unity ( =\1>).
The discrete Fast Fourier Transform () of an -tuple ,\,a|)>\\> (with respect to >) is the -tuple ,\,|)>=FFT>\\> with <\equation*> =a*\. In other words, =A|)>>, where \> denotes the polynomial +a*X+\+a*X>. We also say that ,\,|)>> is the FFT > of . The can be computed efficiently using binary splitting: writing <\equation*> ,\,a|)>=,c,\,b,c|)>, we recursively compute the Fourier transforms of ,\,b|)>> and ,\,c|)>> <\eqnarray*> >,\,b|)>>||,\,|)>;>>|>,\,c|)>>||,\,|)>.>>>> Then we have <\eqnarray*> >,\,a|)>>||+,\,+*\>>|||-,\,-*\).>>>> This algorithm requires n> multiplications with powers of > and additions (or subtractions). In practice, it is most efficient to implement an in-place variant of the above algorithm. We will denote by > the bitwise mirror of at length (for instance, =24> and =26>). At step , we start with the vector <\equation*> x=,\,x|)>=,\,a|)>. At step ,p|}>>, we set <\equation> +j>>>|*m+j>>>>>>=|*m>>>||\*m>>>>>>+j>>>|*m+j>>>>>>. for all ,n/m-2|}>> and ,m-1|}>>, where =2>. Using induction over , it can easily be seen that <\equation*> x+j>=>>,a+j>,\,a+j>|)>|)>>, for all ,n/m-1|}>> and ,m-1|}>>. In particular, <\eqnarray*> >||>>>|>||>>>>> for all ,n-1|}>>. This algorithm of ``repeated crossings'' is illustrated in figure . <\big-figure|> Schematic representation of a Fast Fourier Transform for . The black dots correspond to the >, the upper row being ,\,x|)>=,\,a|)>> and the lower row ,\,x|)>=,,,,\,|)>>. A classical application of the is the multiplication of polynomials +\+a*X> and +\+b*X>. Assuming that n>, we first evaluate and in ,\,\> using the : <\eqnarray*> ,\,A|)>|)>>||>,\,a|)>>>|,\,B|)>|)>>||>,\,b|)>>>>> We next compute the evaluations *B,\,A|)>*B|)>|)>)> of at ,\>. We finally have to recover from these values using the inverse . But the inverse with respect to > is nothing else as times the direct with respect to >. Indeed, for all ,\,a|)>\\> and all ,n-1|}>>, we have <\equation> FFT1>>>|)>=a*\*j>=n*a, since <\equation*> \*j>=0 whenever k>. This yields a multiplication algorithm of time complexity > in >, when assuming that > admits enough primitive >-th roots of unity. In the case that > does not, then new roots of unity can be added artificially so as to yield an algorithm of time complexity >. Given a multivariate polynomial K,\,X|]>> with > A\n=2>> for all , we may also consider its FFT with respect to each of its variables: <\equation*> ,\,i>=A>>,\,\>>|)>, where >> is an >-th root of unity for each . Usually the >> are chosen such that =\>, =\>, =\>, etc. The multivariate FFT is simply computed by taking the univariate FFT with respect to each variable. Instead of using multi-indices ,\,i|)>> for multivariate FFTs, it is more efficient to encode and > using a single array of size *\*n=2>. This can be done in several ways. Assume that we have ordered \\\n>. The of the multi-index,\,i|)>> is given by <\equation*> lex,\,n>,\,i|)>=i+i*n+i*n*n+\+i*n*\*n. The of,\,i|)>> is recursively defined by <\equation*> sim,\,n>,\,i|)>=hi|)>*2+\+hi|)>*2+sim/2,\,n>/2>|)>,\,lo>|)>|)>, where |)>=i div 2-1>>, |)>=i mod 2-1>> and > is maximal with >\4>. For each p>, we denote by > the maximal number with >\2>. Using one of these encodings, one may use the above in-place algorithm for the computation of the multivariate FFT, when replacing the crossing relation by <\equation> +j>>>|*m+j>>>>>>=|\*m>>>||\*m>>>>>>+j>>>|*m+j>>>>>>. 
Here <\equation*> \,\,\|)>=,\,\>,\,\,\>,\,\,\,\>|)> in case of the lexicographical encoding and <\equation*> \,\,\|)>=,|d\>,\,\,|d\>,\,\,\>,|d-1>\>,\>|)> in case of the simultaneous encoding. Translated back into terms of monomials, crossings at stage involve monomials and >, where <\equation*> ,\,Y|)>=/2>,\,X,X/2>,\,X,\,X/2>,\,X|)> in case of the lexicographical encoding and <\equation*> ,\,Y|)>=>>/2>,\,X/2>,X>>/4>,\,X>/4>,\,X-1>>,\,X|)> in case of the simultaneous encoding. A challenge for concrete implementations of the FFT in computer algebra systems is to achieve both optimal speed and genericity. Also, we would like to allow for univariate as well as multivariate FFTs. A first attempt in this direction is now part of , the standard C++ library of . The genericity is obtained using templates. Let us comment some aspects of our implementation. There exist at least three important models for the computation with roots of unity during the FFT (see also the explanations and discussion at the end of this section): <\enumerate> The Schönhage-Strassen model. The nice prime number model. The floating-point model. In our implementation, we have created a special template class C\> for roots of unity in , which may be specialized to any of the above models. The class C\> essentially implements methods for multiplication and multiplication with elements of. All necessary data for an FFT of length > are stored in a class C\>. This comprises the roots of unity \,\,\> for the cross relations() and precomputations of ,\,\-1>>, where=2>> is the highest order among one of the \>. These data will be called the . Of course, the precomputation of ,\,\-1>> requires some additional space. Nevertheless, this does not harm if > is significantly smaller than (say \n/4>) or when the encoding of roots of unity requires significantly less space than storing elements of (for instance, in the Schönhage-Strassen model, it suffices to store the shifts). In the remaining case, some space may be saved by precomputing only ,\,\-4>> and by computing the remaining values only on demand during the two last steps. It should also be noticed that we really precompute the array <\equation> \>>,\>>,\>>,\>>,\>>,\>>,\,\-2|]>>>,\-2|]>>> There are several reasons for doing so. First of all, we really need the binary mirrored power>> in the cross relation (). Secondly, >>=-\>>> is easily deduced from>>>. Finally, precomputation of the >>> is useful for the inverse transform. The core of the FFT is an efficient implementation of one FFT step, carrying out the cross relations () for +j\,n-1|}>> and fixed . More generally, for the purpose of preservation of locality (see below) and the TFT (see next sections), it is interesting to allow for ranges +j\,\,*2-1|}>> with p> and ,2-1|}>>. This key step should be massively inlined and loop-unrolled. For special types and C\>, the operations in and C\> may be written in assembler. Notice that the \*m>> involved in the cross-relations correspond to a subarray of(). For very large , the cross relations () typically involve data which are stored at locations which differ by a multiple of the cache size. Consequently, the FFT may give rise to a lot of cache misses and swapping. One remedy to this problem is to divide the FFT in two (or more) parts. The first part regroups the first |p/2|\>> steps. For each ,n-1|}>>, we collect the data ,a,\,a> with > in an array stored in contiguous, and perform the FFT on this array. 
After putting the results back at the original places, we proceed as usual for the remaining |p/2|\>> steps. At present, this strategy has not yet been implemented in , since we noticed no substantial slow-down due to cache misses on our computer. However, when memory fills up, we did observe significant swapping, so the above trick might be used in order to obtain good performance even in critical cases. Also, we have currently only tested our implementation with coefficients modulo 2+1>. It may be that the cost of cache misses is negligible compared to the cost of multiplications and divisions modulo this number. In the Schönhage-Strassen model, we recall that multiplications with roots of unity simply correspond to shifts, which leads to almost optimal speed. However, the inner FFT multiplication step usually becomes expensive in this model. Indeed, in order to obtain a large number of roots of unity, one has to allow for large constants in . Nevertheless, given a number -1> which fits into a machine word, it should be noticed that Schönhage-Strassen's method still provides -th roots of unity. Therefore, the inner multiplication step is efficient for certain higher dimensional FFTs. Moreover, in this case, one may exploit the fact that multiplications modulo -1> are fast. On the other hand, the nice prime number and floating-point models admit many roots of unity, but genuine multiplications (and divisions) have to be carried out during the FFT step. In the nice prime number model, one computes modulo a prime number like 2+1>, which has many >-th roots of unity, yet still fits into a single machine word. One may also use several such prime numbers, in combination with Chinese remaindering. In the floating-point model, one uses floating point approximations of complex numbers. This is particularly useful if fast floating point arithmetic is available, but one always has to carefully deal with rounding errors. In practice, it follows that the choice of an optimal model for the FFT depends very much on the application. When using the FFT for multiplication, a general rule of thumb is to balance the costs of the FFT-step and the inner multiplication step. If the inner multiplication step is carried out just once, then Schönhage-Strassen's model is usually optimal. This is for instance the case when multiplying two large integers. If one FFT corresponds to several inner multiplications, as in the case of matrix multiplication with large integer coefficients, one should opt for the nice prime or floating-point model, depending on the efficiency of division, floating-point operations and the available amount of space. In between, it may sometimes be useful to use Schönhage-Strassen's method with cube roots (instead of square roots) of the number of digits (or degree) . Similarly, one may force an extra iteration (i.e. immediately use fourth roots, but lose an additional factor). Finally, in the case of multivariate FFTs (we consider that large integer coefficients account for an additional dimension), one may artificially generate additional roots of unity. For instance, if > has a >-th root of unity >, then one may multiply bivariate polynomials in />-1,y>-1|)>>, by using as a >-th root of > during the FFT . Let >, ,n|}>> and let \\> be a primitive -th root of unity. The Truncated Fourier Transform (TFT) from 3> takes a tuple ,\,a|)>> on input and produces ,\,|)>> with =>> for all .
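For concreteness, the map computed by the TFT admits a direct quadratic-time specification, which is convenient as a reference when testing a fast implementation on small sizes. The following C++ sketch only illustrates this specification; the coefficient type C is assumed to provide the ring operations and construction from small integers, and the function name is ours, not that of the actual library.

<\cpp-code>
#include <cstddef>
#include <vector>

// Quadratic-time reference specification of the univariate TFT: given
// a_0, ..., a_{n-1} with n = 2^p and l <= n, return the l values
// A(omega^{[i]_p}) for i = 0, ..., l-1, where [i]_p denotes the p-bit
// mirror of i and A(X) = a_0 + a_1 X + ... + a_{n-1} X^{n-1}.
template <typename C>
std::vector<C> tft_reference(const std::vector<C>& a, const C& omega,
                             std::size_t p, std::size_t l) {
  const std::size_t n = std::size_t(1) << p;
  std::vector<C> result(l, C(0));
  for (std::size_t i = 0; i < l; ++i) {
    std::size_t rev = 0;                          // rev = [i]_p
    for (std::size_t k = 0; k < p; ++k)
      rev |= ((i >> k) & 1) << (p - 1 - k);
    C point(1);                                   // point = omega^rev
    for (std::size_t k = 0; k < rev; ++k) point = point * omega;
    C value(0);                                   // Horner evaluation of A at point
    for (std::size_t k = n; k-- > 0; )
      value = value * point + (k < a.size() ? a[k] : C(0));
    result[i] = value;
  }
  return result;
}
</cpp-code>

The actual TFT obtains the same values much faster, by pruning the computation graph of the full FFT, as described below.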
More generally, if > is a subset of ,n-1|}>> and \\\>, then we define =TFT,\>> by <\eqnarray*> :\>|>|>>||>|=>=\>a*\>>>>> Even more generally, one may select a target subset > of ,n-1|}>> which is different from >, and define =TFT\\,\>> by <\eqnarray*> :\>|>|>>||>|=>=\>a*\>>>>> In order to compute the TFT, we simply consider the computation graph of a completeFFT and throw out all computations which do not contribute to the result > (see figures , and ). If the set > is ``sufficiently connected'', then the cost of the computation of > is |\|>*+|\|>|)>*p|)>>. For instance, for =,l-1|}>>, we have proved : <\theorem> Let >, n> and let \\> be a primitive -th root of unity in >. Then the TFT of an -tuple ,\,a|)>> > can be computed using at most additions (or subtractions) and |/2|\>> multiplications with powers of >. \; <\big-figure|> Schematic representation of a for and . <\big-figure|> Schematic representation of a for and =>. <\big-figure|> Schematic representation of an asymmetric for with => and =>. <\section> Inverting the Truncated Fourier Transform In order to use the TFT for the multiplication of numbers, polynomials or power series, we also need an algorithm to compute the inverse transform. In this section, we give such an algorithm in the case when =\> and when > is an initial segment for the (partial) ``bit ordering >>'' on ,n-1|}>>: given numbers +i*2+\+i*2> and +j*2+\+j*2> (with ,j\>), we set j> if \j> for all . We recall that an initial segment of ,n-1|}>> is a set > with \\b\a\b\\>. The inverse TFT is based on the key observation that the cross relation can also be applied to compute ``one diagonal from the other diagonal''. More precisely, given a relation <\eqnarray*> >|>>>>>|||>>||\>>>>>>|>>>>,>>>> where =\> for some , we may clearly compute > as a function of > and vice versa <\eqnarray*> >|>>>>>||1>*|>|1>>|\1>>>>>>>|>>>>,>>>> but we also have <\eqnarray*> |>|>>>>>|||2*\>>||\>>>>>>|>>>>>>|>|>>>>>||1>|>|\1>>|1>>>>>>>|>>>>>>>> Moreover, these relations only involve shifting (multiplication and division by ), additions, subtractions and multiplications by roots of unity. In order to state our in-place algorithm for computing the inverse TFT, we will need some more notations. At the very start, we have :\\\> and |\>\\> on input, where |\>> is the complement of > in =,n-1|}>> and is the zero function. Roughly speaking, the algorithm will replace > by its inverse TFT and by its direct TFT >. In the actual algorithm, the array will be stored in a special way (see the next section) and only a part of the computation of > will really be carried out. Our algorithm proceeds by recursion. In the recursive step, > is replaced by a subset of ,n-1|}>> of the form ,\,*2-1|}>>, where is the current step and ,2-1|}>>. The sets > and |\>> will again be subsets of > with =\\|\>>, and -i*2> is recursively assumed to be an initial segment for the bit ordering. Given a subset > of >, we will denote >=\\\>> and >=\\\>>, where <\eqnarray*> >>||,\,*2-1|}>>>|>>||*2,\,*2-1|}>.>>>> Similarly, given \\>, we will denote by >> and >> the restrictions of to >> >>. We now have the following recursive in-place algorithm for computing the inverse TFT (see figure for an illustration of the different steps). 
<\named-algorithm-old|> :\\\,b:|\>\\,s|)>> <\algorithm-body> If or =\> then return If |\>=\> then apply the partial inverse FFT on > and return For |\>>>, cross > with >> using () >,b>,s+1|)>> For \>\|\>>-m|)>>, cross > with >> using () >,b>,s+1|)>> For \>\>-m|)>>, cross > with >> using () <\big-figure|>||>>|||>|>||>>|||>|>||>>>>>> Schematic representation of the recursive computation of the inverse for and =>. The different images show the progression of the known values > (the black dots) during the different computations at stage . Between the second and third image, we recursively apply the algorithm, as well as between the fourth and fifth image. Applying the algorithm for =,l-1|}>> with l\n>, we have <\theorem> Let >, n> and let \\> be a primitive -th root of unity in >. Then the -tuple ,\,a|)>> can be recovered from its TFT > using at most shifted additions (or subtractions) and |/2|\>> multiplications with powers of >. <\remark> Even though it seems that the linear transformation ,\>:a\> has a non-zero determinant for any >, an algorithm of the above kind does not always work. The simplest example when our method fails is for and =>. Nevertheless, the condition that > is an initial segment for the bit ordering on ,n-1|}>> is not a necessary condition. For instance, a slight modification of the algorithm also works for final segments and certain more general sets like => for . We will observe in the next section that the bit ordering condition on > is naturally satisfied when we take multivariate TFTs on initial segments of >. Let =2>\\\n=2>> and let >\\> be a primitive >-th root of unity for each . Given subsets ,\> of =,n-1|}>\\\,n-1|}>> and a mapping \\>, the multivariate TFT of is a mapping =TFT\\,\>,\,\>>:\\\> defined by <\eqnarray*> :\>|>|>>|,\,i|)>>|>|,\,i>=,\,j|)>\\>a,\,j>*\>*|]>>>*\*\>*|]>>>>>>> See figure for an example with =\> in dimension . The multivariate TFT and its inverse are computed in a similar way as in the univariate case, using one of the encodings from see section of tuples in > by integers in ,n*\*n-1|}>> and using the corresponding FFT-profile determined by \,\,\> instead of ,\,\,\>. We generalize the bit-ordering >> on sets of the form ,2-1|}>> to sets > of indices by <\equation*> ,\,i|)>\,\,j|)>\i\j\\\i\j. If > is one of the encodings ,\,n>> or ,\,n>> from section , then it can be checked that <\equation*> ,\,i|)>\,\,j|)>\\,\,i|)>\\,\,j|)> If =\> is an initial segment for the bit ordering, this ensures that the inverse TFT also generalizes to the multivariate case. <\big-figure|>,\>|)>>>||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||>|||||>|>|>||||>|>|>|>|||>|>|>|>|>||>|>|>|>|>|>|>|>|>|>|>|>|>>>>>\|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||)>>|||||>||)>>|>||||>||)>>|>|>|||>||)>>|>|>|>||>||)>>|>|>|>|>|>|>|,1|)>>|,1|)>>|,1|)>>|,1|)>>|,1|)>>>>>>>>> Illustration of a in two variables (=\>). \; From the complexity analysis point of view, two special cases are particularly interesting. The first is when =,l-1|}>\\\,l-1|}>> with /2\l\n> for all . Then using the lexicographical encoding, the multivariate TFT can be seen as a succession of univariate TFTs whose coefficients are mappings \\>, with <\equation*> \=,l-1|}>\\\,l-1|}>\,l-1|}>\\\,l-1|}>. 
Setting *\*n>, *\*l> and *\*p>, theorems and therefore generalize to: <\theorem> With the above notations, the direct and inverse TFT of a mapping \\\> can be computed using at most =l*|l>+\+|l>|)>> shifted additions (or subtractions) and |\/2|\>> multiplications with powers of the >. <\remark> The power series analogue of the block case, the problem of multiplication in the ring =\,\,z|]>/>,\,z>|)>> is also an interesting. The easiest instance of this problem is when =\=l=l=2> for some . In that case, we may introduce a new variable and compute in ,\,z|]>/-t,\,z-t,t-1|)>> instead of >, where is the smallest power of two with d+1>. This gives rise to yet another FFT-profile of the form ,\,\,\,\,\,\,\,\,\|)>> and a complexity in *log l|)>>. The trick generalizes to the case when the > are all powers of two, but the general case seems to require non-binary FFT steps. The other important is when =\=n> and =,\,i|)>:i+\+i\l|}>> for some fixed with /2\l\n> (and more generally, one may consider weighted degrees). In this case, we choose the simultaneous encoding for the multivariateTFT (see figure for an example). Now the size of > is =>. At stages ,\,d*i-1>, we notice that the number of ``active nodes'' for the TFT is bounded by =-1|d>*min,/2|)>|)>>. When >, we have <\equation*> B+\+B>=O*log s|)>, which proves: <\theorem> With the above notations, and assuming that >, the direct and inverse TFT of a mapping \\\> can be computed using *log s|)>> shifted additions (or subtractions) and multiplications with powers of >>. <\remark> \ The complexity analysis of the simplicial case in 5> contained a mistake. Indeed, for the multivariate TFT described there, the computed transformed coefficient ,0>> does not depend on ,0,l-1>>, which is incorrect. <\remark> Theorem may be used to multiply formal power series in a similar way as in. This multiplication algorithm requires an additional logarithmic overhead. Contrary to what was stated in , we again need the assumption that >. <\big-figure|>||>||>||>>|||||||>|>||>||>||>>>>> Illustration of the different stages of a bivariate simplicial TFT for =n=l=8>. The black dots correspond to the active nodes at each stage. The arrows indicate two nodes which will be crossed between the current and the next stage. In order to implement the TFT and its inverse, one first has to decide how to represent the sets > and >, and mappings \\>. A convenient approach is to represent >, > and (for the direct TFT) or >, > and (for the inverse TFT) by a single binary ``TFT-tree''. Such a binary tree corresponds to what happens on a subset \,n-1|}>> of the form =,\,*2-1|}>> with \>. The binary tree is of one of the following forms: <\description> When the tree is reduced to a leaf, then it explicitly contains a map \\\>, which is represented by an array of length > in memory. By convention, if is reduced to a null-pointer, then represents the zero map. Moreover, the node contains a flag in order to indicate whether \\> or \\> (for leafs, it is assumed that we either have \\> or \\=\> and similarly for >). When the tree consists of a binary node with subtrees > and >, then> encodes what happens on the interval >>, whereas > encodes what happens on the interval >>. For convenience, the tree still contains a flag to indicate whether \\\\> or \\\\>. Notice that the bounds of the interval > need not to be explicitly present in the representation of the tree; when needed, they can be passed as parameters for recursive functions on the trees. 
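To make this description more concrete, here is a minimal C++ sketch of such a TFT-tree. It only illustrates the shape of the data structure: the member names, the use of a vector for leaf arrays and the traversal helper are our own choices and need not reflect the actual Mmxlib implementation.

<\cpp-code>
#include <cstddef>
#include <memory>
#include <vector>

// A binary "TFT-tree" describing what happens on an interval
// I_t = {t*2^q, ..., (t+1)*2^q - 1}.  The bounds of the interval are not
// stored in the nodes; they are passed along recursive calls.
template <typename C>
struct tft_tree {
  bool in_source;                     // flag: does this part belong to the source set
                                      // (otherwise to the target set)?
  std::vector<C> values;              // leaf only: the map on I_t as an array of length 2^q;
                                      // an empty array plays the role of the null pointer (zero map)
  std::unique_ptr<tft_tree> bottom;   // binary node only: subtree for the bottom half of I_t
  std::unique_ptr<tft_tree> top;      // binary node only: subtree for the top half of I_t
  bool is_leaf() const { return !bottom && !top; }
};

// Generic traversal which applies 'action(node, start, length)' to every leaf,
// reconstructing the interval bounds on the way down.
template <typename C, typename Action>
void for_each_leaf(const tft_tree<C>& t, std::size_t start, std::size_t length,
                   Action action) {
  if (t.is_leaf()) { action(t, start, length); return; }
  if (t.bottom) for_each_leaf(*t.bottom, start, length / 2, action);
  if (t.top) for_each_leaf(*t.top, start + length / 2, length / 2, action);
}
</cpp-code>

Passing the interval bounds as recursion parameters, as in for_each_leaf above, keeps the nodes themselves small.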
Only the direct TFT has currently been implemented in . This implementation roughly follows the above ideas and can be used in combination with arbitrary FFT-profiles. However, the basic arithmetic with ``TFT-trees'' has not been very well optimized yet. In tables --, we have given a few benchmarks on a 2.4 GHz AMD architecture with 512 MB of memory. We use /2+1|)>*\> as our coefficient ring and we tested the simplicial multivariate TFT for total degree > n> and dimension . The tables show the size of the input |\|>>, the real and average running times > and =t/s>, the theoretical number of crossings to be computed >, its average =c/s> and the ratio /c=t/c>. The number > corresponds to the running time divided by the running time of the univariate TFT for the same input size. The tables show both the most favourable cases, when is a power of two, and the worst cases, when is a power of two plus one. <\big-table|> Timings for the direct univariate TFT of degree > n>. <\big-table|> Timings for the direct bivariate TFT of total degree > n>. <\big-table|> Timings for the direct three-dimensional TFT of total degree > n>. The last colored rectangles correspond to an input for which the computer started heavy swapping. \; <\big-table|> Assorted timings for higher dimensional simplicial TFTs. A few conclusions can be drawn from tables --. First of all, in the univariate case, the TFT behaves more or less as theoretically predicted. The non-trivial ratios /c> in higher dimensions indicate that the set > is quite disconnected. Hence, optimizations in the implementation of TFT-trees may yield non-trivial gains. The differences between the running times for the best and worst cases in higher dimensions show that the multivariate TFT is still quite sensitive to the jump phenomenon (even though less than a full FFT). Finally, in the best case, the ratio > becomes reasonably small. Indeed, > should theoretically tend to and should be bounded by ! in order to gain something over a full FFT. However, > can become quite large in the worst case. The various observations suggest the following future improvements. In the case of simplicial TFTs, the number of active nodes swells quite dramatically at the first stages of the TFT (the size may roughly be multiplied by a factor >). In this special case, it may therefore be good to compress the first d steps into a single step. Moreover, one benefits from the fact that this compression can be done in a particularly efficient way. Indeed, let \hat{a}^{(d)}_{i_1,\ldots,i_d} be the transform of a_{i_1,\ldots,i_d} after the first d steps. We have <\equation*> \hat{a}^{(d)}_{i_1,\ldots,i_d} = a_{i_1,\ldots,i_d} + a_{i_1+n/2,i_2,\ldots,i_d} + \cdots + a_{i_1,\ldots,i_{d-1},i_d+n/2} for all (i_1,\ldots,i_d) \in \mathcal{I} \cap \{0,\ldots,n/2-1\}^d and <\equation*> \hat{a}^{(d)}_{i_1,\ldots,i_{k-1},i_k+n/2,i_{k+1},\ldots,i_d} = \hat{a}^{(d)}_{i_1,\ldots,i_d} - 2*a_{i_1,\ldots,i_{k-1},i_k+n/2,i_{k+1},\ldots,i_d} for each k \in \{1,\ldots,d\}. It can be hoped that this acceleration approximately reduces the current running times for the worst case to the current running times for the best case.
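As an illustration of the two formulas above, the compressed first steps can be written down explicitly in the bivariate case d=2. The following sketch operates on a dense n x n row-major array containing the a_{i_1,i_2} (zero outside the simplex), whereas a real implementation would only touch the active nodes of the simultaneous encoding; the dense storage, the function name and the constructibility of C from small integers are illustrative assumptions.

<\cpp-code>
#include <cstddef>
#include <vector>

// Compressed first steps of a simplicial TFT in the bivariate case (d = 2).
// The coefficient a_{i1,i2} is stored at a[i1 * n + i2]; coefficients with
// i1 + i2 >= n are zero, so no nonzero coefficient has both indices >= n/2.
template <typename C>
void compressed_first_steps_2d(std::vector<C>& a, std::size_t n) {
  const std::size_t h = n / 2;
  for (std::size_t i1 = 0; i1 < h; ++i1)
    for (std::size_t i2 = 0; i2 < h; ++i2) {
      const C low = a[i1 * n + i2];
      const C s1  = a[(i1 + h) * n + i2];       // index shifted by n/2 in the first variable
      const C s2  = a[i1 * n + (i2 + h)];       // index shifted by n/2 in the second variable
      const C sum = low + s1 + s2;              // \hat{a}^{(d)}_{i1,i2}
      a[i1 * n + i2]       = sum;
      a[(i1 + h) * n + i2] = sum - C(2) * s1;   // \hat{a}^{(d)}_{i1+n/2,i2}
      a[i1 * n + (i2 + h)] = sum - C(2) * s2;   // \hat{a}^{(d)}_{i1,i2+n/2}
    }
}
</cpp-code>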
For high dimensions, it may be useful to apply the trick a second time for the next stages ,2*d>; however, the corresponding formulas become more complicated. We hope that further study will enable the replacement of the condition > in theorem by a better condition, like >. At the other end, the overhead in the manipulation of TFT-trees may be reduced by allowing only leafs whose corresponding arrays have a minimal size of , , , or more. Also, if the intersection of > with the interval > of a TFT-tree has a high density (say >1/2>), then we may replace the TFT-tree by a leaf. When the TFT is used for polynomial or power series multiplication, and is slightly above a power of two >, then classical techniques may be used for reducing the jump phenomenon. For instance, the polynomials and may be decomposed as >+A>> and >+B>>, where >> is the part of total degree> 2>, >=A-A>> and similarly for . Then >*B>> is computed using the TFT at order > and the remainder >*B>+A>*B>+A>*B>> by a more naive method. Assume now that > is a ring such that |]>> has many >-th roots of unity. Typically, this is the case when > is the ``field'' of floating point numbers. A disadvantage of the standard FFT or TFT in this context is that the size of the input is doubled in order to represent numbers in |]>>. This doubling of the size actually corresponds to the fact that the FFT or TFT contains redundant information in this case. Indeed, given +\+a X> with >, we have 1>|)>=|)>|\>> for any -th root of unity >. This suggests evaluating either in > or 1>>. With the notations from section , let \,n-1|}>> be an initial segment for the bit ordering >>. As our target set >, we take <\equation*> \=\\,12,\,n/2,\,3*n/4-1|}>, The ``real TFT'' now consists of applying the usual TFT with source > and target >. It can be checked that crossings of ``real nodes'' introduce ``complex nodes'' precisely when one of the two nodes disappears at the next stage (see figure ). It can also be checked that the inverse TFT naturally adapts so as to compute the inverse real TFT. For ``sufficiently connected'' sets >, like =,l-1|}>>, the real TFT is asymptotically twice as fast as the complex TFT. <\big-figure|> Schematic representation of a real FFT. The white nodes correspond to real numbers and the black nodes to complex numbers. <\bibliography|bib|alpha|relaxed> <\bib-list|[99]> D. G. Cantor and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28:693--701, 1991. J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297--301, 1965. A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281--292, 1971. J. van der Hoeven. Relax, but don't be too lazy. J. Symbolic Comput., 34:479--542, 2002. J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290--296, Univ. of Cantabria, Santander, Spain, July 4--7 2004. J. van der Hoeven et al. Mmxlib: the standard library for Mathemagix, 2002--2005. .