On the complexity of integer matrix multiplication

David Harvey
Joris van der Hoeven (LIX, CNRS, École polytechnique, 91128 Palaiseau Cedex, France)

Abstract. Let $M(n)$ denote the bit complexity of multiplying $n$-bit integers, let $\omega \in (2,3]$ be an exponent for matrix multiplication, and let $\lg^* n$ be the iterated logarithm. Assuming that $\lg d = O(n)$ and that $M(n)/(n \log n)$ is increasing, we prove that $d \times d$ matrices with $n$-bit integer entries may be multiplied in
\[ O\bigl(d^2 M(n) + d^{\omega} n \, 2^{O(\lg^* n - \lg^* d)} \, M(\lg d)/\lg d\bigr) \]
bit operations. In particular, if $n$ is large compared to $d$, say $d = O(\log n)$, then the complexity is only $O(d^2 M(n))$.

1. Introduction

In this paper we study the complexity of multiplying $d \times d$ matrices whose entries are integers with at most $n$ bits. We are particularly interested in the case that $n$ is very large compared to $d$, say $d = O(\log n)$. All complexity bounds refer to deterministic bit complexity, in the sense of the multi-tape Turing model [Pap94].

Matrices with large integer coefficients appear naturally in several areas. A first application is the efficient high precision evaluation of so-called holonomic functions (such as $\exp$, $\log$, $\sin$, Bessel functions, and hypergeometric functions) using a divide and conquer technique [CC90, HP97, vdH99, vdH01, vdH07]. Another application concerns recent algorithms for computing the $L$-series of algebraic curves [Har14, HS14]. The practical running time in these applications is dominated by the multiplication of matrices with large integer entries, and it is vital to have a highly efficient implementation of this fundamental operation. Typical parameters for these applications are $n$ around $10^8$ bits, and $d$ around $10$.

In this paper, we focus mainly on theoretical bounds. We write $M_d(n)$ for the cost of the $d \times d$ matrix multiplication, and $M(n) := M_1(n)$ for the cost of multiplying $n$-bit integers. We will also write $M^R_d(n)$ for the algebraic complexity of multiplying $d \times d$ matrices whose entries are polynomials of degree less than $n$ over an abstract effective ring $R$, and $M^R(n) := M^R_1(n)$.

Schönhage and Strassen [SS71] used fast Fourier transforms (FFTs) to prove that $M(n) = O(n \log n \log \log n)$ for large $n$. Fürer [Für09] improved this to $M(n) = O(n \log n \, 2^{O(\lg^* n)})$, where $\lg^*$ is the iterated logarithm, i.e.,
\[ \lg n := \lceil \log n / \log 2 \rceil, \qquad \lg^* n := \min \{ k \in \mathbb{N} : \lg^{\circ k} n \leq 1 \}, \qquad \lg^{\circ k} := \underbrace{\lg \circ \cdots \circ \lg}_{k \times}, \]
and this was recently sharpened to $M(n) = O(n \log n \, 8^{\lg^* n})$ [HvdHL14a]. The best currently known bound for $M^R(n)$ is $M^R(n) = O(n \log n \log \log n)$ [CK91]; if $R$ is a ring of finite characteristic, this may be improved to $M^R(n) = O(n \log n \, 8^{\lg^* n})$ [HvdHL14b].

The algebraic complexity of $d \times d$ matrix multiplication is usually assumed to be of the form $O(d^{\omega})$, where $\omega$ is a so-called exponent of matrix multiplication [vzGG03, Ch. 12]. Classical matrix multiplication yields $\omega = 3$, and Strassen's algorithm [Str69] achieves $\omega = \log 7 / \log 2 \approx 2.807$. The best currently known exponent $\omega < 2.3728639$ was found by Le Gall [Gal14, CW87]. When working over the integers and taking into account the growth of coefficients, the general bound for matrix multiplication specialises to
\[ M_d(n) = O\bigl(d^{\omega} M(n + \lg d)\bigr). \]
Throughout this paper we will enforce the very mild restriction that $\lg d = O(n)$. Under this assumption the above bound simplifies to
\[ M_d(n) = O\bigl(d^{\omega} M(n)\bigr). \]

The main result of this paper is the following improvement.

Theorem 1. Assume that $M(n)/(n \log n)$ is increasing. Let $C \geq 1$ be a constant. Then
\[ M_d(n) = O\bigl(d^2 M(n) + d^{\omega} n \, 2^{O(\lg^* n - \lg^* d)} \, M(\lg d)/\lg d\bigr), \tag{1} \]
uniformly for all $d \geq 1$, $n \geq 2$ with $\lg d \leq C n$. In particular, if $n$ is large compared to $d$, say $d = O(\log n)$, then (1) simplifies to
\[ M_d(n) = O\bigl(d^2 M(n)\bigr). \tag{2} \]

This bound is essentially optimal (up to constant factors), in the sense that we cannot expect to do better for $d = 1$, and the bound grows proportionally to the input and output size as a function of $d$.

The new algorithm has its roots in studies of analogous problems in the algebraic complexity setting.
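As a point of reference for how slowly the factor $2^{O(\lg^* n - \lg^* d)}$ in Theorem 1 grows, the following small Python sketch (ours, purely illustrative; not part of the algorithm) transcribes the definitions of $\lg$ and $\lg^*$ above.

    def lg(x):
        # lg x := ceil(log2 x); for an integer x >= 2 this equals (x - 1).bit_length()
        return (x - 1).bit_length()

    def lg_star(x):
        # lg* x := min { k >= 0 : applying lg k times to x gives a value <= 1 }
        k = 0
        while x > 1:
            x, k = lg(x), k + 1
        return k

    # lg_star(16) == 3, lg_star(2**16) == 4, lg_star(2**65536) == 5;
    # lg* is at most 5 for any n that could conceivably arise in practice,
    # so 2^{O(lg* n - lg* d)} behaves like a constant factor.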
When working over an arbitrary effective ring $R$, a classical technique for multiplying polynomial matrices is to use an evaluation--interpolation scheme. There are many different evaluation--interpolation strategies [vdH10, §2.1--2.3], such as Karatsuba, Toom--Cook, FFT, Schönhage--Strassen and general multi-point. The efficiency of a particular evaluation--interpolation strategy can be expressed in terms of two quantities: the complexity $E^R(n)$ of evaluation/interpolation and the number $N^R(n)$ of evaluation points. In terms of these quantities, we have
\[ M^R_d(n) = O\bigl(d^2 E^R(n) + d^{\omega} N^R(n)\bigr). \tag{3} \]

If $R$ admits many primitive $2^k$-th roots of unity, then the FFT provides an efficient evaluation--interpolation strategy that achieves $E^R(n) = O(n \log n)$ and $N^R(n) = O(n)$. Moreover, when using the TFT [vdH04], one may take $N^R(n) = 2n - 1$, which is optimal. If $R$ is a field of characteristic zero, or a finite field with sufficiently many elements, then Bostan and Schost proved [BS05, Thm. 4] that one may achieve $E^R(n) = O(M^R(n))$ and $N^R(n) = 2n - 1$ by evaluating at geometric sequences. Thus, in this situation they obtain
\[ M^R_d(n) = O\bigl(d^2 M^R(n) + d^{\omega} n\bigr). \tag{4} \]

In the setting of integer coefficients, a popular evaluation--interpolation strategy is Chinese remaindering with respect to many small primes of bit length $O(\log n)$. Still assuming that $\lg d = O(n)$, this yields the bound (see [Sto00, §1.7], for instance)
\[ M_d(n) = O\bigl(d^2 M(n) \log n + d^{\omega} n \, M(\log n)/\log n\bigr), \]
and recursive application of this bound yields
\[ M_d(n) = O\bigl(d^2 M(n) \log n + d^{\omega} n \, 2^{O(\lg^* n - \lg^* d)} \, M(\lg d)/\lg d\bigr). \]
Comparing with the algebraic bound (4), we notice an extra factor of $\log n$ in the first term. This factor arises from the cost of computing a certain product tree (and remainder tree) in the Chinese remaindering algorithm.

A well-known method that attempts to avoid the spurious $\log n$ factor is to use FFTs. For example, suppose that we are using the Schönhage--Strassen integer multiplication algorithm. This works by cutting up the integers into chunks of about $\sqrt{n}$ bits, and then performing FFTs over a ring of the form $R = \mathbb{Z}[x]/(x^{2^k} + 1)$, where $2^k \approx \sqrt{n}$. We can multiply integer matrices the same way, by cutting up each entry into chunks of about $\sqrt{n}$ bits, performing FFTs over $R$, and then multiplying the matrices of Fourier coefficients. When $n$ is much larger than $d$, the latter step takes negligible time, and the bulk of the time is spent performing FFTs. Since each matrix entry is only transformed once, this leads to a bound of the type $O(d^2 M(n))$, without the extraneous $\log n$ factor. This method is very efficient in practice; both [HS14] and [HLM+02, HLQ14] contain implementations based on number-theoretic transforms (i.e., FFTs modulo word-sized prime numbers).
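The transform-once strategy is easy to demonstrate. The following minimal Python sketch (ours, for illustration only; the cited implementations use number-theoretic transforms instead) multiplies matrices of nonnegative integers via complex floating-point FFTs, and is exact only while the convolution coefficients stay well below the roughly $2^{53}$ precision limit of double-precision floats.

    import numpy as np

    def matmul_via_fft(A, B, chunk_bits=8):
        # Cut each entry into base-2^chunk_bits chunks (the coefficients of a
        # polynomial evaluated at 2^chunk_bits), transform each entry ONCE,
        # multiply the d x d matrices of Fourier coefficients pointwise,
        # and transform each entry of the product back.
        d = len(A)
        mask = (1 << chunk_bits) - 1
        bits = max(x.bit_length() for M in (A, B) for row in M for x in row)
        m = bits // chunk_bits + 1                   # chunks per entry
        N = 2 * m                                    # room for product coefficients
        chop = lambda x: [(x >> (chunk_bits * t)) & mask for t in range(m)]
        fA = [[np.fft.rfft(chop(x), N) for x in row] for row in A]
        fB = [[np.fft.rfft(chop(x), N) for x in row] for row in B]
        C = [[0] * d for _ in range(d)]
        for i in range(d):
            for j in range(d):
                f = sum(fA[i][k] * fB[k][j] for k in range(d))   # cheap when n >> d
                coeffs = np.rint(np.fft.irfft(f, N)).astype(np.int64)
                C[i][j] = sum(int(c) << (chunk_bits * t) for t, c in enumerate(coeffs))
        return C

Each of the $2 d^2$ forward transforms and $d^2$ inverse transforms touches an entry exactly once; the pointwise stage costs only $O(d^{\omega})$ ring operations per Fourier coefficient, which is negligible when $n$ is much larger than $d$.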
However, the theoretical question remains as to whether the $\log n$ overhead can be removed unconditionally, independently of the "internal structure" of the currently fastest algorithm for integer multiplication. Our Theorem 1 shows that this is indeed the case.

More precisely, we reduce integer matrix multiplication to the multiplication of matrix polynomials over $\mathbb{Z}/p^{\lambda}\mathbb{Z}$ for a suitable prime power $p^{\lambda}$. The multiplication of such polynomials is done using FFTs. However, instead of using a classical algorithm for computing the FFT of an individual polynomial, we reduce this problem back to integer multiplication using Bluestein's trick [Blu70] and Kronecker substitution [vzGG03, Ch. 8]. This central idea of the paper will be explained in section 2. In section 3, we prove our main complexity bound (1).

We stress that Theorem 1 is a theoretical result, and we do not recommend our algorithm for practical computations. For any given FFT-based integer multiplication algorithm, it should always be better, by a constant factor, to apply the same FFT scheme to the matrix entries directly, as outlined above. See Remark 6 for further discussion of the implied big-$O$ constant.

Remark 2. The observation that the Bluestein--Kronecker combination leads to a particularly efficient FFT algorithm was announced previously in [HvdHL14a]. We mention as a historical remark that the development of the main ideas of the present paper actually preceded [HvdHL14a].

2. Bluestein--Kronecker reduction

We begin with a lemma that converts a certain polynomial evaluation problem to integer multiplication.

Lemma 3. Assume that $M(n)/n$ is increasing. Let $p$ be an odd prime, let $\lambda \geq 1$ be an integer, and let $\zeta \in (\mathbb{Z}/p^{\lambda}\mathbb{Z})^*$ be an element of order $p - 1$. Given as input $F \in (\mathbb{Z}/p^{\lambda}\mathbb{Z})[x]$, with $\deg F < p - 1$, we may compute $F(1), F(\zeta), \ldots, F(\zeta^{p-2}) \in \mathbb{Z}/p^{\lambda}\mathbb{Z}$ in time
\[ O\bigl(M(\lambda p \lg p)\bigr). \]

Proof. Let $S := \mathbb{Z}/p^{\lambda}\mathbb{Z}$ and let $F = \sum_{j=0}^{p-2} F_j x^j \in S[x]$. We first use Bluestein's trick [Blu70] to convert the evaluation problem to a polynomial multiplication problem. Namely, from the identity $ij = \binom{i}{2} + \binom{-j}{2} - \binom{i-j}{2}$ we obtain
\[ F(\zeta^i) = \sum_{j=0}^{p-2} F_j \zeta^{ij} = H_i \sum_{j=0}^{p-2} F'_j G_{i-j}, \tag{5} \]
where
\[ H_k = \zeta^{\binom{k}{2}}, \qquad F'_k = \zeta^{\binom{-k}{2}} F_k, \qquad G_k = \zeta^{-\binom{k}{2}}. \]
Since $H_{k+1} = \zeta^k H_k$ and $G_{k+1} = \zeta^{-k} G_k$, we may easily compute $H_0, \ldots, H_{p-2}$ and $G_{-(p-2)}, \ldots, G_{p-2}$ from $\zeta$ and $\zeta^{-1} = \zeta^{p-2}$ using $O(p)$ ring operations in $S$. Similarly, we may obtain the $F'_k$ from the $F_k$ using $O(p)$ ring operations.

The sum $\sum_j F'_j G_{i-j}$ in (5) may be interpreted as the coefficient of $x^i$ in the product of the (Laurent) polynomials
\[ F' = \sum_{k=0}^{p-2} F'_k x^k, \qquad G' = \sum_{k=-(p-2)}^{p-2} G_k x^k. \]
Thus it suffices to compute the product $F' \cdot (x^{p-2} G')$ in $S[x]$. To compute this product, we lift the problem to $\mathbb{Z}[x]$, and use Kronecker substitution [vzGG03, Ch. 8] to convert it to an integer multiplication problem. The coefficients of $F'$ and $x^{p-2} G'$ are bounded by $p^{\lambda}$, and their degrees by $2p$, so the coefficients of their product in $\mathbb{Z}[x]$ have at most $\lg(2p \cdot p^{2\lambda}) = O(\lambda \lg p)$ bits. Therefore the integers being multiplied have at most $O(\lambda p \lg p)$ bits, leading to the desired $O(M(\lambda p \lg p))$ bound. The remaining work consists of $O(p)$ ring operations in $S$, contributing a further $O(p \, M(\lambda \lg p)) = O(M(\lambda p \lg p))$ bit operations, since $M(n)/n$ is increasing. ∎
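The lemma is short enough to transcribe directly. The Python sketch below (ours; toy parameters, with none of the lemma's asymptotic guarantees) follows the proof line by line: the recurrences for $H_k$ and $G_k$, the shifted Laurent product, and a single big-integer multiplication via Kronecker substitution. It assumes Python 3.8+ for pow(x, -1, q).

    def bluestein_eval(F, zeta, p, lam):
        # Evaluate F (a list of p-1 coefficients over S = Z/p^lam Z) at
        # 1, zeta, ..., zeta^(p-2), where zeta has multiplicative order p-1:
        # F(zeta^i) = H_i * sum_j F'_j G_{i-j}, computed with ONE integer product.
        q = p ** lam
        n = p - 1
        H, Gp = [1] * (n + 1), [1] * (n + 1)   # H_k = zeta^C(k,2), Gp_k = zeta^(-C(k,2))
        zk, gk, zinv = 1, 1, pow(zeta, -1, q)  # zk = zeta^k, gk = zeta^(-k)
        for k in range(n):                     # first-order recurrences from the proof
            H[k + 1] = H[k] * zk % q
            Gp[k + 1] = Gp[k] * gk % q
            zk, gk = zk * zeta % q, gk * zinv % q
        Fp = [H[j + 1] * F[j] % q for j in range(n)]   # F'_j, since C(-j,2) = C(j+1,2)
        # Laurent coefficients G_k for k = -(n-1), ..., n-1, shifted by n-1;
        # negative indices via G_{-k} = Gp[k+1]
        g = [Gp[n - t] for t in range(n - 1)] + Gp[:n]
        B = n.bit_length() + 2 * q.bit_length()        # product coefficients < n * q^2
        pack = lambda c: sum(x << (B * t) for t, x in enumerate(c))
        prod = pack(Fp) * pack(g)                      # the single big multiplication
        blk = (1 << B) - 1
        return [H[i] * ((prod >> (B * (i + n - 1))) & blk) % q for i in range(n)]

    # quick check with p = 7, lam = 2 (q = 49): 3 is a primitive root modulo 49,
    # so zeta = 3**7 % 49 = 31 has order exactly p - 1 = 6
    F, q = [1, 2, 3, 4, 5, 6], 49
    assert bluestein_eval(F, 31, 7, 2) == \
        [sum(F[j] * pow(31, i * j, q) for j in range(6)) % q for i in range(6)]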
3. Integer matrix multiplication

Proposition 4. Assume that $M(n)/(n \log n)$ is increasing. Let $C \geq 1$ be a constant. Then
\[ M_d(n) = O\bigl(d^2 M(n) + (n/\lg(n d)) \, M_d(\lg(n d))\bigr), \]
uniformly for all $d \geq 1$, $n \geq 2$ with $\lg d \leq C n$.

Proof. The input consists of matrices $A = (A_{ij})$ and $B = (B_{ij})$, where $1 \leq i, j \leq d$ and $A_{ij}, B_{ij} \in \mathbb{Z}$, $|A_{ij}| < 2^n$, $|B_{ij}| < 2^n$. We wish to compute the product $AB$.

Let $b := \lg n + \lg d$ and $m := \lceil n/b \rceil$. Note that $b = O(n)$, since we assumed that $\lg d \leq C n$. We split the entries $A_{ij}$ into chunks of $b$ bits, choosing $P_{ij} \in \mathbb{Z}[x]$ so that $P_{ij}(2^b) = A_{ij}$, with $\deg P_{ij} < m$ and the coefficients of $P_{ij}$ bounded in absolute value by $2^b$. Similarly choose $Q_{ij} \in \mathbb{Z}[x]$ for $B_{ij}$. Let $P = (P_{ij})$ and $Q = (Q_{ij})$ be the corresponding $d \times d$ matrices of polynomials. The coefficients of the entries of $PQ$ are bounded in absolute value by $2^{2b} d m$, and thus have bit size $O(b)$. The product $AB = (PQ)(2^b)$ may be deduced from $PQ$ in time $O(d^2 m b) = O(d^2 n)$. Thus we have reduced to the problem of computing $PQ$.

The degrees of the entries of $PQ$ are less than $2m$. Let $p$ be the least odd prime such that $p - 1 \geq 2m$. By [BHP01] we have $p = 2m + O(m^{0.525}) = O(m)$. We may find $p$ by testing candidates successively; the cost is $m^{1+o(1)} = o(n)$, since each candidate requires $(\lg m)^{O(1)}$ bit operations [AKS04]. To compute $PQ$, it suffices to compute $PQ$ modulo $p^{\lambda}$, where $\lambda \geq 1$ is the least positive integer for which $p^{\lambda} > 2 \cdot 2^{2b} d m$. Since $p^{\lambda} \leq 2 \cdot 2^{2b} d m p$, we have $\lambda \lg p \leq 2b + \lg d + \lg m + \lg p + 1 = O(b)$.

Our plan is to compute $PQ$ over $S := \mathbb{Z}/p^{\lambda}\mathbb{Z}$ by means of an evaluation--interpolation scheme, using Lemma 3 for the evaluations. The lemma requires a precomputed element $\zeta \in S^*$ of order $p - 1$. To find $\zeta$, we first compute a generator of $(\mathbb{Z}/p\mathbb{Z})^*$ in (deterministic) time $O(p^{1/4+\epsilon}) = o(n)$ [Shp96], and then lift it to a suitable $\zeta \in S^*$ in time $O(M(\lambda \lg p) \lg p) = O(b^2 \lg n)$ [vzGG03, Ch. 15]. This last bound lies in $O(d^2 M(n))$ (one may check the cases $n > d$ and $n \leq d$ separately).

Having selected $p$, $\lambda$ and $\zeta$, we now apply Lemma 3 to each matrix entry to evaluate $P_{ij}(\zeta^k) \in S$ and $Q_{ij}(\zeta^k) \in S$ for $0 \leq k < p - 1$. This step takes time $O(d^2 M(\lambda p \lg p)) = O(d^2 M(n))$. Next we perform the pointwise multiplications $(PQ)(\zeta^k) = P(\zeta^k) Q(\zeta^k)$. These are achieved by first lifting to integer matrix products, and then reducing the results modulo $p^{\lambda}$. The integer products cost $O(p \, M_d(\lambda \lg p)) = O((n/\lg(n d)) \, M_d(\lg(n d)))$, which accounts for the second term of the proposition (the constant factor in the argument $\lambda \lg p = O(b) = O(\lg(n d))$ is absorbed by splitting each such product into a bounded number of smaller ones). The bit size of the entries of the products is bounded by $\lambda \lg p + \lg d = O(b)$, so the reduction step costs $O(d^2 p \, M(b)) = O(d^2 M(n))$. Since the evaluation is really a discrete Fourier transform over $S$, the interpolation step is algebraically the same, with $\zeta$ replaced by $\zeta^{-1}$. Thus we may recover the $(PQ)_{ij}$ using Lemma 3 again, with cost $O(d^2 M(n))$. There is also a final division (scaling) by $p - 1$, whose cost is subsumed into the above.

In the Turing model, we must also take into account the cost of data rearrangement. Specifically, in the above algorithm we switch between "matrix of vectors" and "vector of matrices" representations of the data. Using the algorithm in the Appendix to [BGS07], this needs only $O\bigl((d^2 p \lambda \lg p) \lg \min(d^2, p)\bigr) = O(d^2 n \log n) = O(d^2 M(n))$ bit operations, since we assumed that $M(n)/(n \log n)$ is increasing. ∎

Remark 5. We could replace $\mathbb{Z}/p^{\lambda}\mathbb{Z}$ by a "ring" of finite-precision approximations to complex numbers, and obtain results of the same strength. The latter approach has the disadvantage that it introduces a tedious floating-point error analysis.
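For concreteness, here is a toy end-to-end Python model of the algorithm from the proof of Proposition 4 (ours, illustrative only): it assumes nonnegative entries, reuses bluestein_eval from the sketch in section 2, and substitutes brute-force prime and generator searches for the deterministic procedures cited above.

    def is_prime(x):
        return x > 1 and all(x % r for r in range(2, int(x ** 0.5) + 1))

    def prime_factors(x):
        fs, r = set(), 2
        while r * r <= x:
            while x % r == 0:
                fs.add(r)
                x //= r
            r += 1
        return fs | ({x} if x > 1 else set())

    def matmul_eval_interp(A, B, n_bits):
        d = len(A)
        b = n_bits.bit_length() + max(d - 1, 1).bit_length()  # chunk size ~ lg n + lg d
        m = -(-n_bits // b)                                   # ceil(n / b)
        p = 2 * m + 1                                         # least odd prime, p - 1 >= 2m
        while not is_prime(p):
            p += 2
        lam = 1                                               # least lam: p^lam > 2^(2b) d m
        while p ** lam <= (1 << (2 * b)) * d * m:
            lam += 1
        q = p ** lam
        g = next(g for g in range(2, p)                       # primitive root modulo p
                 if all(pow(g, (p - 1) // r, p) != 1 for r in prime_factors(p - 1)))
        zeta = pow(g, p ** (lam - 1), q)                      # lift: order exactly p - 1
        chop = lambda x: ([(x >> (b * t)) & ((1 << b) - 1) for t in range(m)]
                          + [0] * (p - 1 - m))
        ev = lambda M, z: [[bluestein_eval(chop(x), z, p, lam) for x in row] for row in M]
        Ah, Bh = ev(A, zeta), ev(B, zeta)                     # transform every entry once
        Ch = [[[sum(Ah[i][k][t] * Bh[k][j][t] for k in range(d)) % q
                for t in range(p - 1)] for j in range(d)] for i in range(d)]
        s = pow(p - 1, -1, q)                                 # final scaling by (p-1)^(-1)
        zinv = pow(zeta, p - 2, q)                            # interpolation: DFT at zeta^(-1)
        return [[sum(c * s % q << (b * t)
                     for t, c in enumerate(bluestein_eval(Ch[i][j], zinv, p, lam)))
                 for j in range(d)] for i in range(d)]

    # e.g. matmul_eval_interp([[123456789, 42], [7, 99999999]],
    #                         [[314159, 1], [2, 271828]], 30)
    # agrees with the schoolbook matrix product.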
Now we may prove the main theorem announced in the introduction.

Proof of Theorem 1. First consider the region $(\lg d)/C \leq n \leq d$. Proposition 4 yields
\[ M_d(n) = O\bigl(d^2 M(n) + n \, M_d(\lg d)/\lg d\bigr) = O\bigl(d^2 M(n) + d^{\omega} n \, M(\lg d)/\lg d\bigr), \]
and for such $n$ we have $\lg^* n = \lg^* d + O(1)$, so the desired bound holds in this region.

Now consider the case $n > d$. Let $k := \min \{ j \in \mathbb{N} : \lg^{\circ j} n \leq d \} = \lg^* n - \lg^* d + O(1)$, and let $n_j := \lg^{\circ j} n$ for $j = 0, \ldots, k$, so that $n_k \leq d$ and $n_j > d$ for $j < k$. By Proposition 4, there is a constant $K \geq 1$ (depending only on $C$) such that
\[ M_d(n_j) \leq K \bigl(d^2 M(n_j) + (n_j/n_{j+1}) \, M_d(n_{j+1})\bigr), \qquad n_{j+1} = \lg n_j, \]
for any $j < k$. Starting with $n_0 := n$ and iterating $k$ times yields
\[ M_d(n) \leq K d^2 n \left( \frac{M(n_0)}{n_0} + K \frac{M(n_1)}{n_1} + \cdots + K^{k-1} \frac{M(n_{k-1})}{n_{k-1}} \right) + K^k \frac{n}{n_k} M_d(n_k). \]
By the argument in the first paragraph, we may apply Proposition 4 once more (and increase $K$ if necessary) to obtain
\[ M_d(n) \leq K d^2 n \left( \frac{M(n_0)}{n_0} + \cdots + K^k \frac{M(n_k)}{n_k} \right) + K^{k+1} d^{\omega} n \, \frac{M(\lg d)}{\lg d}. \]
The second term lies in $O\bigl(d^{\omega} n \, 2^{O(\lg^* n - \lg^* d)} \, M(\lg d)/\lg d\bigr)$. Since $M(n)/(n \log n)$ is increasing, we have $M(n_j)/n_j \leq (M(n)/n) \cdot (\lg n_j/\lg n)$ for each $j$, so the first term is bounded by
\[ K d^2 M(n) \left( 1 + K \frac{\lg n_1}{\lg n} + \cdots + K^k \frac{\lg n_k}{\lg n} \right) \leq K d^2 M(n) \left( 1 + O\Bigl(K^{k+1} \frac{\lg \lg n}{\lg n}\Bigr) \right) = O\bigl(d^2 M(n)\bigr), \]
since $\lg n_j \leq \lg \lg n$ for $j \geq 1$ and $K^{k+1} = 2^{O(\lg^* n)}$ grows more slowly than $(\lg n)/(\lg \lg n)$. ∎

Remark 6. An important question is to determine the best possible big-$O$ constant in Theorem 1. For simplicity, consider the case where $n$ is much larger than $d$, and define
\[ A := \limsup_{d \to \infty} \limsup_{n \to \infty} \frac{M_d(n)}{d^2 M(n)}. \]
Theorem 1 shows that $A < \infty$. After some optimisations, it is possible to achieve $A \leq 24$. (We omit the details. The idea is to increase the chunk size $b$, say from $\lg n + \lg d$ to $\lg^2 n$, and to use the fact that Bluestein's trick actually produces a negacyclic convolution.) We can do even better if we change the basic problem slightly. Define $M^c(n)$ to be the cost of an $n$-bit cyclic integer convolution, i.e., multiplication modulo $2^n - 1$. This kind of multiplication problem is of interest because all of the fastest known multiplication algorithms, i.e., those based on FFTs, actually compute convolutions. (In this brief sketch we ignore the difficulty that such algorithms typically only work for $n$ belonging to some sparse set.) Let $M^c_d(n)$ be the cost of the corresponding $d \times d$ matrix multiplication (convolution) problem, and let $A^c$ be the corresponding constant defined as above. Then by mapping the Bluestein convolution directly to integer convolution, one saves a factor of two, obtaining $A^c \leq 12$. We conjecture that in fact one can achieve $A^c = 1$. This conjecture can be proved for all integer multiplication algorithms known to the authors, and it is also consistent with measurements of the performance of the implementation described in [HS14]. The point is that the implementation transforms each matrix entry exactly once, and the time spent on the small-coefficient matrix multiplications is negligible if $n$ is large.

Remark 7. It is tempting to interpret the bound (2) of Theorem 1 as an analogue of (4) in the case of integer coefficients. However, several technical problems arise if one wants to make this more precise. Indeed, most "evaluation--interpolation" strategies for integers (apart from Chinese remaindering) involve cutting the integers into several chunks, which prevents the evaluation mappings from being algebraic ring homomorphisms. Moreover, due to carry management, we have to include an additional parameter for the target precision of our evaluations. Thus, in the case of matrix multiplication, we really should be looking for bounds of the form
\[ M_d(n) = O\bigl(d^2 E(n, p) + d^{\omega} N(n, p) \, M(p)\bigr), \]
where $p$ stands for the precision at our evaluation points and $p \geq \lceil \lg d \rceil$. In terms of $E(n) := E(n, p)$ and $N(n) := N(n, p)$ for some small fixed precision $p = \Theta(\lg n)$, we have
\[ E(n) \geq n, \qquad N(n) \geq n/p. \]
Reformulated in this way, our new evaluation--interpolation strategy achieves
\[ E(n) = O(M(n)), \qquad N(n) = O(n/\lg n), \]
and it can be applied to several other problems, such as the multiplication of multivariate polynomials or power series with large integer coefficients [HLQ14].
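The cyclic convolution primitive $M^c(n)$ of Remark 6 has a simple concrete reading: multiplication modulo $2^N - 1$ of packed chunk vectors is exactly cyclic convolution of the chunks. The following toy Python helper (ours, hypothetical, with generous headroom so that no carries corrupt the chunks) illustrates this.

    def cyclic_conv(u, v, b):
        # cyclic convolution of two equal-length vectors of b-bit chunks,
        # computed with ONE multiplication modulo 2^N - 1 (the M^c primitive);
        # the wrap-around 2^N = 2^(B*m) == 1 realises the cyclic index shift
        m = len(u)
        B = 2 * b + m.bit_length()          # headroom: each result chunk is < 2^B
        N = B * m
        pack = lambda c: sum(x << (B * t) for t, x in enumerate(c))
        w = pack(u) * pack(v) % ((1 << N) - 1)
        return [(w >> (B * t)) & ((1 << B) - 1) for t in range(m)]

    # sanity check against the definition w_k = sum_{i+j == k mod m} u_i v_j
    u, v = [3, 1, 4, 1], [5, 9, 2, 6]
    direct = [sum(u[i] * v[(k - i) % 4] for i in range(4)) for k in range(4)]
    assert cyclic_conv(u, v, 4) == direct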
Acknowledgments

The authors thank Arne Storjohann for helpful conversations. The first author was supported by the Australian Research Council, DECRA Grant DE120101293.

Bibliography

[AKS04] Manindra Agrawal, Neeraj Kayal, and Nitin Saxena, PRIMES is in P, Ann. of Math. (2) 160 (2004), no. 2, 781--793.
[BGS07] Alin Bostan, Pierrick Gaudry, and Éric Schost, Linear recurrences with polynomial coefficients and application to integer factorization and Cartier--Manin operator, SIAM J. Comput. 36 (2007), no. 6, 1777--1806.
[BHP01] R. C. Baker, G. Harman, and J. Pintz, The difference between consecutive primes. II, Proc. London Math. Soc. (3) 83 (2001), no. 3, 532--562.
[Blu70] L. Bluestein, A linear filtering approach to the computation of discrete Fourier transform, IEEE Trans. Audio Electroacoust. 18 (1970), no. 4, 451--455.
[BS05] Alin Bostan and Éric Schost, Polynomial evaluation and interpolation on special sets of points, J. Complexity 21 (2005), no. 4, 420--446.
[CC90] D. V. Chudnovsky and G. V. Chudnovsky, Computer algebra in the service of mathematical physics and number theory, Lect. Notes in Pure and Applied Math., vol. 125, Dekker, New York, 1990, pp. 109--232.
[CK91] D. G. Cantor and E. Kaltofen, On fast multiplication of polynomials over arbitrary algebras, Acta Informatica 28 (1991), 693--701.
[CW87] D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, Proc. of the 19th Annual ACM Symposium on Theory of Computing (New York City), May 25--27, 1987, pp. 1--6.
[Für09] Martin Fürer, Faster integer multiplication, SIAM J. Comput. 39 (2009), no. 3, 979--1005.
[Gal14] François Le Gall, Powers of tensors and fast matrix multiplication, Proc. ISSAC 2014 (Kobe, Japan), July 23--25, 2014, pp. 296--303.
[Har14] David Harvey, Counting points on hyperelliptic curves in average polynomial time, Ann. of Math. (2) 179 (2014), no. 2, 783--803.
[HvdHL14a] David Harvey, Joris van der Hoeven, and Grégoire Lecerf, Even faster integer multiplication, preprint, 2014.
[HvdHL14b] David Harvey, Joris van der Hoeven, and Grégoire Lecerf, Faster polynomial multiplication over finite fields, preprint, 2014.
[HLM+02] J. van der Hoeven, G. Lecerf, B. Mourrain, et al., Mathemagix, 2002.
[HLQ14] J. van der Hoeven, G. Lecerf, and G. Quintin, Modular SIMD arithmetic in Mathemagix, Tech. report, ArXiv, 2014.
[vdH99] J. van der Hoeven, Fast evaluation of holonomic functions, Theoret. Comput. Sci. 210 (1999), 199--215.
[vdH01] J. van der Hoeven, Fast evaluation of holonomic functions near and in regular singularities, J. Symbolic Comput. 31 (2001), 717--743.
[vdH04] J. van der Hoeven, The truncated Fourier transform and applications, Proc. ISSAC 2004 (Univ. of Cantabria, Santander, Spain) (J. Gutierrez, ed.), July 4--7, 2004, pp. 290--296.
[vdH07] J. van der Hoeven, Efficient accelero-summation of holonomic functions, J. Symbolic Comput. 42 (2007), no. 4, 389--428.
[vdH10] J. van der Hoeven, Newton's method and FFT trading, J. Symbolic Comput. 45 (2010), no. 8, 857--878.
[HP97] B. Haible and T. Papanikolaou, Fast multiprecision evaluation of series of rational numbers, Tech. Report TI-7/97, Universität Darmstadt, 1997.
[HS14] David Harvey and Andrew V. Sutherland, Computing Hasse--Witt matrices of hyperelliptic curves in average polynomial time, Algorithmic Number Theory: Eleventh International Symposium (ANTS XI), LMS J. Comput. Math. 17 (2014), suppl. A, pp. 257--273.
[Pap94] Christos H. Papadimitriou, Computational complexity, Addison-Wesley, Reading, MA, 1994.
[Shp96] Igor Shparlinski, On finding primitive roots in finite fields, Theoret. Comput. Sci. 157 (1996), no. 2, 273--275.
[SS71] A. Schönhage and V. Strassen, Schnelle Multiplikation großer Zahlen, Computing (Arch. Elektron. Rechnen) 7 (1971), 281--292.
[Sto00] Arne Storjohann, Algorithms for matrix canonical forms, Ph.D. thesis, ETH Zürich, 2000.
[Str69] V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969), 354--356.
[vzGG03] Joachim von zur Gathen and Jürgen Gerhard, Modern computer algebra, second ed., Cambridge University Press, Cambridge, 2003.