> <\body> <\hide-preamble> |||||||||||>>>|>| ||> >>>>>> at various precisions using AMX and IFMA>|<\doc-note> This work has been supported by an ERC-2023-ADG grant for the ODELIX project (number 101142171). <\wide-tabular> ||||||| ||LOGO_ERC-FLAG_FP.png>|61.62pt|28.51pt||>>>>> ||<\author-affiliation> Laboratoire d'informatique de l'École polytechnique LIX, UMR 7161 CNRS Bâtiment Alan Turing, CS35003 1, rue Honoré d'Estienne d'Orves 91120 Palaiseau, France |>>|||<\author-affiliation> Laboratoire d'informatique de l'École polytechnique LIX, UMR 7161 CNRS Bâtiment Alan Turing, CS35003 1, rue Honoré d'Estienne d'Orves 91120 Palaiseau, France |>>|||<\doc-note> > In this paper we study hardware-accelerated integer matrix multiplication, with coefficients of sizes between 8 and 1000 bits. More particularly, we study two relatively new hardware features in Intel CPUs: the IFMA \Pinteger FMA\Q instruction and the AMX matrix extensions. We study various algorithms and analyze to what extent our implementations on top of the JIL library can approach theoretical peak performance. > Arguably the most significant recent trend in computer hardware is the eruption of tensor cores in both GPUs and CPUs. Although driven by applications in artificial intelligence, is there a potential for tensor cores to be used for other purposes? After all, GPUs have found many applications beyond their initial ones in the gaming industry. The main motivation of this paper is to study this question for applications in scientific computation and more particularly in computer algebra and computer arithmetic. We will focus on the problem of n> matrix multiplication with -bit integer entries. Tensor cores are designed to accelerate such computations in the case where is large and . Can this be used to efficiently deal with other bit precisions, ranging from small ones like > to larger ones like >? We mainly restrict our attention to the multiplication of square n> matrices. This is indeed most favorable for tensor cores, but also most significant from a complexity perspective, since the complexity of many problems in linear algebra and beyond can be expressed in terms of the complexity of square matrix products. Of course, this restriction will also simplify comparisons, because varying both and will already give rise to numerous different cases, as we shall see. To be fair, we should also compare implementations that exploit tensor cores with alternative ones that are based on more conventional vector arithmetic. For this reason, we chose to focus on recent processors which support both AMX matrix products and AVX512 vector arithmetic with IFMA. Using AMX, each core can do a64\16>> matrix FMA (MFMA) A\B> in about cycles, where and (of dimensions 64>> and 16>>) have -bit entries and (of dimension 16>) has 32bit entries. Roughly speaking, an integer FMA(IFMA) allow us to compute x\y> in one cycle, where and are -wide vectors of -bit integers and is an -wide vector of bit. Note that we preferred the terminology MFMA over GeMM and (general matrix multiplication \*A*B+\*C>) because AMX only supports positive accumulation. In this paper, we focus on the performance of implementations that use a single core of our CPU. This simplification allows to measure the relevance of different implementation strategies that exploit different hardware features (IFMA or AMX), without being distracted by mostly orthogonal considerations on how to scale up these base implementations on multi-core architectures. All our timings will be relative to theoretical peak performance. For instance, for anaive implementation of n> matrix multiplication using >> IFMAs, we will report on the cost of the matrix multiplication divided by the theoretical cost to perform>> IFMAs. These ratios are fairly independent of the particular machine on which we execute our programs and indicate the quality of our implementation. Quotients close to one mean that we could not have done much better for the algorithm at hand. Higher ratios indicate potential problems such as suboptimal memory access patterns or asymptotically subdominant costs that become high for certain sizes. In an ideal world in which all ratios are one, the theoretical complexity analysis of our methods would correspond perfectly to their practical performance. For completeness, we also added an appendix with absolute timings, both for our new implementation and some existing software. All implementations in this paper are done inside or on top of the C++ library JIL for computations with straight-line programs (SLPs). This work is the first step of aproject to the JIL library with fast linear algebra over the integers and fixed point numbers. We would like to cover all ranges of matrix dimensions and bit-sizes of the integers, while taking advantage of recent hardware features such as IFMA andAMX. Conversely, matrix multiplication is an excellent problem for extending JIL beyond the mere compilation of SLPs. Another important motivation behind the present work is to use matrix multiplication as a case study for how to integrate SLPs inside larger pieces of code such as multiple loops. The integration of AMX instructions into the SLP framework is also a challenge. Multiple precision integer multiplication has a long history and there exist many algorithms for various bit precisions, such as the traditional schoolbook method, Karatsuba's algorithm, Toom's method, and modular methods5>. For high precisions, there are also methods based on the fast Fourier transform, but we will not use them in this paper. GMP and MPFR are the reference libraries for multiple precision integer and floating point arithmetic. We also refer to for some recent work on this topic in the HPC context. In the context of integer matrix multiplication, fast multiple precision methods tend to become useful at already very low precisions. This is due to the fact that all known methods can be regarded as or methods. Roughly speaking, multiplying two n> matrices with *p|)>>-bit entries is reduced to |)>> multiplications of n>> matrices with -bit entries, where the quantity |)>> depends on the method (this will be detailed in Section and Table below). The cost of this reduction scales with >, whereas the cost of the |)>> low precision matrix multiplications scales with >. Consequently, the cost of the reduction becomes negligible for large and, for a given >, it is recommended to pick the method for which |)>> is minimal. Good practical implementations should reflect this fact and this is indeed what our work aims at. Theoretically speaking, n> matrices can be multiplied in time>|)>> with\2.371552>> for large . However, among the asymptotically faster algorithms with\3>, only Strassen's algorithm with =log 7> has allowed for some modest practical gains so far. The recent trend to accelerate matrix multiplication directly in the hardware leads to more significant practical gains. (In fact, these hardware accelerations make the practical complexity of matrix multiplication behave as if > were well below. The potential impact of this observation is another motivation behind our work.) Several works and research implementations specifically address the challenges raised by Intel's AMX technologies. also contains a bf16 kernel () and GGML an int8 kernel interleaved by encoding and decoding of quantized tensors (). Other types of CPUs integrate tensor accelerations in different ways. For instance, Apple's secret matrix coprocessor in M1\UM3 CPUs is based on fast \Pouter products\Q, for which separate techniques have been developed. We refer to for an overview of different types of accelerators for matrix multiplication in recent CPUs and to20> for details about Intel's AMX units. The use of tensor cores to multiply matrices of floating-point numbers of precisions beyond the word length of the accelerator is studied in. Georganas propose a batch-reduce kernel that may also be useful for this purpose. Before the emergence of tensor cores, classical reference implementations for matrix multiplication and other routines from linear algebra include , , , and . We refer to for some of the underlying scientific background. For matrices with multiple precision integer entries, the Chinese remainder theorem (CRT) is typically used to reduce matrix multiplication to modular matrix multiplication, which is then performed using a standard BLAS. When working with very high precisions, FFT-based techniques are also used. The library performs well on a large spectrum of matrix dimensions and integer sizes. We refer to for SIMD-accelerated modular arithmetic and to for CRT-based matrix multiplication on modern hardware. Doubling the precision of a BLAS has been considered in. Our work on JIL also draws inspiration from the compilation of codelets for critical tasks as pioneered in and . Matrix multiplication has also been studied from a similar compilation-oriented perspective. We report on an experimental HPC implementation of integer matrix multiplication for a wide range of matrix dimensions and bit sizes of integers. We have spent a comparable amount of effort on exploiting two types of hardware acceleration, namely the IFMA and AMX extensions in recent Intel processors. In Table, we presented an idealized complexity model for large matrix dimensions as a function of the bit size of the coefficients and the integer multiplication method being used. Although the underlying methods are not new, we expect this synthesis to be useful as a guideline for implementors. Our implementations indeed often come close to the theoretical peak performance for which our idealized model is most relevant. This holds especially for our IFMA-based implementation and a bit less for the AMX-based one. For \Plarge\Q n> matrices, say 128>, we observe that AMX-based algorithms outperform the IFMA-based ones. Although the gain is modest for now, we expect that it can be improved with more work. Another potential advantage of AMX is that we have a more fine-grained control over the bit-precision. Our implementations also outperform the current reference libraries and NTL. Throughout this paper, we will write \|}>> for the set of positive integers, including zero, and =,2-1|}>> for the set of unsigned -bit integers. Given aring> and \>, we write > for the set of vectors ,\,a|)>> of length with entries in >. We will also write > for the set of polynomials of degreen> in over >. We will assume the natural dense representation for such polynomials +\+a*x> by their vectors ,\,a|)>> of coefficients. We formally extend the polynomial notation to the case when is replaced by an integer radix , typically of the form >> for some \\>. This allows us to rewrite unsigned integers in \>> as \Ppolynomials\Q +i*B+\+i-1>*B-1>\\>>|]>>> in the radix. Following terminology from GMP, it is customary to call the coefficients |i,\,i-1>|\>\\>> . If >> is a machine integer type, then we may represent an integer \>> by its vector ,\,i-1>|)>> of limbs. For dimensions \>, we also write c>> for the set of c> matrices with entries in>. Unless stated otherwise, we will always assume that matrices |)>\\c>> are represented in row-major order on a computer, as vectors ,\,a,\,,\,a>|)>>. By \Ps\c> matrix multiplication\Q, we understand the multiplication of an s> matrix with an c> matrix into an c> matrix. Let > be the machine precision (or one of the available machine precisions). Standard arithmetic operations on >> naturally extend to SIMD vectors of width in >>. Whenever a function that only involves basic arithmetic operations needs to be evaluated exactly times, this can be done replacing all operations by their SIMD versions. For instance, given a matrix |)>\\>2>> and a vector |)>\\>>, consider the function that computes the matrix-vector product *b+a*b,a*b+a*b|)>\\>>. If we need to evaluate this function times, then the simplest solution is to present the input as amatrix |)>\>|)>2>> and a vector |)>\>|)>>, in which case the exact same formula *b+a*b,a*b+a*b|)>\>|)>> yields all matrix-vector products in SIMDformat. We call this way of vectorizing the . This works best whenever agiven function needs to be evaluated a multiple of times. This also assumes that all data types are replaced functorially by their vectorized versions. For instance, the standard vectorization of the matrix type >c>> is>|)>\>>. The standard vectorization of a multiple precision integer type *\>\\>>|]>>> consists of working with \Plimbs\Q that are vectors in >>. Hence SIMD multiple precision integers \*\>> are represented as +a*2>+\+a-1>*2-1|)>*\>\\>>|]>>>> with ,\,a-1>\\>> and>>. Note that this may not seem to be most \P\Q, since one might prefer to represent \*\>> as,\,a|)>>> with ,\,a\\*\>>>. SIMD arithmetic can often be used as well for evaluating a function only once. For instance, the multiplication A*B> of two matrices \s>> and \c>> can benefit from SIMD acceleration whenever *w> is a multiple of the SIMD width . Indeed, it suffices to reinterpret as a matrix \|)>>> and as a matrix \|)>>>>, while casting to a matrix \|)>s>> by repeating each of its entries times. After that, we can compute =*> in the pure SIMD mode. However, these reinterpretations require some gymnastics at least, and exploiting SIMD arithmetic becomes even more delicate when none of the dimensions are a multiple of . We will come back to this in Section below. In this paper, we mainly report on relative timings with respect to the theoretical peak performance. The only exception is the appendix, which contains absolute timings, as well as timings for existing software. All benchmarks were obtained on an processor with 32 cores and running at a clock rate of 2.2 GHz. Our timings rely on the instruction for measuring time in terms of clock cycles. Recall that all our algorithms use only one core of the CPU. Let > be our favorite (and supported) machine bit precision for integer arithmetic, like =32>> or =64>. The machine representation of an integer +a*2>+\+a-1>*2-1|)>*\>\\>>|]>>> is the vector ,\,a-1>|)>> of\Pits\Qlimbs. The hardware is assumed to provide an instruction for multiplying two limbs \>>>, with a result in >> (sometimes, the low and high parts need to be computed separately). For small precisions, multiple precision libraries typically implement integer multiplications +\+a-1>*B-1>|)>*+\+b*B|)>> by cross multiplying all limbs*b> and summing the results while handling carries appropriately. Unfortunately, carry propagation is unpleasant to vectorize and SIMD units typically do not provide \Pvector carry registers\Q (except for certain GPUs that handle this via mask registers). In a SIMD world, it is therefore more convenient to work with representations: we still represent integers as polynomials +a*B+\+a-1>*B-1>> in some radix 2>>, but the limbs ,\,a-1>\\|\>>> are stored with a higher precision|\>> (which may be different for multiplicands and products). If ,\,a-2>>\\>>, then \+|\>-\>>> and the representation is said to be . We will call > the and |\>> the . The difference |\>-\> is the number in the terminology of GMP. On recent Intel Xeon processors there are two instructions > and > for (IFMA). Formally, given \> and \>, we have <\eqnarray*> >|>|b rem 2|)>|)> rem 2>>|>|>|b quo 2|)>|)> rem 2,>>>> where and denote the quotient and the remainder of the euclidean division of by . The is the redundant representation with =52> and|\>=64>. Given a two-limb integer +c*2>\\+|\>>>, we can accumulate a*b> the product of two limbs \>> using one > and one >. As long as ,c\2|\>>-2>>>, this computation does not overflow, but typically does not remain normalized. On even more recent Intel Xeon processors with AMX support, there is an instruction for matricial FMAs A*B>, where \64>>, \16>>, and \16>>. For individual integer coefficients of the product, this corresponds to working with a redundant with =8> and |\>=32>. Note that products of limbs are two limbs long for the IFMA representation, but only one limb long for the AMX representation. Indeed, when using AMX, products and multiplicands are represented using limbs of different -bit and -bit capacities, respectively. Redundant representations have the advantage that the |\>-\> nail bits can be used to accumulate carries, while postponing normalization to the very end. Consider for instance the multiplication of +\+a-1>*B-1>> and +\+b*B> using integer FMAs, where ,\,a-1>,b,\,b\\> and >. Assume also that ,m|)>\2|\>-\>=4096>. Then we first compute the non-normalized product +\+c+m-1>*B+m-1>> with ,\,c+m-1>\\>, and then normalize it, as follows: <\equation*> ||||||||||>>||||>>||>|>|>|,b|)>>|>|>|>|+ quo 2|)>>>|>|>|,b|)>>||>|>| rem 2>>|>|>|,a,b|)>>||>|>|+ quo 2|)>>>|>|>|,a,b|)>>||>|>| rem 2>>|>|>|,b|)>>|||>|>||>|||+m-1>>|>|+m-1>++m-2> quo 2|)>>>|+m-1>>|>|-1>,b|)>>||+m-2>>|>|+m-2> rem 2>>>>> This algorithm has been represented graphically in Figure; the lower and upper triangles correspond to > and > operations, respectively. If =m>, then the mere highest > limbs of the result can be computed similarly. With the naive algorithm, the right picture in Figure shows that this can be done at roughly half the cost of a full multiplication. This is very useful for fixed point arithmetic, by representing unsigned >limb fixed point numbers as >-limb integers divided by >>. It is interesting to note that the |\>-\> nail bits can be exploited to represent small integer parts in this case. However, our current implementation is limited to unsigned fixed point numbers. Signed variants can be obtained by representing signed integers |a,b|\>\\\-1>>> by \a+2-1>> and \b+2-1>> in >>, so that *-+|)>*2-1>+2-2>>>. Alternatively, one may use a two-complement representation for the highest limb and adapt multiple precision multiplication accordingly. But it becomes harder to exploit the extra |\>-\> nail bits in both cases.|gr-frame|>|gr-geometry||gr-grid||gr-grid-old||1>|gr-edit-grid-aspect|||>|gr-edit-grid||gr-edit-grid-old||1>|gr-grid-aspect|||>|gr-grid-aspect-props|||>|gr-text-at-halign|center|gr-text-at-valign|center|magnify|0.840896415||>||>||>||>||>||>||>||>>||>||>||>||>||>||>||>||>||>||>>||>>||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>||>>||>>||>>|-1>|>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>||>>||>>||>>|+m-1>|>>||>||>>||>>||>>|-1>|>>||>>||>>|-1>|>>||>>||>>||>>||>>|-1>|>>||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>||>||>||>||>||>|||>||>||>||>||>||>||>>>> Illustration of a full multiple precision integer product (on the left) and a truncated high product (on the right) using integer FMAs (the lower yellow triangles and the upper orange triangles correspond to > and >, respectively). > We briefly discussed pure SIMD vectorization of multiple precision integers in Section. The idea to represent such integers as polynomials in >> with vectorial limbs also works for redundant representations. This allows us to apply the algorithms from the previous subsection directly to pure SIMD multiple precision integers in |]>>>. In a similar way, matrices of multiple precision integers in >c>> can be rewritten into polynomials in c>|]>>> with matrix limbs in c>>. Computing a non-normalized B> with \64>|]>>>, \16>|]>>>, and \16>|]>-1>> can then be done naively using > AMX-accelerated matrix multiplications 64>\\16>\\16>>. Note that we do not compute low and high parts separately in this case. When doing high >-limb multiplication in this way, this means that we need to do *+1|)>/2> multiplications, which is slightly more than one half of the number for full products. For large n> matrix multiplications with >-limb coefficients, using a naive algorithm for polynomial multiplication becomes suboptimal. If \1> is small, then it becomes better to use Karatsuba's algorithm. Given a ring >, the traditional Karatsuba trick reduces one multiplication with \> into three multiplications in > and several additions and subtractions using the formulas =P*Q>, =P*Q>, and =+P|)>*+Q|)>-R-R>. This trick generalizes to the case where > is replaced by >> and by >>, in which case the multiplication of two polynomials |P,Q|\>\\>\\>>|]>> essentially reduces to three multiplications of polynomials in >>. These latter polynomials can be computed recursively using the same trick. When working with multiple precision integers \>>|]>> one technical difficulty is that +a> and +b> are in +1>> and not necessarily in >>. This makes it better to work with limbs of bit-size -1> instead of >. If our input integers used >-bit limbs, then this requires them to be rewritten into the -1|)>>-bit limb representation and for the output. When using Karatsuba's algorithm recursively for >>-limb integers, we may either directly use limbs of bit-size -\> (which yields a multiplication algorithm for >*-\|)>>-bit integers), or do the limb rewritings recursively as well (which yields amultiplication algorithm for >*-2|)>+2|)>>-bit integers). Karatsuba's trick can also be applied to matrices \-\>n>>|]>>>>. In that case, we need to do >*n> multiplications in >>, but only >*n|)>> additions and subtractions, which becomes negligible for large . Asymptotically, we thus gain a factor >/3>> with respect to naive integer matrix multiplication. Toom generalized Karatsuba's trick and gave a way to reduce the multiplication of two polynomials in >> to -1> multiplications in > at the cost of an increased loss of bit-precision for the integer variant. Interestingly, Karatsuba's trick can also be used directly for \>>, while losing a single bit of precision. The idea (see also4.4.1>, , 1.4>, and) is to observe that the contribution of *Q+P*Q> to > can be computed using *Q+P*Q=+P|)>*+Q|)>-P*Q-P*Q>, for all i\j\\>>. In this way, we need to first compute the > products *Q,\,P-1>*Q-1>> and then *-1|)>/2> further products of the form +P|)>*+Q|)>>, which yields a total number of *+1|)>/2> products. If we just want the high coefficients -1>,\,-1>>, then we only need to do +|\/2|\>*|\/2|\>\\/4> multiplications. This is actually just as good as Toom's algorithm for =3> and almost as good for =4> and =5>. A well known method for multiplying two matrices \*\>n>> is based on the Chinese remainder theorem. We first pick > pairwise coprime moduli ,\,m-1>\\>>. Best is to choose them as large as possible while fitting into > bits. Let m*\*m-1>> and assume that \*\-\>> for some small \0> with 2*\-\>>. In order to compute , we first reduce and modulo > for ,-1>>. There is an elegant way to evaluate the map *\>\\>>;a\|)>2*\>> using avector-matrix product: <\eqnarray*> >|>|-1>>>>>>>|>|>|>|-1>>>>>>*|>|>|> rem m>||> rem m-1>>>|>||>>|-1|)>*\> rem m>|>|-1|)>*\> rem m-1>>>>>>>>>> <\eqnarray*> >|>|-1>>>>>>>|| rem m>|>|-1> rem m-1>>>>>>.>>>> Taking the limbs of the entries of as our rows ,\,a-1>|)>>, the computation of > for ,2*\-1> thus essentially reduces to an \\\|)>> matrix product over >> and similarly for . We next compute <\equation*> C rem m\|)>*|)> rem m for ,2*\-1>. We finally recover from ,\,C rem m-1>> by applying the Chinese remainder theorem. Consider the map :\>>\\*\>;|)>2*\>\x> with x\M> and =r> for ,2*\-1>. We have <\equation*> x=>*u*r+\+-1>>*u-1>*r-1>, where > is the inverse of > modulo > for ,2*\-1>. After precomputing the >>adic expansions >*u=\+\+a-1>*2-1|)>*\>>, it is again possible to compute > using a vector-matrix product: <\eqnarray*> >|>|-1>>>>>>>|>|>|>|-1>>>>>>*>|>|-1>>>|>||>>|-1,0>>|>|-1,2*\-1>>>>>>>>|||+\+-1>*2-1|)>*\>|)> rem M.>>>> In a similar way as above, the recovery of from ,\,C rem m-1>> thus essentially boils down to an \|)>\|)>> matrix product over >>. In order to ease the computation of the remainder with respect to , it is sometimes preferred to first compute the remainders of *r> modulo > and then to work with the >>-adic expansions of > instead of *M/m>. In this way, the quotient is always bounded by>>. Each of the schemes that we described so far can be regarded as a way to reduce the multiplication of two matrices in *\>n>> to |)>> multiplications >*B>,\,A-1|]>>*B-1|]>>> of matrices in>n>>, through the use of an appropriate encoding: <\equation> |gr-frame|>|gr-geometry||gr-grid||gr-grid-old||1>|gr-edit-grid-aspect|||>|gr-edit-grid||gr-edit-grid-old||1>|gr-grid-aspect|||>|gr-grid-aspect-props|||>|gr-arrow-end|\|magnify|0.840896415||)>-1|]>>|>>|>|>>||)>-1|]>>|>>|>|>>|>>|>|>>|>|>>|>>|>|>>|>|>>||)>-1|]>>|>>|>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|||>>|>|>|>>>> The cost of the reduction may depend on >, but scales linearly with the size > of the matrices. Since the cost of |)>> multiplications of matrices in >n>> is proportional to |)>*n>, the cost of the reduction is negligible for large . For large matrices, it is therefore best to select the scheme for which |)>> is minimal. The encoding/decoding process may also induce a loss of precision, which should be kept minimal as well. Table summarizes the factors |)>> and the loss of precision for the methods from the previous sections. For our implementations, we only considered the naive method, Karatsuba's method, and Chinese remaindering. In certain cases, Toom's algorithm or FFT-based algorithms may also be of interest.|||||||||||||||||||||||||||||>>|>|>||||||||||||||>||||||||||||||>>>|>|||||||||||||> or *\>>>||||||||||||||>>>||||||||||||||>||||||||||||||>>||||||||||||||>>>||||||||||||||>>>||||||||||||||>||||||||||||||>||||||||||||||>>>||||||||||||||>>>>>>>> Overview of the factor |)>> as a function of > for various multiplication schemes, as well as the corresponding losses in bit precision. For CRTs, the bit loss is =1> for =52> or =8> and \8>, but =5> for =8> and =16>. The first four rows correspond to full multiplication, whereas the other rows correspond to high multiplication. The last three rows counts the number of IFMA operations that are required if low and high machine products need to be computed separately. For Karatsuba multiplication, we do not recurse, except for Karatsuba, which recurses once. > In this section, we consider the problem of multiplying \Psmall\Q integer matrices. Since AMX-acceleration is mainly interesting for quite large multiplications (of size at least 64\16>), we focus on SIMD-accelerated IFMA-based algorithms. Since we assume both and > to be \Psmall\Q, we will use naive algorithms for both matrix and polynomial multiplication. Here the qualifier \Pnaive\Q is meant to capture the family of all algorithms that perform *\> pairs of >/> integer FMAs in total. However, there are many such \Pnaive\Q algorithms and the precise order in which we do perform the integer FMAs matters, since this crucially influences how well we can keep coefficients in registers and how much pressure we put on registers. Instead of implementing various multiplication strategies directly in a language likeC++, our strategy is to rely on the JIL library for computations and JIT compilation of straight-line programs (SLPs): we implement a large amount of strategies and ways to combine strategies as symbolic programs. The JIL library provides highly efficient mechanisms to record the execution of these programs as SLPs and to JIT compile the resulting SLPs into highly efficient machine code. The recording plus compilation time is typically about 1000 times larger than a single execution of the generated code, which is one to three orders of magnitude faster than traditional compilers. Let us illustrate this with a short example. We recall that an SLP is just a sequence of arithmetic operations. The operations belong to a fixed signature like =,fma|}>> and we operate on a finite number of data fields that are all of the same type. This type could either be a scalar type like > or FP32, or aSIMD vector type like >. A finite number of the data fields are designated to be \Pinput\Q and \Poutput\Q fields. <\vgroup> Now given a generic function operating on data fields of the same time, but which may involve loops and function calls, JIL provides a way to record the trace of its execution as an SLP. In Figure, we illustrated the result of recording a C++ template for 2\2> matrix multiplication as an SLP. When implementing multiplication algorithms in this way, the focus is not on the design of efficient C++ code templates, but rather on the design of clean templates, whose recording produces good SLPs. This makes it easy and efficient to build new strategies on top of other ones, like a multiplication algorithm for polynomials in 2>> on top of a multiplication algorithm for matrices in 2>>. JIL also provides standard mechanisms for composition and tensor products of maps, reindexation, etc. We refer to for more information.||| <\cpp-code> |typename C\ void>>|>|2; k++)>>|2; i++)>>|2; j++)>>|>|>|>|>|>>>>> ||record>>>|<\cell> ||||,a,\,b,b|)>>||>|>|>|\b>>|>|>|\b>>|>|>|\b>>|>|>|\b>>|>|>|,b,c|)>>>|>|>|,b,c|)>>>|>|>|,b,c|)>>>|>|>|,b,c|)>>>|,c,c,c|)>>||>>>>> >>>>> Illustration of the recording process of a program as an SLP. For the purpose of readability, we have chosen nice names ,a,\,c> for the data fields of the SLP. Inside JIL, the data fields are rather referred to through indices ,11>. > <\vgroup> Let us first consider the problem of multiplying integer matrices in the pure SIMD mode, using IFMA-based arithmetic. This means that our multiplicands are n> matrices in|]>>|)>n>> with coefficients that are SIMD multiple precision integers in |]>>>, whose limbs are SIMD vectors in >. Two things matter for the design of efficient \Pnaive\Q algorithms. We first should decide whether we view our multiplicands as matrices in >|)>n>> with SIMD integer entries or as polynomials in |)>n>|]>>> with SIMD matrix \Plimbs\Q. The first point of view tends to be faster when > is large with respect to, whereas the second one tends to be faster when is large with respect to >. Secondly, divide and conquer algorithms tend to reduce register spilling. For instance, s\c>> matrix products can recursively be reduced to smaller matrix products by halving ,, or. However, recursing too far down may increase register pressure. We implemented an strategy that recurses until a certain threshold is reached (namely ). In the future, we plan to automatically generate many strategies of this kind and select the best one. After implementing the desired multiplication strategy as a in JIL, we record one evaluation of this map as an SLP, and JIT compile the SLP into machine code. <\remark> It is interesting to note that the SLP framework allows us to work with aspecific memory layout (say n> matrices with coefficients in >>), while applying an algorithm that works with another representation (like polynomials in > with coefficients in |)>n>>). Indeed, the alternative representation just corresponds to a permutation of coefficients in memory. JIL provides an abstraction (the type in ) that allows to automate this kind of re-interpretations (and which we indeed used for the implementation described in this section). Table shows the ratios of our timings for an \\\2*\> limb long n> matrix multiplication divided by the theoretical time to the theoretical time to perform *\>> integer FMAs over >> (in theory, our processor has a throughput of two integer FMAs per cycle). The left-hand table shows the ratios for the standard representation; the right-hand one concerns the alternative representation of our multiplicands as polynomials in the radix > with matrix coefficients. Interestingly, the observed ratios sometimes drops the theoretical peak performance. This is probably due to subtleties with cycle counters in recent CPUs, in which different cores may run at different clock speeds. It might also indicate that the processor can sometimes pipeline IFMA instructions a bit better than officially announced. We also recall that the generated codelet unrolls the entire computation. For =16>>, this means that the generated program performs > IFMAs. The codelet therefore takes more than 12Mb in memory and does not even fit in the L3 cache. This explains the dramatic increase in the ratios for large and/or >.||||||||||||||||||||||||||||||||||||||||||||>|)>n>>>|||||||||>|>|||||||||>>>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>>>>||||||||||||||||||||||||||||||||||||||||||||||)>n>|]>>>>|||||||||>|>|||||||||>>>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>>>>>>> Pure SIMD integer matrix multiplication using a naive algorithm and IFMA. We show the ratios with respect to the theoretical peak performance of two IFMAs per cycle. > Let us now turn to the computation of a single n\n> matrix product over >>. In Section, we already described a reduction to the computation of a n\> matrix product =*> over >>, whenever is divisible by . We may then use an algorithm in pure SIMD mode to compute this product. In fact the situation is even a bit better than that: instead of casting into a matrix \>|)>n>>, it is better to regard directly as an > SIMD matrix over >>, and to replicate individual coefficients of across lanes whenever needed. This more compact representation of has the advantage that we may typically keep wholly in registers, after which obtaining coefficients from does not involve memory accesses. If is not divisible by , then more work is required to benefit from SIMD : for an s\c> product such that is divisible by , we may write *w>, *w>, and *w> for some ,w,w> with *w*w=8>. We next redundantly encode \>s>>, \>c>>, and \>c>> by matrices \>|)>\>>, \>|)>\>>, \>|)>\>> by taking <\equation> |||)>+k|)>*w+j>>|>|*i+i,w*k+k>>>||)>+k|)>*w+j>>|>|*k+k,w*j+j>>>|>|)>+k|)>*w+j>>|>|*i+i,w*j+j>,>>>>> where i\>, j\>, k\> and i\w>, j\w>, k\w>. Every SIMD multiplication in >c>> then computes a \w\w> matrix product with coefficients in >>. If \1>, then some of the lanes of > need to be summed in order to retrieve . The standard SLP signature in JIL includes instructions for permutations of lanes for SIMD base types (mathematically speaking, really maps on lanes, because replication of lanes is also allowed). Consequently, the tinkered SIMD multiplication algorithms can still be recorded into SLPs and compiled within the JIL framework. As a consequence of Remark, we note in addition that any of the pure SIMD multiplication strategies can naturally be used in conjunction with any of the tinkered encoding strategies. Table shows timings that we obtained, with similar conventions as for Table. This time, we divided the total number of cycles by *\/8> instead of *\>.|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||>n>>>|||||||||>|>|||||||||>>>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>>>>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||n>|]>>>>|||||||||>|>|||||||||>>>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>>>>>>> Tinkered SIMD integer matrix multiplication using a naive algorithm and IFMA. We show the ratios with respect to the theoretical peak performance of two IFMAs per cycle. > One important application of integer matrix multiplication is linear algebra over fixed point numbers. In that case, we only need to compute the highest > limbs of the product, which saves about half of the computations. Table reports on the obtained timings. We recall that a shortcoming of our implementation is that it is currently limited to unsigned fixed point numbers.|||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||>>>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>||||||||||>>>>>> Tinkered SIMD n> matrix multiplication for > limb fixed point entries and using a naive algorithm and IFMA. We show the ratios with respect to the theoretical peak performance of two IFMAs per cycle. > The simplest way to accelerate a large matrix multiplication A*B> via JIL is to consider ,, and as block matrices. We use JIL for the compilation of efficient SLPs to multiply or multiply-accumulate blocks. The block matrices themselves are multiplied using conventional C++ code. The last possibly incomplete row and column are treated using zero-padding or recursive decomposition. The block sizes should be carefully chosen and may differ for , , and. By taking sufficiently large blocks, we intend to make the overhead of the C++ part of the code as small as possible, so that the total cost essentially boils down to multiplications of blocks. One goal of the above strategy is that JIL can focus on compiling SLPs of a fixed size and that we can use a higher level language like C++ for the rest, with minimal overhead. However, for practical block sizes, the overhead of the C++ code is small, but not necessarily negligible, which makes it desirable to further improve the interface between JIT compiled SLPs and their use within high level C++ programs. For the purposes of this paper, we therefore extended JIL with a new abstraction, which allows us to JIT compile slightly more complex programs than mere SLPs. For instance, an extremely common scenario is to execute the same SLP on a vector of successive input and output chunks. This happens for example when encoding all c>> entries of an integer matrix in >c>> into CRT format. JIL now contains a \Ploop promise\Q to cover this scenario. Our implementation first rewrites the SLP into a new one that can handle entries at a time in SIMD fashion (this requires transpositions in memory to work in the SIMD representation), or even a multiple of entries at a time via unrolling. In the future, we also plan to automatically determine and factor out loop invariants. Back to matrix multiplication, another common scenario is that the input of the SLP has first to be gathered from memory and/or that the output needs to be scattered to memory, as we explained in Section and(). For instance, Chinese remaindering essentially reduces the multiplication of two matrices in >n>> to > multiplications of matrices in n>>. Now it is more natural to CRT encode an input matrix in >n>> as avector of matrices in n>|)>>> rather than amatrix with vector entries in >|)>n>>. The latter case corresponds to the loop scenario that we described above, whereas the first one requires us to submit the output to an additional |)>\n> matrix transposition over>. In order to reduce unnecessary memory accesses, we preferred to introduce a new scenario that scatters entries in >> directly into the > desired matrices n>> while processing the CRTs. We have tested most of the strategies of Table for the multiplication of large matrices using integer FMAs. For the \PMultiply\Q step of(), we use the blocking technique that we described above, and also a tinkered SIMD as in Section for the multiplication of blocks. We report on relative timings with respect to the theoretical peak performance. Tables with absolute timings are given in Appendix. In Table, we consider long \\\2*\> limb multiplications of n> matrices for the naive and the Chinese remaindering (CRT) strategies. More precisely, we show the number of cycles that it takes to compute the matrix product, divided by |)>*n>, the theoretical number of cycles to perform all IFMA operations in the multiplication step of(). In Table, we show timings for the non-recursive variant of Karatsuba's algorithm, both in the case of long \\\2*\> limb multiplication and high \\\\> limb multiplication. The tables show that we are not far from the theoretical peak performance. The factors from Table can therefore guide the design of efficient algorithms when gets large. In particular, it is interesting to note that Karatsuba's algorithm is most efficient for high multiplication until five limbs. There is still quite some room for further improvements. Our implementation of the naive algorithm was done first and based on a suboptimal implementation of matrix transposition for the encoding and decoding steps; this explains the poor performance for small>. In general, with more work, we expect that it should be possible to reach factors that are very close to (at least for, say, 256> and \8>).||||||||||||||||||||||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>>>>> ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>>>>>> Relative timings for one long \\\2*\> limb n> matrix product using IFMA. The left-hand and right-hand sides respectively indicate the relative timings (with respect to theoretical peak performance) for the naive and the CRT strategies. >||||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>>>>> ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>>>>>> Relative timings for one long \\\2*\> or one high \\\\> limb n> matrix product (on the left-hand and right-hand sides, respectively) using IFMA and the non-recursive variant of Karatsuba's algorithm. > Let \s>>, \c>>, and \c>>, where 16>, 64>, and 16> are such that is amultiple of four. Then the AMX extension offers an instruction for computing the matrix FMA A*B> in 16 cycles if and somewhat faster if 64>. In practice, the total number of scalar FMAs per cycle is highest when using the maximal allowed dimensions , , and , which is the main configuration used in our . For =>, the matrices , , and need to be loaded from memory and stored in dedicated registers. There are eight available tile registers of 1Kb each and there are dedicated strided load and store operations for tiles. In addition, the second multiplicand is stored in a special VNNI format, which coincides neither with the row major nor column major formats. Since the 1Kb size of tiles is very different from the 64 byte size of AVX512 registers, AMX and AVX512 instructions do not fit well together into the SLP framework ofJIL. For this reason, we decided to start with a base C++ implementation of the (MFMA) A*B>> for arbitrary sizes , using Intel instrinsics. In the future, we plan to extend JIL with special \Ppromises\Q (supplementing the scenarios described in Section) for building code that gracefully mixes AMX and AVX512 instructions. For now, the core of our work on AMX relies on an efficient implementation of MFMA for 8\32> bit entries. On top of this, we implemented multiple precision matrix multiplication using various schemes as in(). The encoding and decoding stages are mostly done using AVX512 instructions and our main concern will be to keep the cost of these stages reasonable. This is challenging due to the fact that the throughput of AVX512 instructions is roughly sixteen times lower than the throughout of AMX instructions (for eight bit entries, we do 64\16> FMAs per 16 cycles with AMX and 64 operations per cycle with AVX512). We again heavily rely on SLPs from JIL and a similar extra layer as described in Section forloops. Our 8-bit MFMA kernel follows a relatively standard design informed by the classic paper by Goto and vandeGeij; see also5.1> and the recommendations of the Intel Optimization Reference Manual20>. Like many implementations of general matrix multiplication on AMX units, it is based on a microkernel that multiplies a 1> tile block \\64>>> from with a 2> tile block \\32>> from and accumulates the result in a 2> tile block\\32>> of, using all eight tile registers. Assuming that this microkernel is run repeatedly with the four accumulation tiles fixed, it requires about four loads and four tile FMAs per iteration. While, in ideal circumstances, the architecture we target can sustain up to two tile loads from L1cache (and one store) per tile FMA, the 2> pattern makes it easier to hide memory movements behind computations even when not all loads hit L1cache. The typical scenario we aim for is for> to reside in L1cache while >resides in L2cache, as it is still possible in that case to fully overlap the loads with the FMAs in the steady state. We access the matrix using non-temporal load instructions so as not to needlessly evict data from from L1cache. In addition, we use the strided load instructions mentioned above to read the 16rows making up a tile of from their original, typically non-contiguous memory locations without first packing them together, and similarly for writing tiles back to. Our experiments suggest that, for data aligned on a cache line and in the absence of cache conflicts, strided loads and stores are not significantly slower than contiguous ones (see however for more on the limits of this strategy). We deal with odd numbers of tiles in either dimension using a variant of the same microkernel, and with incomplete - or tiles using a temporary buffer and AVX512 masked loads and stores. The microkernel is wrapped in loops over (from outermost to innermost) that multiply a series of t> tile blocks of each fitting in L1cache by a single block of that fits in L2cache. On a single core, we can typically process blocks of about 16\64> tiles in this fashion. The resulting macrokernel is itself wrapped in a cache-blocking kernel, using the same loop order, that repacks roughly L2-sized blocks of into contiguous VNNI-encoded full tiles (padded with zeros if necessary) before processing them using the macrokernel. The encoder operates on blocks of up to 32> entries of, corresponding to two adjacent tiles, using AVX512 instructions. Slightly surprisingly, we found this block format to perform better in our context than the alternative of64> entries, even though the latter makes it possible to load full cache lines (rows of the 64> block, to be split among four tiles) at once. We speculate that this may be because the L1cache can serve up to three 256-byte loads but only two 512-byte loads per cycle. We also implemented some microkernels dedicated to the special cases where one of the dimensions of the product is less than or equal to one tile. <\remark> Endo, Ohshima and Nanri recently proposed an alternative microkernel that reserves six tile registers for accumulating a 3> block of and uses the remaining two to hold the current tiles of and. The tiles of are loaded once per block and those of are loaded twice. While this results in an fma/load ratio of only , the idea is that two of the loads are virtually guaranteed to hit L1 cache, whereas, depending on the blocking strategy, all four of the tiles accessed by the 2> kernel may have been evicted by the time they are reused. While Endo report reaching better performance with their 3> kernel than with the standard one, in our experiments, it performed worse except for . Endo's code does not appear to be publicly available, but a plausible explanation would be that the use of non-temporal tile loads makes their technique unnecessary. The most natural approach to build an algorithm for >-limb matrix multiplication on top of the work from the previous subsection is to rewrite n> input matrices as elements inn>|]>>>. Then a long multiplication boils down to > matrix FMAs and a short (low or high) multiplication to *+1|)>/2> matrix FMAs. Table shows the factors between the performance of our implementation and the theoretical peak performance. If one is satisfied with a factor of three or less, then our current implementation will do. (In Table of the appendix, we can see that the naive AMX-based implementation slightly outperforms our IFMA-based one.) But it seems significantly more difficult to approach the theoretical peak performance as well as we did for IFMA-based methods. The variance of the timings obtained with AMX is quite large, with frequent outliers that can significantly alter mean timings. For this reason, our tables show the median instead of the mean timings over a large number of runs. The cost of encoding and decoding using AVX512 is far from negligible. It seems hard to entirely get rid of this cost, since the algorithms from the previous subsection benefit from the fact that most of the matrices have a natural memory layout. Nonetheless, by interleaving the AMX instructions appropriately with the encoding and decoding, we expect that the factors can be reduced significantly. As increases, the cost of decoding and encoding should become smaller with respect to the cost of the inner matrix multiplications. This is indeed what we observe for small values *\> (say *\\2>). However, for larger values of *\>, we start to suffer from cache misses. In the future, we plan to reduce these misses through a more careful interleaving of memory loads and stores with the actual computations. But again, part of the slow-down is probably unavoidable. ||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>||||||||||||>>>>> <|big-table> Factors with respect to theoretical peak performance for multiplying two n> matrices with bit positive integer coefficients. The columns Np corresponds to a p\p> bit low product, whereas the columns Lp corresponds to a p\2*p> bit long product. The column AMX corresponds to the raw 8\32> bit product from the Section. > The implementation of the Chinese remaindering approach from section for AMX-accelerated kernels is work in progress. We are currently working on an implementation in which the transforms() and() are also done using AMX instructions, by replacing the left-hand rows by huge matrices with one row per object to be transformed. Unfortunately, this means that two of the three matrices are far from being square, and AMX instructions are less efficient for such matrices. In addition, for various interesting bit-sizes like 64\128> or 128\256>, we cannot plainly profit from 64\16> multiplications and we have to resort to smaller configurations, such as 16\16> or 32\16>>. For larger bit-sizes, we also run out of 8 bit primes. This forces us to resort to 16 bit primes, for which we pay a factor two (or to 14 bit primes with Karatsuba multiplication, which still leads to an overhead of >). In addition to the matrix multiplications, one also needs to do several modular reductions, overlapping sums, and permutations in memory using AVX512 instructions. Altogether, this makes the algorithms time consuming and tricky to implement. For now, we only completed the implementation of the direct transforms. Table contains some absolute timings. Comparing with Table, we observe that the required CRT for a 1024> matrix product over > is about 16 times faster than the current naive product. Since we need to do two direct CRTs and one inverse CRT (typically of doubled cost) and four times less modular matrix multiplications, we expect to gain a factor two in this case. The threshold for CRTs is thus around 1024>.||||||||||||>|16>>|16>>|32>>|32>>|\N>>>||\s>|\s>|\s>|\s>|>||\s>|\s>|\s>|\s>|>||\s>|\s>|ms>|ms>|>||ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|>|||ms>|||>>>>>> Absolute timing for > direct CRT on >-bit numbers and moduli, using the reduction to an \\\N> matrix product. > <\bibliography|bib|tm-plain|> <\bib-list|67> A.Abdelfattah, J.Dongarra, M.Fasi, M.MikaitisF.Tisseur. Analysis of floating-point matrix multiplication computed via integer arithmetic. 2026. A.Ahlbäck, J.vander HoevenG.Lecerf. JIL: a high performance library for straight-line programs. , 2025. Automatically Tuned Linear Algebra Software (ATLAS). . J.Berthomieu, S.Graillat, D.LesnoffT.Mary. Multiword matrix multiplication over large finite fields in floating-point arithmetic. , 2026. D.BiniV.Y.Pan. . Birkhäuser Boston Inc., Boston, MA, 1994. Fundamental algorithms. BLAS (Basic Linear Algebra Subprograms). . BLAS-like library instantiation software framework. . J.Bradbury, N.DruckerM.Hillenbrand. NTT software optimization using an extended Harvey butterfly. , 2021. . R.P.BrentP.Zimmermann. . Cambridge University Press, 2010. P.Cawley. Apple AMX instruction set. , 2024. J.W.CooleyJ.W.Tukey. An algorithm for the machine calculation of complex Fourier series. , 19:297\U301, 1965. W.da Silva Pereira. Accelerating floating-point computations with Intel AMX. NREL/TP-2C00-93622, National Renewable Energy Laboratory, 2025. J.Doliskani, P.Giorgi, R.LebretonÉ.Schost. Simultaneous conversions with the Residue Number System using linear algebra. , 44(3), 2018. Article27. J.J.Dongarra, J.Du Croz, S.HammarlingI.Duff. A set of level 3 basic linear algebra subprograms. , 16(1):1\U17, March 1990. J.J.Dongarra, J.Du Croz, S.HammarlingR.J.Hanson. An extended set of FORTRAN basic linear algebra subprograms. , 14(1):1\U17, March 1988. J.-G.Dumas, T.Gautier, M.Giesbrecht, P.Giorgi, B.Hovinen, E.Kaltofen, B.D.Saunders, W.J.TurnerG.Villard. Linbox: a generic library for exact linear algebra. , 40\U50. Beijing, China, 2002. J.-G.Dumas, P.GiorgiC.Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. T.EdamatsuD.Takahashi. Accelerating large integer multiplication using intel AVX-512IFMA. , 60\U74. Springer-Verlag, 2019. Y.Endo, S.OhshimaT.Nanri. Optimization of a GEMM implementation using Intel AMX. , SCA/HPCAsia '26, 81\U90. ACM, 2026. D.Filho, G.BrandãoJ.López. Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs. , 0, 2024. P.Fortin, A.Fleury, F.LemaireM.Monagan. High performance SIMD modular arithmetic for polynomial evaluation. , 2020. M.Frigo. A fast Fourier transform compiler. , PLDI'99, 169\U180. New York, NY, USA, 1999. ACM. M.FrigoS.G.Johnson. The design and implementation of FFTW3. , 93(2):216\U231, 2005. J.vonzur GathenJ.Gerhard. . Cambridge University Press, New York, NY, USA, 3rd, 2013. E.Georganas, K.Banerjee, D.Kalamkar, S.Avancha, A.Venkat, M.Anderson, G.Henry, H.PabstA.Heinecke. High-performance deep learning via a single building block. 2019. G.Gerganov etal. Ggml. , 2026. K.GotoR.A.van de Geijn. Anatomy of high-performance matrix multiplication. , 34(3), 2008. T.Granlund etal. GMP, the GNU multiple precision arithmetic library. , 1991. S.GueronV.Krasnov. Accelerating big integer arithmetic using intel IFMA extensions. , 32\U38. 2016. J.A.Gunnels, G.M.HenryR.A.van de Geijn. A family of high-performance matrix multiplication algorithms. , 51\U60. Berlin, Heidelberg, 2001. Springer. G.Hanrot, V.Lefèvre, K.RydeP.Zimmermann. MPFR, a C library for multiple-precision floating-point computations with exact rounding. , 2000. William Hart etal. FLINT: fast library for number theory. , 2010. D.HarveyJ.vander Hoeven. On the complexity of integer matrix multiplication. , 89:1\U8, 2018. D.HarveyJ.vander Hoeven. Integer multiplication in time >. , 193(2):563\U617, 2021. D.HarveyA.V.Sutherland. Computing Hasse\UWitt matrices of hyperelliptic curves in average polynomial time. XI)>, 17, 257\U273. Cambridge University Press, 2014. J.vander Hoeven. Relax, but don't be too lazy. , 34:479\U542, 2002. J.vander Hoeven. . Scypress, 2020. J.vander HoevenG.Lecerf. Faster FFTs in medium precision. , 75\U82. June 2015. J.vander HoevenG.Lecerf. Implementing number theoretic transforms. , HAL, 2024. , accepted for publication in AAECC. J.vander HoevenG.Lecerf. Towards a library for straight-line programs. , 2026. . J.vander Hoeven, G.LecerfG.Quintin. Modular SIMD arithmetic in Mathemagix. , 43(1):5\U1, 2016. Intel Corporation. Intel\ 64 and IA-32 architectures optimization reference manual: volume 1. 2024. A.KaratsubaJ.Ofman. Multiplication of multidigit numbers on automata. , 7:595\U596, 1963. G.H.Khachatrian, M.K.Kuregian, K.R.IspiryanJ.L.Massey. Fast multiplication of integers for public-key applications. S.VaudenayA.M.Youssef, , 245\U254. Berlin, Heidelberg, 2001. Springer. D.Khaldi, Y.Luo, B.Yu, A.Sotkin, B.MoraisM.Girkar. Extending LLVM IR for DPC++ matrix support: a case study with Intel\ advanced matrix extensions (Intel\ AMX). , 20\U26. 2021. B.Kuzma, I.Korostelev, J.P.L.de Carvalho, J.E.Moreira, C.Barton, G.AraujoJ.N.Amaral. Fast matrix multiplication via compiler-only layered data reorganization and intrinsic lowering. , 53(9):1793\U1814, 2023. Linear Algebra PACKage. . Ch.L.Lawson, R.J.Hanson, D.R.KincaidF.T.Krogh. Basic linear algebra subprograms for fortran usage. , 5(3):308\U323, 1979. H.-J.Lebbink. Hjlebbink/amx-matmul. 2024. . T.Li. Usstq/mm_amx. , 2024. T.M.Low, F.D.Igual, T.M.SmithE.S.Quintana-Orti. Analytical modeling is enough for high-performance BLIS. , 43(2):12\U1, 2016. H.Martínez, A.Castelló, F.D.IgualE.S.Quintana-Ortí. The Cambrian explosion of mixed-precision matrix multiplication for quantized deep learning inference. , 108231, 2025. C.S.Mummidi, V.C.Ferreira, S.SrinivasanS.Kundu. Highly efficient self-checking matrix multiplication on tiled AMX accelerators. , 21(2):1\U22, 2024. Hiroyuki Ootomo, Katsuhisa OzakiRio Yokota. Dgemm on integer matrix multiplication unit. , 38(4):297\U313, 2024. D.N.Parikh, R.A.van de GeijnG.M.Henry. Cascading gemm: high precision from low precision. 2023. . M.Püschel, J.M.F.Moura, J.Johnson, D.Padua, M.Veloso, B.Singer, J.Xiong, F.Franchetti, A.Gacic, Y.Voronenko, K.Chen, R.W.JohnsonN.Rizzolo. SPIRAL: code generation for DSP transforms. , 93(2):232\U275, 2005. Special issue on \PProgram Generation, Optimization, and Adaptation\Q. A.SchönhageV.Strassen. Schnelle Multiplikation groÿer Zahlen. , 7:281\U292, 1971. Michael Scott. Missing a trick: Karatsuba variations. , 10(1):5\U15, 2018. V.Shoup. NTL: a library for doing number theory. 1996. . V.Strassen. Gaussian elimination is not optimal. , 13:352\U356, 1969. The OpenBLAS developers. Openblas. , feb 2026. A.L.Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. , 4(2):714\U716, 1963. Y.Uchino, K.OzakiT.Imamura. High-performance and power-efficient emulation of matrix multiplication using INT8 matrix engines. , 1824\U1831. ACM, 2025. F.G.Van ZeeR.A.van de Geijn. BLIS: a framework for rapidly instantiating BLAS functionality. , 41(3):14\U1, 2015. Q.Wang, X.Zhang, Y.ZhangQ.Yi. AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs. , SC '13, 1\U12. New York, NY, USA, 2013. ACM. R.C.Whaley, A.PetitetJ.J.Dongarra. Automated empirical optimizations of software and the ATLAS project. , 27(1\U2):3\U35, 2001. V.V.Williams, Y.Xu, Z.XuR.Zhou. New bounds for matrix multiplication: from alpha to omega. , 3792\U3835. SIAM, 2024. As motivated in the introduction, all timings that we reported so far are relative to the theoretical peak performance. For completeness, this appendix reports on the corresponding absolute timings. We also added absolute timings for existing software. Tables to present some timings obtained with , NTL, and , three well-known algebraic computation libraries. and are configured to link with . (We thank Pascal Giorgi for his help with .) Since modular matrix multiplication has often benefited from more optimization effort than plain integer matrix multiplication, we include both matrices with entries in> and in /2*\>. <\big-table> ||||||||||||||||||||||||>|||||||||||||>>||\s>|\s>|\s>|\s>|\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>||||>||ms>|ms>|ms>|ms>|ms>|ms>|ms>||||||>||ms>|ms>|ms>|ms>|ms>|ms>|||||||>||ms>|ms>|ms>||||||||||>||||||||||||||>||||||||||||||>||||||||||||||>>>>> \; <|big-table> Absolute timings for multiplication of two matrices in n>> using the function from . <\big-table> ||||||||||||||||||||||||>||||||||||||>>||\s>|\s>|\s>|\s>|\s>|ms>|ms>|ms>|ms>|ms>|ms>|>||\s>|\s>|\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>||>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>||||>||ms>|ms>|ms>|ms>|ms>|||||||>||ms>|ms>|ms>|ms>||||||||>||ms>|ms>||||||||||>||ms>|||||||||||>|||||||||||||>>>>> <|big-table> Absolute timings for multiplication of two matrices in /2*\|)>n>> using the (for 64>) or (otherwise) function from . <\big-table> ||||||||||||||||||||||||>|||||||||||||>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>| 64 |ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>||>||ms>|ms>|ms>|ms>|ms>|ms>||ms>|||||>||||||||||||||>||||||||||||||>||||||||||||||>>>> <|big-table> Absolute timings for multiplication of two matrices in n>> represented as NTL objects of type . <\big-table> ||||||||||||||||||||||||>|||||||||||||>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>| 64 |ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>||>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|||>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|||||>||ms>|ms>|ms>|ms>|ms>||||||||>||||||||||||||>||||||||||||||>||||||||||||||>>>> <|big-table> Absolute timings for multiplication of two matrices in /2*\|)>n>> represented as NTL objects of type . <\big-table> ||||||||||||||>||||>>||\s>|\s>|\s>|>| 64 |\s>|ms>|ms>|>||\s>|ms>|ms>|>||ms>|ms>|ms>|>||ms>|ms>|ms>|>||ms>|ms>|ms>|>||ms>|ms>|ms>|>||ms>|ms>|ms>|>||ms>|||>||ms>|||>|||||>>>> <|big-table> Absolute timings for multiplication of two matrices in /2*\|)>n>> represented as NTL objects of type (word-sized). <\big-table> |||||||||||||||||||||||||||||>|||||||||||||>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>||||>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|||||>||ms>|ms>|ms>|ms>|||||||||>||ms>|ms>|||||||||||>||||||||||||||>||||||||||||||>||||||||||||||>>>>> <|big-table> Absolute timings for multiplication of two matrices n>> using over Givaro::Integer\>. Tables, , and are the absolute counterparts of Tables, , and from Section. Note that Table corresponds to the left-hand of Table (for multiplicands in >|)>n>>), whereas Table corresponds to the right-hand of Table (for multiplicands that use the alternative representation inn>|]>>>). <\big-table||||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||>>>||ns>|ns>|ns>|ns>|ns>|ns>|ns>|\s>|>||ns>|ns>|ns>|ns>|\s>|\s>|\s>|\s>|>||ns>|ns>|\s>|\s>|\s>|\s>|\s>|\s>|>||ns>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|>||ns>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|>||\s>|\s>|\s>|\s>|\s>|\s>|ms>|ms>|>||\s>|\s>|\s>|\s>|\s>|ms>|ms>|ms>|>||\s>|\s>|\s>|ms>|ms>|ms>|ms>|ms>|>>>>>> Absolute timings for pure SIMD multiplication of two matrices in >|)>n>>. <\big-table||||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||>>>||ns>|ns>|ns>|ns>|ns>|ns>|ns>|\s>|>||ns>|ns>|ns>|ns>|\s>|\s>|\s>|\s>|>||ns>|ns>|\s>|\s>|\s>|\s>|\s>|\s>|>||ns>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|>||\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|>||\s>|\s>|\s>|\s>|\s>|\s>|\s>|ms>|>||\s>|\s>|\s>|\s>|\s>|ms>|ms>|ms>|>||\s>|\s>|\s>|ms>|ms>|ms>|ms>|ms>|>>>>>> Absolute timings for tinkered SIMD multiplication of two matrices in n>|]>>>. <\big-table||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||>|||||||||>>>|||ns>||ns>|ns>|ns>|ns>|ns>|>||ns>|ns>|ns>|ns>|ns>|\s>|\s>|\s>|>|||ns>||\s>|\s>|\s>|\s>|\s>|>||ns>|ns>|\s>|\s>|\s>|\s>|\s>|\s>|>||ns>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|>||\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|>||\s>|\s>|\s>|\s>|\s>|\s>|ms>|ms>|>||\s>|\s>|\s>|\s>|\s>|ms>|ms>|ms>|>>>>>> Absolute timings for high multiplication of matrices in >|)>n>> in \ tinkered SIMD mode. Tables and present the absolute counterparts of the timings in Table from Section (the left-hand and right-hand tables respectively). Similarly, Tables and provide absolute counterparts for the timings in Table. \; <\big-table|||||||||||||||||||||||>||||||||>>>||\s>|\s>|\s>|\s>|ms>|ms>|ms>|ms>>||\s>|\s>|ms>|ms>|ms>|ms>|ms>|ms>>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>||>||ms>|ms>|ms>|ms>||||>||ms>|ms>|ms>|||||>||ms>|||||||>||ms>|||||||>|||||||||>|||||||||>>>>> \ > Absolute timings for a naive long \\\2*\> limb n> matrix product using IFMA. <\big-table|||||||||||||||||||||||||||||||||||||||||||||||||||||>||||||||>>>||\s>|\s>|\s>|\s>|\s>|ms>|ms>|ms>>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>||>||ms>|ms>|ms>|ms>||||>||ms>|||||||>|||||||||>|||||||||>|||||||||>>>>> > Absolute timings for a long \\\2*\> limb n> matrix product using IFMA and CRT. <\big-table|||||||||||||||||||||||||||||||||>||||||||>>>||\s>|\s>|\s>|\s>|\s>|ms>|ms>|ms>>||\s>|\s>|ms>|ms>|ms>|ms>|ms>|ms>>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|ms>||>||ms>|ms>|ms>|ms>||||>||ms>|ms>|ms>|||||>||ms>|||||||>||ms>|||||||>|||||||||>|||||||||>>>>> > Absolute timings for a long \\\2*\> limb n> matrix product using IFMA and the non-recursive variant of Karatsuba's algorithm. <\big-table|||||||||||||||||||||||||||||||||||||||||||>||||||||>>>||\s>|\s>|\s>|\s>|\s>|\s>|ms>|ms>>||\s>|\s>|\s>|ms>|ms>|ms>|ms>|ms>>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|>||ms>|ms>|ms>|ms>|ms>|||>||ms>|ms>|ms>|ms>||||>||ms>|ms>||||||>||ms>|||||||>|||||||||>|||||||||>>>>>> Absolute timings for a high \\\\> limb n> matrix product using IFMA and the non-recursive variant of Karatsuba's algorithm. Table shows absolute timings for naive AMX-accelerated matrix multiplication at various precisions; the corresponding relative timings were given in Table. <\big-table|<\small> ||||||||||||||||||||||||||||||||||||||||>|||||||||||>||\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>>||\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|\s>|ms>>||\s>|\s>|\s>|\s>|\s>|ms>|\s>|\s>|\s>|ms>|ms>>||\s>|\s>|\s>|\s>|ms>|ms>|\s>|\s>|ms>|ms>|ms>>||\s>|\s>|ms>|ms>|ms>|ms>|\s>|ms>|ms>|ms>|ms>>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||\s>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>|ms>>||ms>|ms>|ms>|ms>|ms>||ms>|ms>|ms>||>||ms>|ms>|ms>||||ms>||||>||ms>||||||||||>||ms>||||||||||>>>> \; > Time to multiply two n> matrices with bit positive integer coefficients using naive AMX-accelerated arithmetic. The columns Np corresponds to a p\p> bit low product, whereas the columns Lp corresponds to a p\2*p> bit long product. The column AMX corresponds to the raw 8\32> bit product from Section. <\initial> <\collection> <\attachments> <\collection> <\associate|bib-bibliography> <\db-entry|+byKiFsUoTc7rBN|book|TeXmacs:vdH:book> <|db-entry> > <\db-entry|+qYhUkmR1lNMNuvD|book|BiPa94> <|db-entry> V. Y. > <\db-entry|+UCHwtGN1sHSR59v|misc|JIL> <|db-entry> J. van der G. > > <\db-entry|+1ne7GKoD1upkFPyj|article|vdH:slp> <|db-entry> G. > > <\db-entry|+qYhUkmR1lNMNv2o|article|Kar63> <|db-entry> J. > <\db-entry|+qYhUkmR1lNMNvAe|article|Toom63b> <|db-entry> > <\db-entry|+qYhUkmR1lNMNuzt|book|GaGe13> <|db-entry> J. > <\db-entry|+33fX45UG6TF|article|CT65> <|db-entry> J. W. > <\db-entry|+qYhUkmR1lNMNv9s|article|SS71> <|db-entry> V. > <\db-entry|+Oyo7B0N11pDKjsi|article|vdH:nlogn> <|db-entry> J. van der > >> <\db-entry|+qYhUkmR1lNMNv0Y|misc|GMP> <|db-entry> > > <\db-entry|+qYhUkmR1lNMNv60|misc|MPFR> <|db-entry> V. K. P. > > <\db-entry|+haC4GcdToVz5G7|inproceedings|vdH:quad> <|db-entry> G. > <\db-entry|+CZZmVFYybXmQ9o|inproceedings|GK16> <|db-entry> V. > <\db-entry|+CZZmVFYybXmQ9n|inproceedings|ET19> <|db-entry> D. > <\db-entry|+33fX45UG6T8|article|BDH21> <|db-entry> N. M. > > <\db-entry|+2bGzr1Fn1TuwbdmZ|techreport|vdH:ntt> <|db-entry> G. > , accepted for publication in AAECC> <\db-entry|+2bGzr1Fn1Tuwbdn2|inproceedings|WXXZ24> <|db-entry> Y. Z. R. > <\db-entry|+qYhUkmR1lNMNvA8|article|Str69> <|db-entry> > <\db-entry|+2bGzr1Fn1Tuwbdn3|techreport|daSilvaPereira2025> <|db-entry> > <\db-entry|+2bGzr1Fn1Tuwbdmd|inproceedings|EON26> <|db-entry> S. T. > <\db-entry|+2bGzr1Fn1Tuwbdmv|misc|Li2024> <|db-entry> > > <\db-entry|+2bGzr1Fn1Tuwbdmu|misc|Lebbink2024> <|db-entry> > > <\db-entry|+2bGzr1Fn1Tuwbdmj|misc|OpenBLAS> <|db-entry> > > > <\db-entry|+2bGzr1Fn1Tuwbdn4|misc|GGML26> <|db-entry> > > <\db-entry|+2INVvwRj1i4eBMh6|misc|Cawley24> <|db-entry> > > <\db-entry|+2bGzr1Fn1Tuwbdme|article|FBL24> <|db-entry> G. J. > <\db-entry|+2bGzr1Fn1Tuwbdmo|article|MCIQO25> <|db-entry> A. F. D. E. S. > <\db-entry|+2bGzr1Fn1TuwbdnB|misc|IntelOptimizationManual2024> <|db-entry> > 64 and IA-32 architectures optimization reference manual: volume 1> > <\db-entry|+2bGzr1Fn1Tuwbdmp|article|OOY24> <|db-entry> Katsuhisa Rio > <\db-entry|+2bGzr1Fn1Tuwbdn8|article|MFS24> <|db-entry> V. C. S. S. > <\db-entry|+2bGzr1Fn1Tuwbdmx|inproceedings|UOI25> <|db-entry> K. T. > <\db-entry|+2bGzr1Fn1Tuwbdms|misc|ADFMT26> <|db-entry> J. M. M. F. > <\db-entry|+2bGzr1Fn1Tuwbdmf|misc|GBKAVAHPH19> <|db-entry> K. D. S. A. M. G. H. A. > > <\db-entry|+2INVvwRj1i4eBMh5|misc|BLAS> <|db-entry> > <\db-entry|+2INVvwRj1i4eBMh9|misc|LAPACK> <|db-entry> > <\db-entry|+2bGzr1Fn1TuwbdnL|misc|ATLAS> <|db-entry> > <\db-entry|+2bGzr1Fn1TuwbdnD|misc|BLIS> <|db-entry> > <\db-entry|+2bGzr1Fn1TuwbdnK|article|LHKK79> <|db-entry> R. J. D. R. F. T. > <\db-entry|+2bGzr1Fn1TuwbdnE|article|DDHH88> <|db-entry> J. Du S. R. J. > <\db-entry|+2bGzr1Fn1TuwbdnF|article|DDHD90> <|db-entry> J. Du S. I. > <\db-entry|+2bGzr1Fn1Tuwbdmh|inproceedings|GHG01> <|db-entry> G. M. R. A. > <\db-entry|+2bGzr1Fn1Tuwbdma|article|WPD01> <|db-entry> A. J. J. > <\db-entry|+2bGzr1Fn1Tuwbdmg|article|GG08> <|db-entry> R. A. > <\db-entry|+2bGzr1Fn1Tuwbdmz|article|ZG15> <|db-entry> R. A. > <\db-entry|+2bGzr1Fn1Tuwbdmn|article|LISQO16> <|db-entry> F. D. T. M. E. S. > <\db-entry|+qYhUkmR1lNMNv4Y|inproceedings|LINBOX02> <|db-entry> T. M. P. B. E. B. D. W. J. G. > <\db-entry|+2bGzr1Fn1Tuwbdmc|article|DGLS18> <|db-entry> P. R. É. > 27> <\db-entry|+2bGzr1Fn1TuwbdnI|inproceedings|HS14> <|db-entry> A. V. > XI)> <\db-entry|+qYhUkmR1lNMNvES|article|vdH:zmatmult> <|db-entry> J. van der > <\db-entry|+2INVvwRj1i4eBMh8|misc|Flint> <|db-entry> > > <\db-entry|+qYhUkmR1lNMNvEL|article|vdH:simd> <|db-entry> G. G. > <\db-entry|+qYhUkmR1lNMNuzA|article|FFLM20> <|db-entry> A. F. M. > <\db-entry|+2bGzr1Fn1Tuwbdmb|misc|BGLM26> <|db-entry> S. D. T. > > > <\db-entry|+2bGzr1Fn1Tuwbdmq|misc|PGH23> <|db-entry> R. A. G. M. > > > <\db-entry|+33fX45UG6TJ|inproceedings|Fri99> <|db-entry> > '99> <\db-entry|+1QvsQXLs1XLbdH0D|article|FJ05> <|db-entry> S.G. > <\db-entry|+33fX45UG6TE|article|Pue05> <|db-entry> J. M. F. J. D. M. B. J. F. A. Y. K. R. W. N. > <\db-entry|+2bGzr1Fn1Tuwbdmy|inproceedings|WZZY13> <|db-entry> X. Y. Q. > <\db-entry|+2bGzr1Fn1Tuwbdmi|inproceedings|KLYSMG21> <|db-entry> Y. B. A. B. M. > advanced matrix extensions (Intel\ AMX)> <\db-entry|+2bGzr1Fn1Tuwbdml|article|KKCMBAA23> <|db-entry> I. J. P. L. J. E. C. G. J. N. > <\db-entry|+UxLEI9PJlFl3Ur|article|vdH:relax> <|db-entry> > <\db-entry|+2bGzr1Fn1Tuwbdmm|inproceedings|KKIM01> <|db-entry> M. K. K. R. J. L. > A. M. > <\db-entry|+qYhUkmR1lNMNuwQ|book|BZ10> <|db-entry> P. > <\db-entry|+2bGzr1Fn1Tuwbdmr|article|Scott2018> <|db-entry> > <\db-entry|+tw6KmG8XaVvRP|misc|NTL> <|db-entry> > > <\db-entry|+2INVvwRj1i4eBMh7|article|DGP08> <|db-entry> P. C. > <\references> <\collection> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > <\auxiliary> <\collection> <\associate|bib> TeXmacs:vdH:book BiPa94 JIL vdH:slp Kar63 Toom63b GaGe13 CT65 SS71 vdH:nlogn GMP MPFR vdH:quad GK16 ET19 BDH21 vdH:ntt WXXZ24 Str69 daSilvaPereira2025 EON26 Li2024 Lebbink2024 OpenBLAS GGML26 Cawley24 FBL24 MCIQO25 IntelOptimizationManual2024 OOY24 MFS24 UOI25 ADFMT26 GBKAVAHPH19 BLAS LAPACK ATLAS BLIS OpenBLAS LHKK79 DDHH88 DDHD90 GHG01 WPD01 GG08 ZG15 LISQO16 LINBOX02 DGLS18 HS14 vdH:zmatmult Flint vdH:simd FFLM20 BGLM26 UOI25 PGH23 Fri99 FJ05 Pue05 WZZY13 KLYSMG21 KKCMBAA23 GMP Kar63 Toom63b vdH:relax KKIM01 BZ10 Scott2018 DGLS18 DGLS18 JIL vdH:slp vdH:slp GG08 ZG15 IntelOptimizationManual2024 Li2024 EON26 Flint NTL DGP08 <\associate|figure> |1>|> Illustration of a full multiple precision integer product (on the left) and a truncated high product (on the right) using integer FMAs (the lower yellow triangles and the upper orange triangles correspond to ||par-mode|||font-shape||ifmalo>>> and ||par-mode|||font-shape||ifmahi>>>, respectively). |> |2>|> Illustration of the recording process of a program as an SLP. For the purpose of readability, we have chosen nice names |a,a,\,c> for the data fields of the SLP. Inside JIL, the data fields are rather referred to through indices |0,1,\,11>. |> <\associate|table> |1>|> Overview of the factor ||par-mode|||font-shape||E>>|)>> as a function of |\> for various multiplication schemes, as well as the corresponding losses in bit precision. For CRTs, the bit loss is |\=1> for |\=52> or |\=8> and |\\8>, but |\=5> for |\=8> and |\=16>. The first four rows correspond to full multiplication, whereas the other rows correspond to high multiplication. The last three rows counts the number of IFMA operations that are required if low and high machine products need to be computed separately. For Karatsuba multiplication, we do not recurse, except for Karatsuba, which recurses once. |> |2>|> Pure SIMD integer matrix multiplication using a naive algorithm and IFMA. We show the ratios with respect to the theoretical peak performance of two IFMAs per cycle. |> |3>|> Tinkered SIMD integer matrix multiplication using a naive algorithm and IFMA. We show the ratios with respect to the theoretical peak performance of two IFMAs per cycle. |> |4>|> Tinkered SIMD |n\n> matrix multiplication for |\> limb fixed point entries and using a naive algorithm and IFMA. We show the ratios with respect to the theoretical peak performance of two IFMAs per cycle. |> |5>|> Relative timings for one long |\\\\2*\> limb |n\n> matrix product using IFMA. The left-hand and right-hand sides respectively indicate the relative timings (with respect to theoretical peak performance) for the naive and the CRT strategies. |> |6>|> Relative timings for one long |\\\\2*\> or one high |\\\\\> limb |n\n> matrix product (on the left-hand and right-hand sides, respectively) using IFMA and the non-recursive variant of Karatsuba's algorithm. |> |7>|> Factors with respect to theoretical peak performance for multiplying two |n\n> matrices with |p> bit positive integer coefficients. The columns Np corresponds to a |p\p\p> bit low product, whereas the columns Lp corresponds to a |p\p\2*p> bit long product. The column AMX corresponds to the raw |8\8\32> bit product from the Section |->|-0.3em|>|0em||0em|>>. |> |8>|> Absolute timing for |n> direct CRT on |8*\>-bit numbers and |N> moduli, using the reduction to an |n\\\N> matrix product. |> |9>|> Absolute timings for multiplication of two matrices in |\n>> using the |fmpz_mat_mul> function from |Flint>. |> |10>|> Absolute timings for multiplication of two matrices in |/2*\|)>n>> using the |nmod_mat_mul> (for |p\64>) or |mpn_mod_mat_mul> (otherwise) function from |Flint>. |> |11>|> Absolute timings for multiplication of two matrices in |\n>> represented as NTL objects of type |mat_ZZ>. |> |12>|> Absolute timings for multiplication of two matrices in |/2*\|)>n>> represented as NTL objects of type |mat_ZZ_p>. |> |13>|> Absolute timings for multiplication of two matrices in |/2*\|)>n>> represented as NTL objects of type |mat_zz_p> (word-sized |->|-0.3em|>|0em||0em|>>|p>). |> |14>|> Absolute timings for multiplication of two matrices |\n>> using |FFLAS::fgemm> over |Givaro::ZRing\Givaro::Integer\>. |> |15>|> Absolute timings for pure SIMD multiplication of two matrices in |>|)>n>>. |> |16>|> Absolute timings for tinkered SIMD multiplication of two matrices in |\n>|]>>>. |> |17>|> Absolute timings for high multiplication of matrices in |>|)>n>> in \ tinkered SIMD mode. |> |18>|> Absolute timings for a naive long |\\\\2*\> limb |n\n> matrix product using IFMA. |> |19>|> Absolute timings for a long |\\\\2*\> limb |n\n> matrix product using IFMA and CRT. |> |20>|> Absolute timings for a long |\\\\2*\> limb |n\n> matrix product using IFMA and the non-recursive variant of Karatsuba's algorithm. |> |21>|> Absolute timings for a high |\\\\\> limb |n\n> matrix product using IFMA and the non-recursive variant of Karatsuba's algorithm. |> |22>|> Time to multiply two |n\n> matrices with |p> bit positive integer coefficients using naive AMX-accelerated arithmetic. The columns Np corresponds to a |p\p\p> bit low product, whereas the columns Lp corresponds to a |p\p\2*p> bit long product. The column AMX corresponds to the raw |8\8\32> bit product from Section |->|-0.3em|>|0em||0em|>>. |> <\associate|toc> |math-font-series||1Introduction> |.>>>>|> |Motivation |.>>>>|> > |Previous work |.>>>>|> > |Main contributions |.>>>>|> > |math-font-series||2Preliminaries> |.>>>>|> |2.1Notations |.>>>>|> > |2.2SIMD arithmetic |.>>>>|> > |2.3Benchmarks |.>>>>|> > |math-font-series||3Multiple precision arithmetic> |.>>>>|> |3.1The standard representation |.>>>>|> > |3.2Redundant representations |.>>>>|> > |3.3Naive multiple precision multiplication |.>>>>|> > |3.4Vector and matrix limbs |.>>>>|> > |3.5Karatsuba multiplication |.>>>>|> > |3.6Chinese remaindering |.>>>>|> > |3.7Summary and asymptotic complexity |.>>>>|> > |math-font-series||4Multiplying small matrices using IFMA> |.>>>>|> |4.1Implementation strategy |.>>>>|> > |4.2Pure SIMD mode |.>>>>|> > |4.3Tinkered SIMD mode |.>>>>|> > |4.4Fixed point entries |.>>>>|> > |math-font-series||5Multiplying large matrices using IFMA> |.>>>>|> |5.1Implementation strategy |.>>>>|> > |5.2Relative timings with respect to theoretical peak performance |.>>>>|> > |math-font-series||6Accelerating matrix multiplication using AMX> |.>>>>|> |6.1Implementation strategy |.>>>>|> > |6.2Int8 matrix FMAs |.>>>>|> > |6.3Naive multiple precision multiplications |.>>>>|> > |6.4Chinese remaindering |.>>>>|> > |math-font-series||Bibliography> |.>>>>|> |math-font-series||Appendix AAbsolute timings> |.>>>>|> |A.1Some reference timings for existing software |.>>>>|> > |A.2Multiplying small matrices using IFMA |.>>>>|> > |A.3Multiplying large matrices using IFMA |.>>>>|> > |A.4Multiplying matrices using AMX |.>>>>|> >