Integer matrix multiplicationat various precisions using AMX and IFMA

In this paper we study hardware-accelerated integer matrix multiplication, with coefficients of sizes between 8 and 1000 bits. More particularly, we study two relatively new hardware features in Intel CPUs: the IFMA “integer FMA” instruction and the AMX matrix extensions. We study various algorithms and analyze to what extent our implementations on top of the JIL library can approach theoretical peak performance.

1Introduction

Motivation

Arguably the most significant recent trend in computer hardware is the eruption of tensor cores in both GPUs and CPUs. Although driven by applications in artificial intelligence, is there a potential for tensor cores to be used for other purposes? After all, GPUs have found many applications beyond their initial ones in the gaming industry.

The main motivation of this paper is to study this question for applications in scientific computation and more particularly in computer algebra and computer arithmetic. We will focus on the problem of

matrix multiplication with

-bit integer entries. Tensor cores are designed to accelerate such computations in the case where

is large and

. Can this be used to efficiently deal with other bit precisions, ranging from small ones like

to larger ones like

We mainly restrict our attention to the multiplication of square

matrices. This is indeed most favorable for tensor cores, but also most significant from a complexity perspective, since the complexity of many problems in linear algebra and beyond can be expressed in terms of the complexity of square matrix products [5]. Of course, this restriction will also simplify comparisons, because varying both

and

will already give rise to numerous different cases, as we shall see.

To be fair, we should also compare implementations that exploit tensor cores with alternative ones that are based on more conventional vector arithmetic. For this reason, we chose to focus on recent Intel Xeon processors which support both AMX matrix products and AVX512 vector arithmetic with IFMA. Using AMX, each core can do a

matrix FMA (MFMA)

in about

cycles, where

and

(of dimensions

and

) have

-bit entries and

(of dimension

) has 32-bit entries. Roughly speaking, an integer FMA (IFMA) allow us to compute

in one cycle, where

and

are

-wide vectors of

-bit integers and

is an

-wide vector of

-bit integers. Note that we preferred the terminology MFMA over GeMM and (general matrix multiplication

) because AMX only supports positive accumulation.

In this paper, we focus on the performance of implementations that use a single core of our CPU. This simplification allows to measure the relevance of different implementation strategies that exploit different hardware features (IFMA or AMX), without being distracted by mostly orthogonal considerations on how to scale up these base implementations on multi-core architectures.

All our timings will be relative to theoretical peak performance. For instance, for a naive implementation of

matrix multiplication using

IFMAs, we will report on the cost of the matrix multiplication divided by the theoretical cost to perform

IFMAs. These ratios are fairly independent of the particular machine on which we execute our programs and indicate the quality of our implementation. Quotients close to one mean that we could not have done much better for the algorithm at hand. Higher ratios indicate potential problems such as suboptimal memory access patterns or asymptotically subdominant costs that become high for certain sizes. In an ideal world in which all ratios are one, the theoretical complexity analysis of our methods would correspond perfectly to their practical performance. For completeness, we also added an appendix with absolute timings, both for our new implementation and some existing software.

All implementations in this paper are done inside or on top of the C++ library JIL [2, 40] for computations with straight-line programs (SLPs). This work is the first step of a project to the JIL library with fast linear algebra over the integers and fixed point numbers. We would like to cover all ranges of matrix dimensions and bit-sizes of the integers, while taking advantage of recent hardware features such as IFMA and AMX.

Conversely, matrix multiplication is an excellent problem for extending JIL beyond the mere compilation of SLPs. Another important motivation behind the present work is to use matrix multiplication as a case study for how to integrate SLPs inside larger pieces of code such as multiple loops. The integration of AMX instructions into the SLP framework is also a challenge.

Previous work

Multiple precision integer multiplication has a long history and there exist many algorithms for various bit precisions, such as the traditional schoolbook method, Karatsuba's algorithm [43], Toom's method [62], and modular methods [24, Chapter 5]. For high precisions, there are also methods based on the fast Fourier transform [11, 57, 34], but we will not use them in this paper. GMP [28] and MPFR [31] are the reference libraries for multiple precision integer and floating point arithmetic. We also refer to [38, 29, 18, 8, 39] for some recent work on this topic in the HPC context.

In the context of integer matrix multiplication, fast multiple precision methods tend to become useful at already very low precisions. This is due to the fact that all known methods can be regarded as modular or evaluation-interpolation methods. Roughly speaking, multiplying two

matrices with

-bit entries is reduced to

multiplications of

matrices with

-bit entries, where the quantity

depends on the method (this will be detailed in Section 3 and Table 1 below). The cost of this reduction scales with

, whereas the cost of the

low precision matrix multiplications scales with

. Consequently, the cost of the reduction becomes negligible for large

and, for a given

, it is recommended to pick the method for which

is minimal. Good practical implementations should reflect this fact and this is indeed what our work aims at.

Theoretically speaking [67],

matrices can be multiplied in time

with

for large

. However, among the asymptotically faster algorithms with

, only Strassen's algorithm [60] with

has allowed for some modest practical gains so far. The recent trend to accelerate matrix multiplication directly in the hardware leads to more significant practical gains. (In fact, these hardware accelerations make the practical complexity of matrix multiplication behave as if

were well below

. The potential impact of this observation is another motivation behind our work.)

Several works [12, 19] and research implementations [50, 49] specifically address the challenges raised by Intel's AMX technologies. OpenBlas [61] also contains a bf16 kernel (sbgemm_kernel_16x16_spr_tmpl.c) and GGML [26] an int8 kernel interleaved by encoding and decoding of quantized tensors (ggml-cpu/amx/mmq.cpp). Other types of CPUs integrate tensor accelerations in different ways. For instance, Apple's secret matrix coprocessor in M1–M3 CPUs is based on fast “outer products” [10], for which separate techniques have been developed [20]. We refer to [52] for an overview of different types of accelerators for matrix multiplication in recent CPUs and to [42, Chapter 20] for details about Intel's AMX units. The use of tensor cores to multiply matrices of floating-point numbers of precisions beyond the word length of the accelerator is studied in [54, 53, 63, 1]. Georganas et al. [25] propose a batch-reduce kernel that may also be useful for this purpose.

Before the emergence of tensor cores, classical reference implementations for matrix multiplication and other routines from linear algebra include Blas [6], Lapack [47], Atlas [3], Blis [7] and OpenBlas [61]. We refer to [48, 15, 14, 30, 66, 27, 64, 51] for some of the underlying scientific background. For matrices with multiple precision integer entries, the Chinese remainder theorem (CRT) is typically used to reduce matrix multiplication to modular matrix multiplication, which is then performed using a standard BLAS [16, 13]. When working with very high precisions, FFT-based techniques are also used [35, 33]. The Flint library [32] performs well on a large spectrum of matrix dimensions and integer sizes. We refer to [41, 21] for SIMD-accelerated modular arithmetic and to [4, 63] for CRT-based matrix multiplication on modern hardware. Doubling the precision of a BLAS has been considered in [55].

Our work on JIL also draws inspiration from the compilation of codelets for critical tasks as pioneered in Fftw3 [22, 23] and Spiral [56]. Matrix multiplication has also been studied from a similar compilation-oriented perspective [65, 45, 46].

Main contributions

We report on an experimental HPC implementation of integer matrix multiplication for a wide range of matrix dimensions and bit sizes of integers. We have spent a comparable amount of effort on exploiting two types of hardware acceleration, namely the IFMA and AMX extensions in recent Intel processors. In Table 1, we presented an idealized complexity model for large matrix dimensions as a function of the bit size of the coefficients and the integer multiplication method being used. Although the underlying methods are not new, we expect this synthesis to be useful as a guideline for implementors. Our implementations indeed often come close to the theoretical peak performance for which our idealized model is most relevant. This holds especially for our IFMA-based implementation and a bit less for the AMX-based one. For “large”

matrices, say

, we observe that AMX-based algorithms outperform the IFMA-based ones. Although the gain is modest for now, we expect that it can be improved with more work. Another potential advantage of AMX is that we have a more fine-grained control over the bit-precision. Our implementations also outperform the current reference libraries Flint and NTL.

2Preliminaries

2.1Notations

Throughout this paper, we will write

for the set of positive integers, including zero, and

for the set of unsigned

-bit integers.

Given a ring

and

, we write

for the set of vectors

of length

with entries in

. We will also write

for the set of polynomials of degree

over

. We will assume the natural dense representation for such polynomials

by their vectors

of coefficients.

We formally extend the polynomial notation to the case when

is replaced by an integer radix

, typically of the form

for some

. This allows us to rewrite unsigned integers in

as “polynomials”

in the radix. Following terminology from GMP [28], it is customary to call the coefficients

limbs. If

is a machine integer type, then we may represent an integer

by its vector

of limbs.

For dimensions

, we also write

for the set of

matrices with entries in

. Unless stated otherwise, we will always assume that matrices

are represented in row-major order on a computer, as vectors

. By “

matrix multiplication”, we understand the multiplication of an

matrix with an

matrix into an

matrix.

2.2SIMD arithmetic

Let

be the machine precision (or one of the available machine precisions). Standard arithmetic operations on

naturally extend to SIMD vectors of width

. Whenever a function that only involves basic arithmetic operations needs to be evaluated exactly

times, this can be done replacing all operations by their SIMD versions.

For instance, given a matrix

and a vector

, consider the function that computes the matrix-vector product

. If we need to evaluate this function

times, then the simplest solution is to present the input as a matrix

and a vector

, in which case the exact same formula

yields all

matrix-vector products in SIMD format.

We call this way of vectorizing the pure SIMD mode. This works best whenever a given function needs to be evaluated a multiple of

times. This also assumes that all data types are replaced functorially by their standard vectorized versions. For instance, the standard vectorization of the matrix type

. The standard vectorization of a multiple precision integer type

consists of working with “limbs” that are vectors in

. Hence SIMD multiple precision integers

are represented as

with

and

. Note that this may not seem to be most “natural”, since one might prefer to represent

with

SIMD arithmetic can often be used as well for evaluating a function only once. For instance, the multiplication

of two matrices

and

can benefit from SIMD acceleration whenever

is a multiple of the SIMD width

. Indeed, it suffices to reinterpret

as a matrix

and

as a matrix

, while casting

to a matrix

by repeating each of its entries

times. After that, we can compute

in the pure SIMD mode. However, these reinterpretations require some gymnastics at least, and exploiting SIMD arithmetic becomes even more delicate when none of the dimensions

are a multiple of

. We will come back to this tinkered SIMD mode in Section 4.3 below.

2.3Benchmarks

In this paper, we mainly report on relative timings with respect to the theoretical peak performance. The only exception is the appendix, which contains absolute timings, as well as timings for existing software. All benchmarks were obtained on an Intel Xeon Gold 6538Y+ processor with 32 cores and running at a clock rate of 2.2 GHz. Our timings rely on the rdtsc instruction for measuring time in terms of clock cycles. Recall that all our algorithms use only one core of the CPU.

3Multiple precision arithmetic

3.1The standard representation

Let

be our favorite (and supported) machine bit precision for integer arithmetic, like

. The standard machine representation of an integer

is the vector

of “its” limbs.

The hardware is assumed to provide an instruction for multiplying two limbs

, with a result in

(sometimes, the low and high parts need to be computed separately). For small precisions, multiple precision libraries typically implement integer multiplications

by cross multiplying all limbs

and summing the results while handling carries appropriately.

3.2Redundant representations

Unfortunately, carry propagation is unpleasant to vectorize and SIMD units typically do not provide “vector carry registers” (except for certain GPUs that handle this via mask registers). In a SIMD world, it is therefore more convenient to work with redundant representations: we still represent integers as polynomials

in some radix

, but the limbs

are stored with a higher precision

(which may be different for multiplicands and products). If

, then

and the representation is said to be normalized. We will call

the bit precision and

the bit capacity. The difference

is the number nail bits in the terminology of GMP.

On recent Intel Xeon processors there are two instructions

and

for integer FMAs (IFMA). Formally, given

and

, we have

where

and

denote the quotient and the remainder of the euclidean division of

. The IFMA representation is the redundant representation with

and

. Given a two-limb integer

, we can accumulate

the product of two limbs

using one

and one

. As long as

, this computation does not overflow, but

typically does not remain normalized.

On even more recent Intel Xeon processors with AMX support, there is an instruction for matricial FMAs

, where

, and

. For individual integer coefficients of the product, this corresponds to working with a redundant AMX representation with

and

. Note that products of limbs are two limbs long for the IFMA representation, but only one limb long for the AMX representation. Indeed, when using AMX, products and multiplicands are represented using limbs of different

-bit and

-bit capacities, respectively.

3.3Naive multiple precision multiplication

Redundant representations have the advantage that the

nail bits can be used to accumulate carries, while postponing normalization to the very end.

Consider for instance the multiplication of

and

using integer FMAs, where

and

. Assume also that

. Then we first compute the non-normalized product

with

, and then normalize it, as follows:

This algorithm has been represented graphically in Figure 1; the lower and upper triangles correspond to

and

operations, respectively.

, then the mere highest

limbs of the result can be computed similarly. With the naive algorithm, the right picture in Figure 1 shows that this can be done at roughly half the cost of a full multiplication. This is very useful for fixed point arithmetic, by representing unsigned

-limb fixed point numbers as

-limb integers divided by

. It is interesting to note that the

nail bits can be exploited to represent small integer parts in this case.

However, our current implementation is limited to unsigned fixed point numbers. Signed variants can be obtained by representing signed integers

and

, so that

. Alternatively, one may use a two-complement representation for the highest limb and adapt multiple precision multiplication accordingly. But it becomes harder to exploit the extra

nail bits in both cases.

3.4Vector and matrix limbs

We briefly discussed pure SIMD vectorization of multiple precision integers in Section 2.2. The idea to represent such integers as polynomials in

with vectorial limbs also works for redundant representations. This allows us to apply the algorithms from the previous subsection directly to pure SIMD multiple precision integers in

In a similar way, matrices of multiple precision integers in

can be rewritten into polynomials in

with matrix limbs in

. Computing a non-normalized product

with

, and

can then be done naively using

AMX-accelerated matrix multiplications

. Note that we do not compute low and high parts separately in this case. When doing high

-limb multiplication in this way, this means that we need to do

multiplications, which is slightly more than one half of the number for full products.

3.5Karatsuba multiplication

For large

matrix multiplications with

-limb coefficients, using a naive algorithm for polynomial multiplication becomes suboptimal. If

is small, then it becomes better to use Karatsuba's algorithm [43].

Given a ring

, the traditional Karatsuba trick reduces one multiplication

with

into three multiplications in

and several additions and subtractions using the formulas

, and

. This trick generalizes to the case where

is replaced by

and

, in which case the multiplication of two polynomials

essentially reduces to three multiplications of polynomials in

. These latter polynomials can be computed recursively using the same trick.

When working with multiple precision integers

one technical difficulty is that

and

are in

and not necessarily in

. This makes it better to work with limbs of bit-size

instead of

. If our input integers used

-bit limbs, then this requires them to be rewritten into the

-bit limb representation and vice versa for the output. When using Karatsuba's algorithm recursively for

-limb integers, we may either directly use limbs of bit-size

(which yields a multiplication algorithm for

-bit integers), or do the limb rewritings recursively as well (which yields a multiplication algorithm for

-bit integers).

Karatsuba's trick can also be applied to matrices

. In that case, we need to do

multiplications in

, but only

additions and subtractions, which becomes negligible for large

. Asymptotically, we thus gain a factor

with respect to naive integer matrix multiplication.

Toom generalized Karatsuba's trick [62] and gave a way to reduce the multiplication of two polynomials in

multiplications in

at the cost of an increased loss of bit-precision for the integer variant. Interestingly, Karatsuba's trick can also be used directly for

, while losing a single bit of precision.

The idea (see also [36, Section 4.4.1], [44], [9, Exercise 1.4], and [58]) is to observe that the contribution of

can be computed using

, for all

. In this way, we need to first compute the

products

and then

further products of the form

, which yields a total number of

products. If we just want the high coefficients

, then we only need to do

multiplications. This is actually just as good as Toom's algorithm for

and almost as good for

and

3.6Chinese remaindering

A well known method for multiplying two matrices

is based on the Chinese remainder theorem. We first pick

pairwise coprime moduli

. Best is to choose them as large as possible while fitting into

bits. Let

and assume that

for some small

with

In order to compute

, we first reduce

and

modulo

for

. There is an elegant way to evaluate the map

using a vector-matrix product [13]:

Taking the limbs of the entries of

as our rows

, the computation of

for

thus essentially reduces to an

matrix product over

and similarly for

. We next compute

for

. We finally recover

from

by applying the Chinese remainder theorem. Consider the map

with

and

for

. We have

where

is the inverse of

modulo

for

. After precomputing the

-adic expansions

, it is again possible to compute

using a vector-matrix product [13]:

In a similar way as above, the recovery of

from

thus essentially boils down to an

matrix product over

. In order to ease the computation of the remainder with respect to

, it is sometimes preferred to first compute the remainders of

modulo

and then to work with the

-adic expansions of

instead of

. In this way, the quotient

is always bounded by

3.7Summary and asymptotic complexity

Each of the schemes that we described so far can be regarded as a way to reduce the multiplication

of two matrices in

multiplications

of matrices in

, through the use of an appropriate encoding:

The cost of the reduction may depend on

, but scales linearly with the size

of the matrices. Since the cost of

multiplications of matrices in

is proportional to

, the cost of the reduction is negligible for large

. For large matrices, it is therefore best to select the scheme for which

is minimal. The encoding/decoding process may also induce a loss of precision, which should be kept minimal as well.

Table summarizes the factors

and the loss of precision for the methods from the previous sections. For our implementations, we only considered the naive method, Karatsuba's method, and Chinese remaindering. In certain cases, Toom's algorithm or FFT-based algorithms may also be of interest.

4Multiplying small matrices using IFMA

In this section, we consider the problem of multiplying “small” integer matrices. Since AMX-acceleration is mainly interesting for quite large multiplications (of size at least

), we focus on SIMD-accelerated IFMA-based algorithms.

4.1Implementation strategy

Since we assume both

and

to be “small”, we will use naive algorithms for both matrix and polynomial multiplication. Here the qualifier “naive” is meant to capture the family of all algorithms that perform

pairs of

integer FMAs in total. However, there are many such “naive” algorithms and the precise order in which we do perform the integer FMAs matters, since this crucially influences how well we can keep coefficients in registers and how much pressure we put on registers.

Instead of implementing various multiplication strategies directly in a language like C++, our strategy is to rely on the JIL library [2, 40] for computations and JIT compilation of straight-line programs (SLPs): we implement a large amount of strategies and ways to combine strategies as symbolic programs. The JIL library provides highly efficient mechanisms to record the execution of these programs as SLPs and to JIT compile the resulting SLPs into highly efficient machine code. The recording plus compilation time is typically about 1000 times larger than a single execution of the generated code, which is one to three orders of magnitude faster than traditional compilers.

Let us illustrate this with a short example. We recall that an SLP is just a sequence of arithmetic operations. The operations belong to a fixed signature like

and we operate on a finite number of data fields that are all of the same type. This type could either be a scalar type like

or FP32, or a SIMD vector type like

. A finite number of the data fields are designated to be “input” and “output” fields.

Now given a generic function operating on data fields of the same time, but which may involve loops and function calls, JIL provides a way to record the trace of its execution as an SLP. In Figure 2, we illustrated the result of recording a C++ template for

matrix multiplication as an SLP.

When implementing multiplication algorithms in this way, the focus is not on the design of efficient C++ code templates, but rather on the design of clean templates, whose recording produces good SLPs. This makes it easy and efficient to build new strategies on top of other ones, like a multiplication algorithm for polynomials in

on top of a multiplication algorithm for matrices in

. JIL also provides standard mechanisms for composition and tensor products of maps, reindexation, etc. We refer to [ 40 ] for more information.

4.2Pure SIMD mode

Let us first consider the problem of multiplying integer matrices in the pure SIMD mode, using IFMA-based arithmetic. This means that our multiplicands are

matrices in

with coefficients that are SIMD multiple precision integers in

, whose limbs are SIMD vectors in

Two things matter for the design of efficient “naive” algorithms. We first should decide whether we view our multiplicands as matrices in

with SIMD integer entries or as polynomials in

with SIMD matrix “limbs”. The first point of view tends to be faster when

is large with respect to

, whereas the second one tends to be faster when

is large with respect to

Secondly, divide and conquer algorithms tend to reduce register spilling. For instance,

matrix products can recursively be reduced to smaller matrix products by halving

, or

. However, recursing too far down may increase register pressure. We implemented an ad hoc strategy that recurses until a certain threshold is reached (namely

). In the future, we plan to automatically generate many strategies of this kind and select the best one.

After implementing the desired multiplication strategy as a map in JIL, we record one evaluation of this map as an SLP, and JIT compile the SLP into machine code.

Remark 1. It is interesting to note that the SLP framework allows us to physically work with a specific memory layout (say

matrices with coefficients in

), while applying an algorithm that works logically with another representation (like polynomials in

with coefficients in

). Indeed, the alternative representation just corresponds to a permutation of coefficients in memory. JIL provides an abstraction (the aliased type in domain.hpp) that allows to automate this kind of re-interpretations (and which we indeed used for the implementation described in this section).

Table 2 shows the ratios of our timings for an

limb long

matrix multiplication divided by the theoretical time to the theoretical time to perform

integer FMAs over

(in theory, our processor has a throughput of two integer FMAs per cycle). The left-hand table shows the ratios for the standard representation; the right-hand one concerns the alternative representation of our multiplicands as polynomials in the radix

with matrix coefficients.

Interestingly, the observed ratios sometimes drops below the theoretical peak performance. This is probably due to subtleties with cycle counters in recent CPUs, in which different cores may run at different clock speeds. It might also indicate that the processor can sometimes pipeline IFMA instructions a bit better than officially announced.

We also recall that the generated codelet unrolls the entire computation. For

, this means that the generated program performs

IFMAs. The codelet therefore takes more than 12Mb in memory and does not even fit in the L3 cache. This explains the dramatic increase in the ratios for large

and/or

4.3Tinkered SIMD mode

Let us now turn to the computation of a single

matrix product

over

. In Section 2.2, we already described a reduction to the computation of a

matrix product

over

, whenever

is divisible by

. We may then use an algorithm in pure SIMD mode to compute this product.

In fact the situation is even a bit better than that: instead of casting

into a matrix

, it is better to regard

directly as an

SIMD matrix over

, and to replicate individual coefficients of

across lanes whenever needed. This more compact representation of

has the advantage that we may typically keep

wholly in registers, after which obtaining coefficients from

does not involve memory accesses.

is not divisible by

, then more work is required to benefit from SIMD acceleration: for an

product such that

is divisible by

, we may write

, and

for some

with

. We next redundantly encode

, and

by matrices

by taking

where

and

. Every SIMD multiplication in

then computes a

matrix product with coefficients in

. If

, then some of the lanes of

need to be summed in order to retrieve

The standard SLP signature in JIL includes instructions for permutations of lanes for SIMD base types (mathematically speaking, really maps on lanes, because replication of lanes is also allowed). Consequently, the tinkered SIMD multiplication algorithms can still be recorded into SLPs and compiled within the JIL framework. As a consequence of Remark 1, we note in addition that any of the pure SIMD multiplication strategies can naturally be used in conjunction with any of the tinkered encoding strategies.

Table 3 shows timings that we obtained, with similar conventions as for Table ? . This time, we divided the total number of cycles by

instead of

4.4Fixed point entries

One important application of integer matrix multiplication is linear algebra over fixed point numbers. In that case, we only need to compute the highest

limbs of the product, which saves about half of the computations. Table 4 reports on the obtained timings. We recall that a shortcoming of our implementation is that it is currently limited to unsigned fixed point numbers.

5Multiplying large matrices using IFMA

5.1Implementation strategy

The simplest way to accelerate a large matrix multiplication

via JIL is to consider

, and

as block matrices. We use JIL for the compilation of efficient SLPs to multiply or multiply-accumulate blocks. The block matrices themselves are multiplied using conventional C++ code. The last possibly incomplete row and column are treated using zero-padding or recursive decomposition. The block sizes should be carefully chosen and may differ for

, and

. By taking sufficiently large blocks, we intend to make the overhead of the C++ part of the code as small as possible, so that the total cost essentially boils down to multiplications of blocks.

One goal of the above strategy is that JIL can focus on compiling SLPs of a fixed size and that we can use a higher level language like C++ for the rest, with minimal overhead. However, for practical block sizes, the overhead of the C++ code is small, but not necessarily negligible, which makes it desirable to further improve the interface between JIT compiled SLPs and their use within high level C++ programs. For the purposes of this paper, we therefore extended JIL with a new promise abstraction, which allows us to JIT compile slightly more complex programs than mere SLPs.

For instance, an extremely common scenario is to execute the same SLP on a vector of

successive input and output chunks. This happens for example when encoding all

entries of an integer matrix in

into CRT format. JIL now contains a “loop promise” to cover this scenario. Our implementation first rewrites the SLP into a new one that can handle

entries at a time in SIMD fashion (this requires transpositions in memory to work in the SIMD representation), or even a multiple of

entries at a time via unrolling. In the future, we also plan to automatically determine and factor out loop invariants.

Back to matrix multiplication, another common scenario is that the input of the SLP has first to be gathered from memory and/or that the output needs to be scattered to memory, as we explained in Section 3.7 and (3). For instance, Chinese remaindering essentially reduces the multiplication of two matrices in

multiplications of matrices in

. Now it is more natural to CRT encode an input matrix in

as a vector of matrices in

rather than a matrix with vector entries in

. The latter case corresponds to the loop scenario that we described above, whereas the first one requires us to submit the output to an additional

matrix transposition over

. In order to reduce unnecessary memory accesses, we preferred to introduce a new scenario that scatters entries in

directly into the

desired matrices

while processing the CRTs.

5.2Relative timings with respect to theoretical peak performance

We have tested most of the strategies of Table 1 for the multiplication of large matrices using integer FMAs. For the “Multiply” step of (3), we use the blocking technique that we described above, and also a tinkered SIMD as in Section 4.3 for the multiplication of blocks. We report on relative timings with respect to the theoretical peak performance. Tables with absolute timings are given in Appendix A.3.

In Table 5, we consider long

limb multiplications of

matrices for the naive and the Chinese remaindering (CRT) strategies. More precisely, we show the number of cycles that it takes to compute the matrix product, divided by

, the theoretical number of cycles to perform all IFMA operations in the multiplication step of (3). In Table 6, we show timings for the non-recursive variant of Karatsuba's algorithm, both in the case of long

limb multiplication and high

limb multiplication.

The tables show that we are not far from the theoretical peak performance. The factors from Table 1 can therefore guide the design of efficient algorithms when

gets large. In particular, it is interesting to note that Karatsuba's algorithm is most efficient for high multiplication until five limbs.

There is still quite some room for further improvements. Our implementation of the naive algorithm was done first and based on a suboptimal implementation of matrix transposition for the encoding and decoding steps; this explains the poor performance for small

. In general, with more work, we expect that it should be possible to reach factors that are very close to

(at least for, say,

and

6Accelerating matrix multiplication using AMX

6.1Implementation strategy

Let

, and

, where

, and

are such that

is a multiple of four. Then the AMX extension offers an instruction for computing the matrix FMA

in 16 cycles if

and somewhat faster if

. In practice, the total number

of scalar FMAs per cycle is highest when using the maximal allowed dimensions

, and

, which is the main configuration used in our implementation.

For

, the matrices

, and

need to be loaded from memory and stored in dedicated tile registers. There are eight available tile registers of 1Kb each and there are dedicated strided load and store operations for tiles. In addition, the second multiplicand

is stored in a special VNNI format, which coincides neither with the row major nor column major formats.

Since the 1Kb size of tiles is very different from the 64 byte size of AVX512 registers, AMX and AVX512 instructions do not fit well together into the SLP framework of JIL. For this reason, we decided to start with a base C++ implementation of the matrix FMA (MFMA)

for arbitrary sizes

, using Intel instrinsics. In the future, we plan to extend JIL with special “promises” (supplementing the scenarios described in Section 5.1) for building code that gracefully mixes AMX and AVX512 instructions.

For now, the core of our work on AMX relies on an efficient implementation of MFMA for

bit entries. On top of this, we implemented multiple precision matrix multiplication using various schemes as in (3). The encoding and decoding stages are mostly done using AVX512 instructions and our main concern will be to keep the cost of these stages reasonable. This is challenging due to the fact that the throughput of AVX512 instructions is roughly sixteen times lower than the throughout of AMX instructions (for eight bit entries, we do

FMAs per 16 cycles with AMX and 64 operations per cycle with AVX512). We again heavily rely on SLPs from JIL and a similar extra layer as described in Section 5.1 for loops.

6.2Int8 matrix FMAs

Our 8-bit MFMA kernel follows a relatively standard design informed by the classic paper by Goto and van de Geij [27]; see also [64, Sec. 5.1] and the recommendations of the Intel Optimization Reference Manual [42, Chapter 20].

Like many implementations of general matrix multiplication on AMX units, it is based on a microkernel that multiplies a

tile block

from

with a

tile block

from

and accumulates the result in a

tile block

, using all eight tile registers. Assuming that this microkernel is run repeatedly with the four accumulation tiles fixed, it requires about four loads and four tile FMAs per iteration. While, in ideal circumstances, the architecture we target can sustain up to two tile loads from L1 cache (and one store) per tile FMA, the

pattern makes it easier to hide memory movements behind computations even when not all loads hit L1 cache.

The typical scenario we aim for is for

to reside in L1 cache while

resides in L2 cache, as it is still possible in that case to fully overlap the loads with the FMAs in the steady state. We access the

matrix using non-temporal load instructions so as not to needlessly evict data from

from L1 cache. In addition, we use the strided load instructions mentioned above to read the 16 rows making up a tile of

from their original, typically non-contiguous memory locations without first packing them together, and similarly for writing tiles back to

. Our experiments suggest that, for data aligned on a cache line and in the absence of cache conflicts, strided loads and stores are not significantly slower than contiguous ones (see however [50] for more on the limits of this strategy). We deal with odd numbers of tiles in either dimension using a variant of the same microkernel, and with incomplete

- or

-tiles using a temporary buffer and AVX512 masked loads and stores.

The microkernel is wrapped in loops over

(from outermost to innermost) that multiply a series of

tile blocks of

each fitting in L1 cache by a single block of

that fits in L2 cache. On a single core, we can typically process blocks of about

tiles in this fashion. The resulting macrokernel is itself wrapped in a cache-blocking kernel, using the same loop order, that repacks roughly L2-sized blocks of

into contiguous VNNI-encoded full tiles (padded with zeros if necessary) before processing them using the macrokernel. The encoder operates on blocks of up to

entries of

, corresponding to two adjacent tiles, using AVX512 instructions. Slightly surprisingly, we found this block format to perform better in our context than the alternative of

entries, even though the latter makes it possible to load full cache lines (rows of the

block, to be split among four tiles) at once. We speculate that this may be because the L1 cache can serve up to three 256-byte loads but only two 512-byte loads per cycle.

We also implemented some microkernels dedicated to the special cases where one of the dimensions of the product is less than or equal to one tile.

Remark 2. Endo, Ohshima and Nanri [19] recently proposed an alternative microkernel that reserves six tile registers for accumulating a

block of

and uses the remaining two to hold the current tiles of

and

. The tiles of

are loaded once per block and those of

are loaded twice. While this results in an fma/load ratio of only

, the idea is that two of the loads are virtually guaranteed to hit L1 cache, whereas, depending on the blocking strategy, all four of the tiles accessed by the

kernel may have been evicted by the time they are reused. While Endo et al. report reaching better performance with their

kernel than with the standard one, in our experiments, it performed worse except for

. Endo et al.'s code does not appear to be publicly available, but a plausible explanation would be that the use of non-temporal tile loads makes their technique unnecessary.

6.3Naive multiple precision multiplications

The most natural approach to build an algorithm for

-limb matrix multiplication on top of the work from the previous subsection is to rewrite

input matrices as elements in

. Then a long multiplication boils down to

matrix FMAs and a short (low or high) multiplication to

matrix FMAs.

Table 7 shows the factors between the performance of our implementation and the theoretical peak performance. If one is satisfied with a factor of three or less, then our current implementation will do. (In Table 22 of the appendix, we can see that the naive AMX-based implementation slightly outperforms our IFMA-based one.) But it seems significantly more difficult to approach the theoretical peak performance as well as we did for IFMA-based methods.

The variance of the timings obtained with AMX is quite large, with frequent outliers that can significantly alter mean timings. For this reason, our tables show the median instead of the mean timings over a large number of runs.

The cost of encoding and decoding using AVX512 is far from negligible. It seems hard to entirely get rid of this cost, since the algorithms from the previous subsection benefit from the fact that most of the matrices have a natural memory layout. Nonetheless, by interleaving the AMX instructions appropriately with the encoding and decoding, we expect that the factors can be reduced significantly.

increases, the cost of decoding and encoding should a priori become smaller with respect to the cost of the inner matrix multiplications. This is indeed what we observe for small values

(say

). However, for larger values of

, we start to suffer from cache misses. In the future, we plan to reduce these misses through a more careful interleaving of memory loads and stores with the actual computations. But again, part of the slow-down is probably unavoidable.

6.4Chinese remaindering

The implementation of the Chinese remaindering approach from section 3.6 for AMX-accelerated kernels is work in progress. We are currently working on an implementation in which the transforms (1) and (2) are also done using AMX instructions, by replacing the left-hand rows by huge matrices with one row per object to be transformed. Unfortunately, this means that two of the three matrices are far from being square, and AMX instructions are less efficient for such matrices. In addition, for various interesting bit-sizes like

, we cannot plainly profit from

multiplications and we have to resort to smaller configurations, such as

. For larger bit-sizes, we also run out of 8 bit primes. This forces us to resort to 16 bit primes, for which we pay a factor two (or to 14 bit primes with Karatsuba multiplication, which still leads to an overhead of

). In addition to the matrix multiplications, one also needs to do several modular reductions, overlapping sums, and permutations in memory using AVX512 instructions. Altogether, this makes the algorithms time consuming and tricky to implement. For now, we only completed the implementation of the direct transforms.

Table 8 contains some absolute timings. Comparing with Table 22 , we observe that the required CRT for a

matrix product over

is about 16 times faster than the current naive product. Since we need to do two direct CRTs and one inverse CRT (typically of doubled cost) and four times less modular matrix multiplications, we expect to gain a factor two in this case. The threshold for CRTs is thus around

Bibliography

[1]: A. Abdelfattah, J. Dongarra, M. Fasi, M. Mikaitis, and F. Tisseur. Analysis of floating-point matrix multiplication computed via integer arithmetic. 2026.
[2]: A. Ahlbäck, J. van der Hoeven, and G. Lecerf. JIL: a high performance library for straight-line programs. https://sourcesup.renater.fr/projects/jil, 2025.
[3]: Automatically Tuned Linear Algebra Software (ATLAS). https://math-atlas.sourceforge.net.
[4]: J. Berthomieu, S. Graillat, D. Lesnoff, and T. Mary. Multiword matrix multiplication over large finite fields in floating-point arithmetic. https://hal.science/hal-04917201, 2026.
[5]: D. Bini and V. Y. Pan. Polynomial and matrix computations. Vol. 1. Birkhäuser Boston Inc., Boston, MA, 1994. Fundamental algorithms.
[6]: BLAS (Basic Linear Algebra Subprograms). https://netlib.org/blas.
[7]: BLAS-like library instantiation software framework. https://github.com/flame/blis.
[8]: J. Bradbury, N. Drucker, and M. Hillenbrand. NTT software optimization using an extended Harvey butterfly. Cryptology ePrint Archive, Paper 2021/1396, 2021. https://eprint.iacr.org/2021/1396.
[9]: R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010.
[10]: P. Cawley. Apple AMX instruction set. https://github.com/corsix/amx, 2024.
[11]: J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, 1965.
[12]: W. da Silva Pereira. Accelerating floating-point computations with Intel AMX. Technical Report NREL/TP-2C00-93622, National Renewable Energy Laboratory, 2025.
[13]: J. Doliskani, P. Giorgi, R. Lebreton, and É. Schost. Simultaneous conversions with the Residue Number System using linear algebra. Transactions on Mathematical Software, 44(3), 2018. Article 27.
[14]: J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1–17, March 1990.
[15]: J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1–17, March 1988.
[16]: J.-G. Dumas, T. Gautier, M. Giesbrecht, P. Giorgi, B. Hovinen, E. Kaltofen, B. D. Saunders, W. J. Turner, and G. Villard. Linbox: a generic library for exact linear algebra. In First Internat. Congress Math. Software ICMS, pages 40–50. Beijing, China, 2002.
[17]: J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. ACM Trans. Math. Softw.
[18]: T. Edamatsu and D. Takahashi. Accelerating large integer multiplication using intel AVX-512IFMA. In Algorithms and Architectures for Parallel Processing: 19th International Conference, ICA3PP 2019, Melbourne, Australia, pages 60–74. Springer-Verlag, 2019.
[19]: Y. Endo, S. Ohshima, and T. Nanri. Optimization of a GEMM implementation using Intel AMX. In Proc. of the Supercomputing Asia and Intern. Conf. on High Performance Computing in Asia Pacific Region, SCA/HPCAsia '26, pages 81–90. ACM, 2026.
[20]: D. Filho, G. Brandão, and J. López. Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs. IACR Communications in Cryptology, page 0, 2024.
[21]: P. Fortin, A. Fleury, F. Lemaire, and M. Monagan. High performance SIMD modular arithmetic for polynomial evaluation. ArXiv:2004.11571, 2020.
[22]: M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, PLDI '99, pages 169–180. New York, NY, USA, 1999. ACM.
[23]: M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93(2):216–231, 2005.
[24]: J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013.
[25]: E. Georganas, K. Banerjee, D. Kalamkar, S. Avancha, A. Venkat, M. Anderson, G. Henry, H. Pabst, and A. Heinecke. High-performance deep learning via a single building block. 2019.
[26]: G. Gerganov et al. Ggml. https://github.com/ggml-org/ggml, 2026.
[27]: K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software, 34(3), 2008.
[28]: T. Granlund et al. GMP, the GNU multiple precision arithmetic library. http://www.swox.com/gmp, 1991.
[29]: S. Gueron and V. Krasnov. Accelerating big integer arithmetic using intel IFMA extensions. In 2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), pages 32–38. 2016.
[30]: J. A. Gunnels, G. M. Henry, and R. A. van de Geijn. A family of high-performance matrix multiplication algorithms. In Computational Science – ICCS 2001, pages 51–60. Berlin, Heidelberg, 2001. Springer.
[31]: G. Hanrot, V. Lefèvre, K. Ryde, and P. Zimmermann. MPFR, a C library for multiple-precision floating-point computations with exact rounding. http://www.mpfr.org, 2000.
[32]: William Hart et al. FLINT: fast library for number theory. https://flintlib.org, 2010.
[33]: D. Harvey and J. van der Hoeven. On the complexity of integer matrix multiplication. JSC, 89:1–8, 2018.
[34]: D. Harvey and J. van der Hoeven. Integer multiplication in time . Annals of Mathematics, 193(2):563–617, 2021.
[35]: D. Harvey and A. V. Sutherland. Computing Hasse–Witt matrices of hyperelliptic curves in average polynomial time. In Algorithmic Number Theory – Eleventh International Symposium (ANTS XI), volume 17 of London Mathematical Society Journal of Computation and Mathematics, pages 257–273. Cambridge University Press, 2014.
[36]: J. van der Hoeven. Relax, but don't be too lazy. JSC, 34:479–542, 2002.
[37]: J. van der Hoeven. The Jolly Writer. Your Guide to GNU TeXmacs. Scypress, 2020.
[38]: J. van der Hoeven and G. Lecerf. Faster FFTs in medium precision. In 22nd IEEE Symposium on Computer Arithmetic (ARITH), pages 75–82. June 2015.
[39]: J. van der Hoeven and G. Lecerf. Implementing number theoretic transforms. Technical Report, HAL, 2024. https://hal.science/hal-04841449, accepted for publication in AAECC.
[40]: J. van der Hoeven and G. Lecerf. Towards a library for straight-line programs. AAECC, 2026. https://doi.org/10.1007/s00200-025-00719-0.
[41]: J. van der Hoeven, G. Lecerf, and G. Quintin. Modular SIMD arithmetic in Mathemagix. ACM Trans. Math. Softw., 43(1):5–1, 2016.
[42]: Intel Corporation. Intel® 64 and IA-32 architectures optimization reference manual: volume 1. 2024.
[43]: A. Karatsuba and J. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595–596, 1963.
[44]: G. H. Khachatrian, M. K. Kuregian, K. R. Ispiryan, and J. L. Massey. Fast multiplication of integers for public-key applications. In S. Vaudenay and A. M. Youssef, editors, Selected Areas in Cryptography, pages 245–254. Berlin, Heidelberg, 2001. Springer.
[45]: D. Khaldi, Y. Luo, B. Yu, A. Sotkin, B. Morais, and M. Girkar. Extending LLVM IR for DPC++ matrix support: a case study with Intel® advanced matrix extensions (Intel® AMX). In 2021 IEEE/ACM 7th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), pages 20–26. 2021.
[46]: B. Kuzma, I. Korostelev, J. P. L. de Carvalho, J. E. Moreira, C. Barton, G. Araujo, and J. N. Amaral. Fast matrix multiplication via compiler-only layered data reorganization and intrinsic lowering. Software: Practice and Experience, 53(9):1793–1814, 2023.
[47]: Linear Algebra PACKage. https://www.netlib.org/lapack.
[48]: Ch. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for fortran usage. ACM Trans. Math. Software, 5(3):308–323, 1979.
[49]: H.-J. Lebbink. Hjlebbink/amx-matmul. 2024. https://github.com/HJLebbink/AMX-matmul.
[50]: T. Li. Usstq/mm_amx. https://github.com/usstq/mm_amx, 2024.
[51]: T. M. Low, F. D. Igual, T. M. Smith, and E. S. Quintana-Orti. Analytical modeling is enough for high-performance BLIS. ACM Trans. Math. Softw., 43(2):12–1, 2016.
[52]: H. Martínez, A. Castelló, F. D. Igual, and E. S. Quintana-Ortí. The Cambrian explosion of mixed-precision matrix multiplication for quantized deep learning inference. Future Generation Computer Systems, page 108231, 2025.
[53]: C. S. Mummidi, V. C. Ferreira, S. Srinivasan, and S. Kundu. Highly efficient self-checking matrix multiplication on tiled AMX accelerators. ACM Transactions on Architecture and Code Optimization, 21(2):1–22, 2024.
[54]: Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota. Dgemm on integer matrix multiplication unit. The Intern. J. of High Performance Comp. Appl., 38(4):297–313, 2024.
[55]: D. N. Parikh, R. A. van de Geijn, and G. M. Henry. Cascading gemm: high precision from low precision. 2023. https://arxiv.org/abs/2303.04353v2.
[56]: M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: code generation for DSP transforms. Proc. IEEE, 93(2):232–275, 2005. Special issue on “Program Generation, Optimization, and Adaptation”.
[57]: A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[58]: Michael Scott. Missing a trick: Karatsuba variations. Cryptography and Communications, 10(1):5–15, 2018.
[59]: V. Shoup. NTL: a library for doing number theory. 1996. www.shoup.net/ntl.
[60]: V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:352–356, 1969.
[61]: The OpenBLAS developers. Openblas. http://www.openmathlib.org/OpenBLAS/, feb 2026.
[62]: A. L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics, 4(2):714–716, 1963.
[63]: Y. Uchino, K. Ozaki, and T. Imamura. High-performance and power-efficient emulation of matrix multiplication using INT8 matrix engines. In Proc. of the SC '25 Workshops of the Intern. Conf. for High Performance Comp., Networking, Storage and Analysis, pages 1824–1831. ACM, 2025.
[64]: F. G. Van Zee and R. A. van de Geijn. BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw., 41(3):14–1, 2015.
[65]: Q. Wang, X. Zhang, Y. Zhang, and Q. Yi. AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs. In Proc. of the Intern. Conf. on High Performance Computing, Networking, Storage and Analysis, SC '13, pages 1–12. New York, NY, USA, 2013. ACM.
[66]: R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
[67]: V. V. Williams, Y. Xu, Z. Xu, and R. Zhou. New bounds for matrix multiplication: from alpha to omega. In Proc. of the 2024 Annual ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 3792–3835. SIAM, 2024.

AAbsolute timings

As motivated in the introduction, all timings that we reported so far are relative to the theoretical peak performance. For completeness, this appendix reports on the corresponding absolute timings. We also added absolute timings for existing software.

A.1Some reference timings for existing software

Tables 9 to 14 present some timings obtained with Flint [32], NTL [59], and Fflas [17], three well-known algebraic computation libraries. Flint and Fflas are configured to link with OpenBlas. (We thank Pascal Giorgi for his help with Fflas.)

Since modular matrix multiplication has often benefited from more optimization effort than plain integer matrix multiplication, we include both matrices with entries in

and in

A.2Multiplying small matrices using IFMA

Tables 15, 16, and 17 are the absolute counterparts of Tables 2, 3, and 4 from Section 4. Note that Table 15 corresponds to the left-hand of Table 2 (for multiplicands in

), whereas Table 16 corresponds to the right-hand of Table 3 (for multiplicands that use the alternative representation in

A.3Multiplying large matrices using IFMA

Tables 18 and 19 present the absolute counterparts of the timings in Table 5 from Section 5 (the left-hand and right-hand tables respectively). Similarly, Tables 20 and 21 provide absolute counterparts for the timings in Table 6.

A.4Multiplying matrices using AMX

Table 22 shows absolute timings for naive AMX-accelerated matrix multiplication at various precisions; the corresponding relative timings were given in Table 7.