Modular SIMD arithmetic in Mathemagix

During the past decade, major manufacturers of microprocessors have changed their focus from ever increasing clock speeds to putting as many cores as possible on one chip and to lower power consumption. One approach followed by leading constructors such as Intel

and AMD is to put as many x86 compatible processors on one chip. Another approach is to rely on new simplified processing units, which allows to increase the number of cores on each chip. Modern Graphics Processing Units (GPU) have been designed according to this strategy.

As powerful as multicore architectures may be, this technology also comes with its drawbacks. Besides the increased development cost of parallel algorithms, the main disadvantage is that the high degree of concurrency allowed by multicore architecture often constitutes an overkill. Indeed, many computationally intensive tasks ultimately boil down to classical mathematical building blocks such as matrix multiplication or fast Fourier transforms (FFTs).

In many cases, the implementation of such building blocks is better served by simpler parallel architectures, and more particularly by the Single Instruction, Multiple Data (SIMD) paradigm. A limited support for this paradigm has been integrated into the x86 processor families with the Intel

MMX

and SSE (Streaming SIMD Extensions) technologies. More recently, the advent of Intel AVX (Advanced Vector Extensions) and AVX-512 has further empowered this model of computation. Historically speaking, vector computers have played an important role in “High Performance Computing” (HPC). An interesting question is whether more powerful SIMD instructions can also be beneficial to other areas.

More specifically, this paper concerns the efficiency of the SIMD approach for doing exact computations with basic mathematical objects such as integers, rational numbers, modular integers, polynomials, matrices, etc. Besides high performance, we are therefore interested in the reliability of the output. For this reason, we like to coin this area as High Quality Computing (HQC).

One important mathematical building block for HQC is fast modular arithmetic. For instance, products of multiple precision integers can be computed efficiently using FFTs over finite fields of the form

with

, where

typically fits within 32 or 64 bits, and

is as large as possible under this constraint [55]. Similarly, integer matrices can be multiplied efficiently using such FFTs or Chinese remaindering.

The aim of this paper is twofold. First of all, we adapt known algorithms for modular integer arithmetic to the SIMD model. We carefully compare various possible implementations as a function of the bit-size of the modulus and the available SIMD instructions. In the second part of the paper, we apply our implementations to several fundamental algorithms for exact arithmetic, namely to the multiplication of integers, integer matrices, modular polynomials and modular polynomial matrices. We show that our implementations based on modular arithmetic outperform existing implementations in many cases, sometimes by one order of magnitude.

The descriptions of the algorithms are self-contained and we provide implementation details for the SSE and AVX instruction sets. All implementations were done in C++ and our software is freely distributed within the computer algebra and analysis system Mathemagix [41, 42, 44, 45, 51] (revision

Related work

There is large literature on modular integer arithmetic. The main operation to be optimized is multiplication modulo a fixed integer

. This requires an efficient reduction of the product modulo

. The naive way of doing this reduction would be to compute the remainder of the division of the product by

. However, divisions tend to be more expensive than multiplications in current hardware. The classical solution to this problem is to pre-compute the inverse

so that all divisions by

can be transformed into multiplications.

In order to compute the product

and

modulo

, it suffices to compute the integer floor part

of the rational number

and to deduce

. The most obvious algorithm is to compute

using floating point arithmetic. However, this approach suffers from a few drawbacks. First, in order to ensure portability, the processor should comply to the IEEE-754 standard, which ensures that rounding errors are handled in a predictable way. Secondly, the size of the mantissa can be twice as large as the size of the modulus. Back and forth conversions from and to integers may then be required, which are expensive on some platforms. For more details we refer the reader to [22, Chapter 14] and [3, 4]. Many recent processors contain fused multiply-add instructions (FMA), which facilitate taking full advantage of floating point operations; see for instance [60, Chapitre 14].

Barrett's algorithm [7] provides an alternative to the floating point approach. The idea is to rescale the floating point inverse of

and to truncate it into an integer type. For some early comparisons between integer and floating point operations and a branch-free variant, we refer to [32]. This approach is also discussed in [23, Section 16.9] for particular processors. For multiple precision integers, algorithms were given in [6, 39].

Another alternative approach is to precompute the inverse

of the modulus

as a

-adic integer. This technique, which is essentially equivalent to Montgomery's algorithm [52], only uses integer operations, but requires

to be odd. Furthermore, modular integers need to be encoded and decoded (with a cost similar to one modular product), which is not always convenient. Implementations have been discussed in [49]. A generalization to even moduli was proposed in [48]. It relies on computing separately modulo

and

(such that

is odd, and

) via Chinese remaindering. Unfortunately, this leads to a significant overhead for small moduli. Comparisons between Barrett's and Montgomery's product were given in [15] for values of

corresponding to a few machine words. A recent survey can be found in [54] with a view towards hardware implementation.

Modular arithmetic and several basic mathematical operations on polynomials and matrices have been implemented before on several types of parallel hardware (multicore architectures and GPUs); see for instance [8, 9, 29, 33]. Although the present article focuses on x86 compatible processor families and their recent extensions, the algorithms we describe are intended to be useful for all platforms supporting SIMD technology.

Our applications to matrix product over large polynomials or integers are central tasks to many algorithms in computer algebra [2, 12, 26, 50]. For a sample of recent targeted implementations, the reader might consult [30, 38, 53].

Our contributions

One difficulty with modular arithmetic is that the most efficient algorithms are strongly dependent on the bit-size of the modulus

, as well as the availability and efficiency of specific instructions in the hardware. This is especially so in the case of SIMD processors, for which the precise instruction sets still tend to evolve rather quickly. In this paper, we have attempted to be as exhaustive as possible: we implemented all classical strategies, while hand optimizing all algorithms as a function of the available SIMD instructions.

The fast sequential algorithms for modular reduction from the previous section all involve some branching in order to counterbalance the effect of rounding. Our first contribution is to eliminate all branching so as to make vectorization possible and efficient. Our second contribution is a complete implementation of the various approaches as a function of the available SIMD instructions. For Barett's and Montgomery's algorithms, we consider both the SSE 4.2 and AVX 2 instruction sets, for all types of supported integers. For the floating point approach, we have implementations with and without SSE 4.1, AVX, and FMA extensions. We finally provide detailed benchmarks for the performance of the different approaches in practice. The observed speedups with respect to the “scalar” versions (i.e. without vectorization) nearly reflect theoretical expectations.

High performance libraries such as GMP [31], MPIR [35], MPFR [24], FLINT [34] for computing with large integers, floating point numbers in arbitrary precision, polynomials and matrices are written in C and mostly aim at providing users with very fast mathematical operations. In fact, the internal representations and algorithms operating on them are not expected to be accessed nor replaced a posteriori. Instead, the C++ libraries of Mathemagix provide users with high level interfaces to data structures and mathematical objects, but also with interfaces to internal representations and algorithms. Users can assemble algorithms for specific purposes, fine tune thresholds, and can easily replace an algorithm by another one a posteriori. In Section 4, we explain the main design principles. In particular, we show how our modular numbers are implemented and how our approach makes it simple to develop SIMD variants of algorithms which benefit from hardware vectorization features.

A third contribution of this article is the application of our modular arithmetic to an SIMD version of the fast Fourier transform algorithm in a prime field of the form

, where

is a multiple of a large power of two. Our implementations outperform existing software and we report on timings and comparisons for integer, polynomial and matrix products.

Besides low level software design considerations, our main research goals concern algorithms for solving polynomial systems, for effective complex analysis, and error correcting codes. Mathemagix includes several of our recent algorithms [5, 10, 11, 43]. From our experience, a strict bottom-up approach to mathematical software design prevents users from implementing high level algorithms efficiently. In other words, assembling algorithms from a strict mathematical point of view does not always lead to the best performance. Interfaces to algorithms and the ability to reassemble them for higher level operations with a sufficient level of genericity turns out to be very useful. For instance multiplying integer matrices by performing integer products and additions in sequence according to the schoolbook algorithm can be improved if one has access to relatively generic implementations of the low level integer product sub-functions. This specific example is adressed in our last section.

Conventions

Throughout this article, timings are measured on a platform equipped with an Intel

Core^TM i7-4770 CPU @ 3.40 GHz and 8 GB of 1600 MHz DDR3. It runs the Jessie GNU Debian

operating system with a Linux

kernel version 3.12 in 64 bit mode. Care has been taken for avoiding CPU throttling issues while measuring timings. Nevertheless timings often vary in a non negligible range sometimes over 20% and we thus measure average timings. We compile mainly with GCC [27] version 4.8.2. For pseudocode we use the operators of the C99 language with their exact meanings and the usual integer types from inttypes.h. Note also that integer promotions do not invalidate the algorithms and proofs in this paper. We refer the reader to the C99 (ISO/IEC 9899:1999) standard Sections 6.2.5 and 6.3.1.1 for further details [16]. For simplicity we assume that the operator

always implements the right arithmetic shift.

Brief survey of SIMD technology

For completeness, we conclude this introduction with recalling basic facts on SIMD technology. This technology can be seen as a type of parallelism where multiple processing units simultaneously execute the same instruction on multiple data. We can think of an SIMD processor as a single CPU which operates on registers which are really vectors of scalar data. The main basic characteristics of SIMD processors are the following:

Modern SIMD processors typically support both floating point types (float and double) and integer types (8, 16, 32, and 64-bit integers). For a scalar type

of bit-size

, we operate on vectors of

coefficients in

Currently, the most common SIMD technologies are SSE (with

and

), AVX (with

and

) and AVX-512 (with

and

). For instance, an AVX enabled processor can execute an operation on a vector of

integers of bit-size

in unit time. It should be noticed that these bit-sizes

are rather modest in comparison to their historical ancestors, such as the CDC star-100 and Cray

vector computers (CDC was a trademark of Control Data Corporation). One advantage of shorter vectors is that it remains feasible to provide instructions for permuting the entries of vectors in quite general ways (i.e. “communications” remain relatively inexpensive, whenever needed).

In this paper, our SIMD algorithms are described using compiler intrinsics. Most of the time, the semantics of the SIMD types and instructions in this paper is quite straightforward. Let us give a few examples:

We notice that all floating point arithmetic conforms to the IEEE-754 standard. In particular, results of floating point operations are obtained through correct rounding of their exact mathematical counterparts, where the global rounding mode can be set by the user. Some processors also provide fused multiply add (FMA) and subtract instructions _mm_fmadd_pd and _mm_fmsub_pd which are useful in Section 3. Another less obvious but useful instruction that is provided by some processors is:

For more details about SSE and AVX intrinsics we refer to the Intel

intrinsic guides [46, 47], and also to [21] for useful comments and practical recommendations.

Let us mention that a standard way to benefit from SIMD technology is to rely on auto-vectorization features of compilers. Languages such as C and C++ have not yet been extended to support SIMD technology in a standard and more explicit manner. In particular, current SIMD types are not even bona fide C/C++ types. Programming via intrinsics contains a certain number of technical pitfalls. In particular streaming load and store instructions are slowed down whenever memory addresses are not aligned on the size of the vectors. It is therefore left to the programmer to wrap memory allocation functions or to rely on specific features of the compiler.

2.Modular operations via integer types

and

are nonnegative integers, then

and

represent the quotient and the remainder in the long division of

respectively. We let U denote an unsigned integer type of bit-size

assumed to be at least

for convenience. This means that all the integers in the range from

can be represented in U by their binary expansion. The corresponding type for signed integers of bit-size

is written I: all integers from

can be represented in I. The type

represents unsigned integers of size at least

. Let

be a nonnegative integer of bit-size at most

with

. For efficiency reasons we consider that

and

are quantities known at compilation time, whereas the actual value of

is only known at execution time. In fact, deciding which elementary modular arithmetic algorithm to use at runtime according to the bit-size of

is of course too expensive. For convenience we identify the case

to computing modulo

. Modulo

integers are stored as their representative in

2.1.Modular sum

Although modular sum seems rather easy at first look, we quickly summarize the known techniques and pitfalls.

2.1.1.Unvectorized implementations

Given

and

modulo

, in order to compute

, we can first compute

in U and subtract

when an overflow occurs or when

, as detailed in the following function:

Input. Integers and modulo .

U add_mod (U , U ) {

U = + ;

if ( < ) return - ;

return ( >= ) ? - : ; }

Of course, when

, no overflow occurs in the sum of line 1, and line 2 is useless. If branching is more expensive than shifting and if

, then one can compute I = + - and return + (( >> (-1)) & ), where we recall that

implements the right arithmetic shift. It is important to program both versions and determine which approach is the fastest for a given processor. Negation and subtraction can be easily implemented in a similar manner.

2.1.2.Implementations with SSE 4.2 and AVX 2

Arithmetic operations on packed integers are rather well supported by SSE 4.2, uniformly for various types of integers. Let

represent the packed

-bit integer of type __m128i, whose entries are filled with

. In order to avoid branching in Function 1, one can compute U = + and return min (, - ) when

. This approach can be straightforwardly vectorized for packed integers of bit-sizes

, 16 and 32, as exemplified in the following function:

Input. Packed 32-bit integers and modulo , assuming .

__m128i add_mod_1_epu32 (__m128i , __m128i ) {

__m128i = _mm_add_epi32 (, );

return _mm_min_epu32 (, _mm_sub_epi32 (, )); }

Since the min operation does not exist on packed 64-bit integers, we use the following function, where

represents the packed

-bit integer of type __m128i filled with

Input. Packed 64-bit integers and modulo , assuming .

__m128i add_mod_1_epu64 (__m128i , __m128i ) {

__m128i = _mm_sub_epi64 (_mm_add_epi64 (, ), );

__m128i = _mm_cmpgt_epi64 (, );

return _mm_add_epi64 (, _mm_and_si128 (, )); }

, we can proceed as follows:

equals

if, and only if,

. In the latter case

can be obtained as

, and otherwise as

. If

, an overflow only occurs when

. Nevertheless, max (, - ) equals

when computed in U and - ( - ) is the correct value. These calculations can be vectorized for packed integers of bit-size

, 16, and

as follows:

Input. Packed 32-bit integers and modulo .

__m128i add_mod_epu32 (__m128i , __m128i ) {

__m128i = _mm_sub_epi32 (, );

__m128i = _mm_cmpeq_epi32 (_mm_max_epu32 (, ), );

__m128i = _mm_andnot_si128 (, );

return _mm_add_epi32 (_mm_sub_epi32 (, ), ); }

The minimum operator and comparisons do not exist for packed 64-bit integers so we declare the function _mm_cmpgt_epu64 (__m128i , __m128i ) as:

_mm_cmpgt_epi64 (_mm_sub_epi64 (, ), _mm_sub_epi64 (, ))

where

represents the packed 64-bit integer filled with

. The modular addition can be realized as follows:

Input. Packed 64-bit integers and modulo .

__m128i add_mod_epu64 (__m128i , __m128i ) {

__m128i = _mm_add_epi64 (, );

__m128i = _mm_or_si128 (_mm_cmpgt_epu64 (, ),

_mm_cmpgt_epu64 (, - ));

return _mm_sub_epi64 (, _mm_and_si128 (, )); }

2.1.3.Timings

Table 1 displays timings for arithmetic operations over integer types of all possible bit-sizes

supported by the compiler. Timings are the average number of clock cycles when applying the addition on two vectors of byte-size 4096 aligned in memory, and writing the result into the first vector. In particular, timings include load and store instructions. The row Scalar corresponds to disabling vectorization extensions with the command line option -fno-tree-vectorize of the compiler. For better performance, loops are unrolled and the optimization option -O3 of the compiler is used. For conciseness, we restrict tables to addition; subtraction and negation behave in a similar way. Operations which are not supported are indicated by n/a, meaning Non Applicable. In the other rows of the table we indicate timings obtained by the vectorized algorithms according to the instruction set. Let us mention that the vectorization process is not left to the compiler. In fact we implemented a dedicated framework in Mathemagix, which is detailed in Section 4.

Table 2 shows timings for modular sums. In absence of vectorization, 8-bit and 16-bit arithmetic operations are in fact performed by 32-bit operations. Indeed, 8-bit and 16-bit arithmetic is handled in a suboptimal way by current processors and compilers. For the vectorized implementations, we observe significant speedups when

. Nevertheless, when

, the lack of performance is not dramatic enough to justify the use of larger integers and double memory space.

2.2.Barrett's product

The first modular product we describe is the one classically attributed to Barrett. It has the advantage to operate on integer types with no assumption on the modulus. For any nonnegative real number

, we write

for the largest integer less or equal to

, and

for the smallest integer greater or equal to

. We use the following auxiliary quantities:

Since

, the integer

fits in U. We call

the pre-inverse of

. Since

, the computation of

just requires one division in

. In this subsection both integers

and

are assumed to be of type U. We describe below how to set suitable values for

and

in terms of

(see Table 3).

2.2.1.Reduction

Let

be one more auxiliary quantity with

. If

is an integer such that

, Barrett's algorithm computes

as follows:

Input. An integer such that , with , , , , and as above.

U reduce_barrett (L ) {

L = >> ;

L = ( * ) >> ;

L = - * ;

while ( >= ) = - ;

return ; }

Proof. From

, it follows that

. Therefore,

has bit-size at most

, and

. Since

the product

fits in L. From

we obtain

. If

, then

, so that

, whence the conclusion follows. If

, then

, and we still have

is an integer modulo

, and

and

are sequences of integers modulo

, computing

is a central task to matrix and polynomial products. In order to minimize the number of reductions to be done, we wish to take

as large as possible. In general, in Barrett's algorithm, we can always take

and

, which leads to

. When

, we can achieve

if we restrict to

with

and

. When

, it is even better to take

and

so that

when

. When

, then we let

and

when

to get

. These possible settings are summarized in Table 3.

We could consider taking

instead of

. From

and

we obtain

. Therefore the inequalities

imply that

fits in U. Using

instead of

in Function 6 leads to the following inequalities:

, then the term

disappears. If

is sufficiently small, then

can be bounded by

. Therefore line 4 of Function 6 can be discarded. More precisely, if we assume that

and letting

, and

, then we have the following faster function:

Input. An integer with , and , , , as above.

U reduce_barrett_half (U ) {

U = ( * ((L) )) >> ;

return - * ; }

Remark 3. With recent compilers, 128-bit integers are available, which eases this implementation up to 64-bit. But if U is the largest type of integers supported by the compiler, then

has to be computed by alternative methods such as the classical Newton-Raphson division [57] (see also our implementation in numerix/modular_int.hpp inside Mathemagix, with the necessary proofs in the documentation).

2.2.2.Several products by the same multiplicand

Assume that we wish to compte several modular products, where one of the multiplicands is a fixed modular integer

. A typical application is the computation of FFTs, where we need to multiply by roots of unity that only depend on the size of the transformer and can thus be computed once and cached into memory. Then one can pre-compute

and obtain a speed-up within each product by

thanks to the following function:

Input. Integers , , and in , with and as above.

U mul_mod_barrett (U , U y, U ) {

U = ( * ((L) )) >> ;

L = ((L) ) * - ((L) ) * ;

return ( >= ) ? - : ; }

Notice that Function 8 does not depend on

. If

, then line 2 can be replaced by U = * - *. If

, then

fits in U since

, and Function 8 can be improved along the same lines as Function 7:

Input. Integers , , in , with and as above, and where .

U mul_mod_barrett_half (U , U y, U ) {

U = ( * ((L) )) >> ;

return * - * ; }

2.2.3.Implementations with SSE 4.2 and AVX 2

Function 6 could be easily vectorized if all the necessary elementary operations were available within the SSE 4.2 or AVX 2 instruction sets. Unfortunately this is not so for all integer sizes, which forces us to examine different cases. As a first remark, in order to remove the branching involved in line 4 of Function 6, we replace it by = min (, - ) as many times as specified by Proposition 1. In the functions below we consider the case when

. The other case is more straightforward. The packed

-bit integer filled with

(resp.

) seen of type L is written

(resp.

). The constant

corresponds to the packed

-bit integer filled with

. We start with the simplest case of packed 16-bit integers. If

and

are the two vectors of scalar type U to be multiplied modulo

, then we unpack each of them into two vectors of scalar type L, we perform all the needed SIMD arithmetic over L, and then we pack the result back to vectors over U.

Input. Packed 16-bit integers and modulo , assuming .

__m128i mul_mod_2_epu16 (__m128i , __m128i ) {

__m128i = _mm_unpacklo_epi16 (, );

__m128i = _mm_unpackhi_epi16 (, );

__m128i = _mm_unpacklo_epi16 (, );

__m128i = _mm_unpackhi_epi16 (, );

__m128i = _mm_mullo_epi32 (, );

__m128i = _mm_srli_epi32 (, );

__m128i = _mm_srli_epi32 (_mm_mullo_epi32 (, ), );

__m128i = _mm_packus_epi32 (, );

__m128i = _mm_sub_epi16 (_mm_mullo_epi16 (, ),

_mm_mullo_epi16 (, );

return _mm_min_epu16 (, _mm_sub_epi16 (, ));

, then the computations up to line 10 are the same but the lines after are replaced by:

Under the only assumption that

, ones needs to duplicate latter lines 13 and 14.

For packed 8-bit integers since no packed product is natively available, we simply perform most of the computations over packed 16-bit integers as follows:

Input. Packed 8-bit integers and modulo , assuming .

__m128i muladd_mod_2_epu8 (__m128i , __m128i ) {

__m128i = _mm_unpacklo_epi8 (, );

__m128i = _mm_unpackhi_epi8 (, );

__m128i = _mm_unpacklo_epi8 (, );

__m128i = _mm_unpackhi_epi8 (, );

__m128i = _mm_mullo_epi16 (, );

__m128i = _mm_srli_epi16 (, );

__m128i = _mm_srli_epi16 (_mm_mullo_epi16 (, ), );

__m128i = _mm_sub_epi16 (, _mm_mullo_epi16 (, ));

__m128i = _mm_packus_epi16 (, );

return _mm_min_epu16 (, _mm_sub_epi8 (, ));

does not hold, then the same modifications as with packed 16-bit integers are applied.

For packed

-bit integers, a similar extension using packing and unpacking instructions is not possible. We take advantage of the _mm_mul_epu32 instruction. If

, then we use the following code:

Input. Packed 32-bit integers , modulo , assuming .

__m128i mul_mod_2_epu32 (__m128i , __m128i ) {

__m128i = _mm_mul_epu32 (, );

__m128i = _mm_srli_epi64 (, );

__m128i = _mm_srli_epi64 (_mm_mullo_epi32 (, ), );

__m128i = _mm_srli_si128 (, 4);

__m128i = _mm_mul_epu32 (, );

__m128i = _mm_srli_epi64 (, );

__m128i = _mm_srli_epi64 (_mm_mullo_epi32 (, ), );

__m128i = _mm_blend_epi16 (, _mm_slli_si128 (, 4),4+8+64+128);

__m128i = _mm_or_si128 (, _mm_slli_si128 (, 4));

__m128i = _mm_sub_epi32 (, _mm_mullo_epi32 (, ));

return _mm_min_epu32 (, _mm_sub_epi32 (, ));

When

, the same kind of modifications as before have to be done:

must be computed with vectors of unsigned 64-bit integers. Since these vectors do not support the computation of the minimum, one has to use _mm_cmpgt_epi64.

2.2.4.Timings

In Table 4 we display timings for multiplying machine integers, using the same conventions as in Table 1. Recall that packed

-bit integers have no dedicated instruction for multiplication: it is thus done through the 16-bit multiplication via unpacking/packing.

Table 5 shows the performance of the above algorithms. The row “Naive” corresponds to the scalar approach using the C++ operator % to compute remainders. Up to 32-bit integers, arithmetic is handled via 32 and 64-bit registers. The row “Barrett” contains timings for the scalar implementation of Barrett's product. Notice that the case

is significantly faster than the case

In the scalar approach, compiler optimizations and hardware operations are not always well supported for small sizes, so it makes sense to perform computations on a larger size. On our test platform, 32-bit integers typically provide the best performance. The resulting timings are given in the row “padded Barrett”. For 8-bit integers, the best strategy is in fact to use lookup tables yielding

cycles in average, but this strategy cannot be vectorized. Finally the last two rows correspond to the vectorized versions of Barrett's approach.

For larger integers, the performance is shown in Table 6. Let us mention that in the row “Barrett” with

we actually make use of __int128 integer arithmetic.

Table 7 shows the average cost of one product with a fixed multiplicand. In comparison with Table 5, we notice a significant speedup, especially for the vectorial algorithms.

Remark 6. Our vectorized modular operations can easily be adapted to simultaneously compute modulo several moduli, say

, assuming they share the same parameters

and

. Nevertheless, instead of using vectors

filled with the same modulus, this requires one vector of type L to be filled with

and

and a second one with

and

; the same consideration holds for

. These modifications involve to cache more pre-computations and a small overhead in each operation.

2.3.Montgomery's product

For Montgomery's algorithm [52], one needs to assume that

is odd. Let

be such that

and let

. We need the auxiliary quantities

and

defined by

, and

. They can be classically computed with the extended Euclidean algorithm [26, Chapter 3].

2.3.1.Reduction

For convenience we introduce

. The core of Montgomery's algorithm is the following reduction function:

Input. An integer such that , with odd, and as above.

U reduce_montgomery (U ) {

U = ( * ) & ;

L = + * ;

U = >> m;

if ( < ) return - ;

return ( >= ) ? - : ; }

Proof. First one verifies that

. Therefore

is a multiple of

. If

, then no overflow occurs in the sum of line 2, and the division in line

is exact. We then have

, and the correctness follows from

and thus

. If

, then the casual overflow in line 2 is tested in line 4, and

is the value in U of

Let

be a modulo

integer. We say that

is in Montgomery's representation if stored as

. The product of two modular integers

and

, of respective Montgomery's representations

and

, can be obtained in Montgomery's representation

by applying Function 13 to

since

, then the mask in line 1 can be avoided, and the shift in line 3 might be more favorable than a general shift, according to the compiler. In total, if

, Montgomery's approach is expected to be faster than Barrett's one. Otherwise costs should be rather similar. Of course these cost considerations are rather informal and the real cost very much depends on the processor and the compiler.

Remark 8. As for Barrett's algorithm one could be interested in simplifying Montgomery's product when performing several products by the same multiplicand

. Writing

, the only simplification appears in line 1, where

can be obtained as

, which saves one product in U. Therefore Barrett's approach is expected to be always faster for this task. Precisely, if

, this is to be compared to Function 8 which performs only one high product in line 1.

Remark 9. Up to our best knowledge, it is not known whether Montgomery's product can be improved in a similar way as we did for Barrett's product, in the case when

2.3.2.Timings

Table 8 contains timings measured in the same manner as in the previous subsection. Compared to Tables 5 and 6 we observe that Montgomery's product is not interesting in the scalar case for 8 and 16-bit integers but it outperforms Barrett's approach in larger sizes, especially when

. Compared to Table 7, Montgomery's product is only faster for when

and

3.Modular operations via numeric types

Instead of integer types, we can use numeric types such as float or double to perform modular operations. Let us write F such a type, and let

represent the size of the mantissa of F minus one, i.e. 23 bits for float and 52 bits for double. This number of bits corresponds to the size of the so called trailing significant field of F, which is explicitly stored. The modular product of

and

begins with the computation of an integer approximation

. Then

is an approximation of

at distance

. The constant hidden behind the latter

depends on rounding modes. In this section we conform to IEEE-754 standard. We first analyze the case when

fills less than half of the mantissa. We next propose a general algorithm using the fused multiply-add operation.

3.1.Notations

The following notations are used until the end of the present section. We write

for the value of

computed in F by applying the division operator on 1.0 and (F) . Still following IEEE-754, the trailing significant field of

is written

and its exponent

. These quantities precisely depend on the active rounding mode used to compute

. But for all rounding modes, they satisfy the following properties:

Let

be a real number of type F, and let F = * be the approximation of

computed in F. Again, independently of the rounding mode, there exist integers

and

, for the trailing significant field and the exponent of

, such that

We also use the approximation

of 1.0 / ((F) ) computed in F with rounding mode set towards infinity, so that

holds.

3.2.Reduction in half size

When

fills no more than half of the mantissa, that is when

, it is possible to perform modular products in F easily. Let the floor function return the largest integral value less than or equal to its argument, as defined in C99.

Input. An integer such that , where , and .

F reduce_numeric_half (F ) {

F = * ;

F = floor ();

F = - * ;

if ( >= ) return - ;

if ( < 0) return + ;

return ; }

Proof. Using (1) and (2) we obtain

, hence the floor function actually returns

. From the decomposition

If using

instead of

, and if the rounding mode is set towards infinity, then

, and then

, which allows us to discard line 4.

In the same way we did for Barrett's product, and under mild assumptions, we can improve the latter function whenever

Input. An integer such that , where and .

Hypothesis. The current rounding mode rounds towards infinity.

F reduce_numeric_half_1 (F ) {

F = * ;

F = floor ();

return - * ; }

By (2) and (4), the claim is satisfied as soon as

, which is itself implied by

, that is correct since it rewrites into

by expanding the product.

Remark 12. The hypothesis on the rounding mode can be dropped if

does not divide

. This is in particular case, if

is prime and

is the product of two numbers in

. Indeed, if

, then the algorithm is clearly correct. If

does not divide

, then we have

. Combining the latter inequality with

and

yields

3.3.Larger modular products via FMA

Until now we have not exploited the whole mantissa of F. To release the assumption

in Function 14, the value for

could be computed over a sufficiently large integer type. Over double one can use

-bit integers, as implemented for instance in [58]. The drawbacks of this approach are the extra conversions between numeric and integer types, and the fact that the vectorization is compromised since

-bit integer products are not natively available in the SSE or AVX instruction sets. In what follows we describe a modular product in the case when

, using the fused multiply-add operation from the IEEE 754-2008 standard. We write fma (, , ) for the evaluation of

using the current rounding mode.

Input. Integers and such that , where , and .

F mul_mod_fma (F , F ) {

F = * ;

F = fma (, , );

F = * ;

F = floor ();

F = fma (, , );

F = + ;

if ( >= ) return - ;

if ( < 0) return + ;

return ; }

Proof. Let

. We have

, so that

. We also verify that

, which implies that

. Using (5) again, the following inequalities hold:

From (2) we have

, and from (4) we deduce

. It follows that

. In particular this implies

, whence

. This proves that

and therefore

, which finally implies the correctness of Function 16.

Now suppose that

is used and that the rounding mode is set towards infinity. Then

and

so that

, and therefore

Remark 14. Let I be a type of signed integers of at least

bits. In the above functions the executable code generated for the floor instruction heavily depends on the compiler, its version, and its command line arguments. This makes timings for this numeric approach rather unpredictable. We have run tests with GCC version

4.7 and CLANG [17] version

3.4 and observed that = (F) ((I) ) always generates cvttsd2si, cvtsi2sd instructions which are the x86 cast instructions, whereas = floor () is compiled into a call to the floor function from the math library. In order to force the compiler to use a special purpose instruction such as roundsd from SSE 4.1, the -O3 -msse4.1 -fno-trapping-math arguments must be passed to gcc. In the case of clang, the options -O3 -msse4.1 are sufficient.

3.4.Vectorized implementations

For efficiency, we assume that SSE 4.1 is available. Our modular addition and subtraction functions can benefit from _mm_blendv_pd as follows:

Input. Packed doubles and modulo , assuming .

__m128d add_mod_1 (__m128d , __m128d ) {

__m128d = _mm_add_pd (, );

__m128d = _mm_sub_pd (, );

return _mm_blendv_pd (, , ); }

Assuming that FMA instructions are available, Function 16 is implemented as follows:

Input. Packed doubles and modulo , assuming .

__m128d mul_mod_1 (__m128d , __m128d ) {

__m128d = _mm_mul_pd (, );

__m128d = _mm_fmsub_pd (, , );

__m128d = _mm_mul_pd (, );

__m128d = _mm_floor ();

__m128d = _mm_fnmadd_pd (, , );

__m128d = _mm_add_pd (, );

__m128d = _mm_sub_pd (, );

= _mm_blendv_pd (, , );

= _mm_add_pd (, );

return _mm_blendv_pd (, , ); }

3.5.Timings

Table 9 is obtained under similar conditions as Tables 1 and 4, but for the types float and double. The row “Scalar” corresponds to the unvectorized implementation with unrolled loops, and preventing the compiler from using auto-vectorization. The next rows concern timings using SSE 4.1 and AVX instructions with unrolled loops on aligned data.

In Table 10 we have shown timings for the modular sum and product functions from this section. The row “Scalar” excludes auto-vectorization and does not use the processor built-in FMA unit.

We notice that the timings in Table 10 are interesting when compared to those in Table 5. However, for multiplications with a fixed multiplicand, the approach from Table 7 becomes more attractive, and this will indeed be used in Section 5 below.

We also notice that efficient hardware quadruple precision arithmetic would allow us to consider moduli with larger bit sizes. An alternative to hardware quadruple precision arithmetic would be to provide an efficient “three-sum”

instruction with correct IEEE-754 rounding. This would actually allow for the efficient implementation of more general medium precision floating point arithmetic.

4.Implementation design in C++

The C++ libraries of Mathemagix are not only oriented towards mathematical interfaces but also towards reusability of generic low level implementations. Advanced users can not only build on existing algorithms, but also replace them a posteriori by more efficient implementations for certain special cases. We heavily rely on C++ template programming. This section surveys some of the design principles behind our implementation of modular numbers but also of polynomials, matrices and other objects in Mathemagix. Our techniques turn out to be especially useful for implementing the scalar and SIMD algorithms in a unified framework.

4.1.Data structures and top level functions

Consider a typical template class in Mathemagix, such as modulus, vector, matrix, polynomial or series. Besides the usual template parameters, such as a coefficient type C, such classes generally come with a special additional parameter V, called the implementation variant. The parameter V plays a similar role as traits classes in usual C++ terminology [1, Chapter 2]. The variant V does not impact the internal representation of instances of the class, but it does control the way in which basic functions manipulate such instances.

For example, the class vector<C,V> (defined in basix/vector.hpp) corresponds to dense vectors with entries in C, stored in a contiguous segment of memory. A vector of this type consists of a pointer of type C* to the first element of the vector, the size

of the vector, and the size

of the allocated memory. For the sake of simplicity we omit that our vectors are endowed with reference counters. At the top level user interface, for instance, the sum of two vectors is defined as follows:

template<typename C, typename V>
vector<C,V> operator + (const vector<C,V>& v, const vector<C,V>& w) {
  typedef implementation<vector_linear, V> Vec;
  nat n= N(v); nat l= aligned_size<C,V> (n);
  C* t= mmx_new<C> (l);
  Vec::add (t, seg (v), seg (w), n);
  return vector<C,V> (t, n, l); }

In this piece of code N(v) represents the size of

, aligned_size<C,V> (n) computes the length to be allocated in order to store vectors of size n over C in memory. According to the values of C and V, we can force the allocated memory segment to start at an address multiple of a given value, such as 16 bytes when using SSE instructions, or 32 bytes when AVX support is enabled. The function mmx_new<C> is a reimplementation in Mathemagix of the usual new function, which allows faster allocation of objects of small sizes. Finally, the data t, n, l are stored into an instance of vector<C,V>. The allocated memory is released once the vector is not used anymore, i.e. when its reference counter becomes zero. The core of the computations is implemented in the static member function add of the class implementation<vector_linear,V> explained in the next paragraphs. At the implementation level, operations are usually performed efficiently on the object representations. The expression seg (v) returns the first address of type C* of the memory segment occupied by v.

4.2.Algorithms and implementations

The classes containing the implementations are specializations of the following class:

The first template argument F is usually an empty class which names a set of functionalities. In the latter example we introduced vector_linear, which stands for the usual entry-wise vector operations, including addition, subtraction, component-wise product, etc. The value of the argument V for naive implementations of vector operations is vector_naive. The role of the third argument W will be explained later. The naive implementation of the addition of two vectors is then declared as a static member of implementation<vector_linear, vector_naive> as follows:

template<typename V>
struct implementation<vector_linear, V, vector_naive> {
  static inline void
  add (C* dest, const C* s1, const C* s2, nat n) {
    for (nat i= 0; i < n; i++)
      dest[i]= s1[i] + s2[i]; }                                    ../..

Four by four loop unrolling can for instance be implemented within another variant, say vector_unrolled_4, as follows:

template<typename V>
struct implementation<vector_linear, V, vector_unrolled_4> {
  static inline void
  add (C* dest, const C* s1, const C* s2, nat n) {
    nat i= 0;
    for (; i + 4 < n; i += 4) {
      dest[i  ]= s1[i  ] + s2[i  ]; dest[i+1]= s1[i+1] + s2[i+1];
      dest[i+2]= s1[i+2] + s2[i+2]; dest[i+3]= s1[i+3] + s2[i+3]; }
    for (; i < n; i++)
      dest[i]= s1[i] + s2[i]; }                                    ../..

When defining a new variant we hardly ever want to redefine the whole set of functionalities of other variants. Instead we wish to introduce new algorithms for a subset of functionalities, and to have the remaining implementations inherit from other variants. We say that a variant V inherits from W when the following partial specialization is active:

template<typename F, typename U>
struct implementation<F,U,V>: implementation<F,U,W> {};

For instance, if the variant V inherits from vector_naive, then the add function from implementation<vector_linear,V> is inherited from implementation<vector_linear, vector_naive>, unless the following partial template specialization is implemented: template<typename U> implementation<vector_linear,U,V>.

It remains to explain the role of the three parameters of implementation. In fact the second parameter keeps track of the top level variant from which V inherits. Therefore, in a static member of implementation<F,U,V>, when one needs to call a function related to another set of functionalities G, then it is fetched in implementation<G,U>. In order to illustrate the interest of this method, let us consider polynomials in the next paragraphs.

Our class polynomial<C,V> (defined in algebramix/polynomial.hpp) represents polynomials with coefficients in C using implementation variant V. Each instance of a polynomial is represented by a vector, that is a pointer with a reference counter to a structure containing a pointer to the array of coefficients of type C* with its allocated size, and an integer for the length of the considered polynomial (defined as the degree plus one). The set of functionalities includes linear operations, mainly inherited from those of the vectors (since the internal representations are the same), multiplication, division, composition, Euclidean sequences, subresultants, evaluations, Chinese remaindering, etc. All these operations are implemented for the variant polynomial_naive (in the file algebramix/polynomial_naive.hpp) with the most general but naive algorithms (with quadratic cost for multiplication and division).

The variant polynomial_dicho<W> inherits from the parameter variant W and contains implementations of classical divide and conquer algorithms: Karatsuba for the product, Sieveking's polynomial division, half-gcd, divide and conquer evaluation and interpolations [26, Chapters 8–11]. Polynomial product functions belong to the set of functionalities polynomial_multiply, and division functions to the set polynomial_divide. The division functions of polynomial_dicho<W> are implemented as static members of

template<typename U, typename W>
struct implementation<polynomial_divide,U,polynomial_dicho<W> >

They make use of the product functions of implementation<polynomial_multiply,U>.

Let us now consider polynomials with modular integer coefficients and let us describe how the Kronecker substitution product is plugged in. In a nutshell, the coefficients of the polynomials are first lifted into integers. The lifted integer polynomials are next evaluated at a sufficiently large power of two, and we compute the product of the resulting integers. The polynomial product is retrieved by splitting the integer product into chunks and reducing them by the modulus. For details we refer the reader for instance to [26, Chapter 8]. As to our implementation, we first create the new variant polynomial_kronecker<W> on top of another variant W (see file algebramix/polynomial_kronecker.hpp), which inherits from W, but which only redefines the implementation of the product in

the product functions in implementation<polynomial_multiply, K> correspond to the Kronecker substitution. The functions in implementation<polynomial_division, K> are inherited from

implementation<polynomial_division, K,
               polynomial_dicho<polynomial_naive> >

and thus use Sieveking's division algorithm. The divisions rely in their turn on the multiplication from implementation<polynomial_multiply, K>, which benefits from Kronecker substitution. In this way, it is very easy for a user to redefine a new variant and override given functions a posteriori, without modifying existing code.

Finally, for a given mathematical template type, we define a default variant as a function of the remaining template parameters. For instance, the default variant of the parameter V in vector<C,V> is typename vector_variant<C>::type which is intended to provide users with reasonable performance. This default value is set to vector_naive, but can be overwritten for special coefficients. The default variant is also the default value of the variant parameter in the declaration of vector. Users can thus simply write vector<C>. The same mechanism applies to polynomials, series, matrices, etc.

4.3.Modular integers

In Mathemagix, moduli are stored in the dedicated class modulus<C,V> (from numerix/modulus.hpp). This abstraction allows us to attach extra information to the modulus, such as a pre-inverse. Modular numbers are instances of modular<M,W> (from numerix/modular.hpp), where M is a modulus type, and W is a variant specifying the way how the modulus is stored (e.g. in a global variable, or as an additional field for each modular number). Scalar arithmetic over integer (resp. numeric) types is implemented in numerix/modular_int.hpp (resp. numerix/modular_double.hpp). For convenience, packed integer and numeric types are wrapped into C++ classes so that modular operations can easily be implemented for them. Let us mention that vectors of modular numbers have a specific implementation variant allowing to share the same modulus when performing entry-wise operations. In this way we avoid fetching the modulus for each arithmetic operation.

Operations on vectors of integer and numeric types are implemented in a hierarchy of variants. One major variant controls the way loops are unrolled. Another important variant is dedicated to memory alignement.

5.Fast Fourier transform and applications

In order to benefit from vectorized modular arithmetic within the fast Fourier transform, we implemented a vectorized variant of the classical in-place algorithm. In this section, we describe this variant and its applications to polynomial and matrix multiplication.

5.1.Vectorized truncated Fourier transform

Let

be a commutative field, let

with

, and let

be a primitive

-th root of unity, which means that

, and

for all

. Let

be a

-vector space. The fast Fourier transform (with respect to

) of an

-tuple

is the

-tuple

with

In other words,

, where

denotes the element

. If

has binary expansion

, then we write

for the bitwise mirror of

in length

. Following the terminology of [40], the truncated Fourrier transform (TFT) of an

-tuple

(with respect to

) is

The classical FFT and TFT algorithms (see [37, 40], for instance) do not directly exploit low level vectorization. Let

be a divisor of

. The following algorithm reduces one TFT of size

over

into one TFT over

of size

and into one TFT of size

over

Input
Output: for all , where , and .

For convenience we consider that for all .

Let for all .

Compute .
Let for all .
Compute .
Return .

Proof. Let

for

, so that we have

and

for all

. Let

and

. A straightforward calculation leads to

By [40, Theorem 1] step 1 can be done with

operations in

. Step 2 involves

operations and step 3 takes

more operations.

Inverting Algorithm 1 is straightforward: it suffices to invert steps from 4 to 1 and use the inverse of the TFT. If

, then Algorithm 1 can be used to compute the FFT. If

is taken to be the size corresponding to a machine vector, then most of the TFT can be performed using SIMD operations. For sizes of order at most a few kilobytes we actually use this strategy. In larger sizes it is preferable to take

much larger, for instance of order

, so that the TFT runs on rather large vectors. With

of order

, this algorithm is very close to the cache-friendly version of the TFT designed in [36].

A critical point in large sizes becomes the matrix transposition, necessary to reorganize data in steps 1 and 4 of Algorithm 1. We designed ad hoc cache-friendly SIMD versions for it. Table 11 reports on timings obtained in this way for

with

in Mathemagix, with Barrett's approach. All the necessary primitives roots and powers are pre-computed and cached in memory. We observe significant speed-ups for the SSE and AVX versions.

Table 12 concerns FFTs for

and

. We compare NTL 6.1.0 with our FFT implementation, relying on AVX and FMA.

For the sake of comparison, we also report on the performance of the FFT over complex numbers in double precision. Table 13 compares timings provided by the command test/bench bundled with FFTW version 3.3.4 [25] (configured with the –enable-avx option), to the Mathemagix implementation in algebramix/fft_split_radix.hpp.

5.2.Polynomial matrix product

One major application of the TFT is the computation of polynomial products. In Table 14 we provide timings for multiplying two polynomials of degrees

over

. For this task, we compare NTL and FLINT 2.4.3 [34] to our implementations. The row “Kronecker” corresponds to the classical Kronecker substitution algorithm [26, Section 8.4] that reduces the product to one integer product, for which we appeal to GMP version 6.0.0a [31]. The last row concerns the AVX 2 version of our modular TFT. Here an important fact must be noticed. Previous timings of TFT were obtained as average on several runs on the same input. An overhead thus applies when performing a polynomial product where two direct TFT and one inverse TFT have to be performed. Taking also into account zero padding and the entry-wise vector product in between, the total time sensibly exceeds three times the one displayed in Table 11.

is a field with sufficiently many primitive roots of unity of order a power of two, then two

matrices

and

with coefficients in

of degrees

can be multiplied by performing TFT transforms of length

on each of the coefficients of

and

, by multiplying

matrices over

, and finally by recovering the matrix product through coefficient-wise inverse transforms. This requires

operations in

, where

is the exponent of

matrix multiplication over

. In general, most of the time is spent in the matrix products over

. Nevertheless, if

remains sufficiently small with respect to

, then most of the time is spent in the fast Fourier transforms, and our method becomes most efficient. Table 15 compares this approach to naive multiplication, and to FLINT.

5.3.Integer matrix product

Another important application of fast Fourier transforms is integer multiplication. The method that we have implemented is based on Kronecker segmentation and the three-prime FFT (see [55] and [26, Section 8.3]). Let

and

be three prime numbers. The two integers

and

of at most

bits to be multiplied are split into chunks of a suitable bit-size

and converted into polynomials

and

of degrees

, with

and

. The maximum bit-size of the coefficients of the product

is at most

. The parameter

is taken such that

is minimal under the constraint that

. The polynomial

can then be recovered from its values computed modulo

and

using TFT multiplications. Table 16 shows the performance obtained by this approach for

and

. We compare with the timings of GMP.

Similarly to polynomial matrices, small matrices over large integers can be multiplied efficiently using modular TFTs. In Table 17, the row “Mathemagix” shows timings for this approach. The row “Naive” corresponds to the classical product calling GMP functions directly. We also compare to FLINT and LinBox version 1.3.2 [18] (using NTL and FFlas-FFpack 1.6.0 [19, 20]).

Conclusion

Nowadays SIMD technology has widely proved to be extremely efficient for numerical computations. A major conclusion of the present work is that modular integers, and therefore exact and symbolic computations can also greatly benefit from this technology via integer types, and even in small sizes. We believe that the dissemination of the recent AVX-512 and other forthcoming extensions with wider vectors will be very profitable.

The use of C++ as a programming language allowed us to develop generic template libraries inside Mathemagix, which are both efficient and uniform. Nevertheless, ensuring good performance for the implementations presented here was not an easy task. Depending on the presence or absence of specific SIMD instructions, we had to implement a large number of variants for the same algorithms. Template programming turns out to be essential to control loop unrolling and memory alignment. Furthermore, the efficiency of several elementary routines for small fixed sizes could only be ensured through hand optimization of very specific pieces of code. For instance, several FFTs in small fixed sizes, say the size of SIMD registers, cannot be easily vectorized optimally. Here hardware transposition routines would be helpful, for example for

matrices of 32-bit integers using AVX 2 registers. On the longer run, we expect that part of the work that we did by hand could be done by an optimizing compiler. One major challenge for compilers and the design of programming languages is the automatic generation of FFTW3-style codelets.

As a final comment we would like to emphasize that SIMD technology can be seen as a low level way to parallelize computations. We believe that bringing easy access to this technology in Mathemagix will be fruitful to non specialist computer algebra developers. Currently, such developers are faced to monolithic software such as Maple^TM [28] or Magma [14], which very well cover low and high user-level mathematical functions, but lack of general lower level programming facilities. Some recent C++ and Python projects [56] aim at providing frameworks for large scale scientific programming, such as [13, 59], but all gaps have not been filled yet. One example of an interesting practical challenge is the multiplication of dense matrices with large integer coefficients. Indeed, this was one of the programming contest of the PASCO 2010 conference, and our benchmarks in Section 5.3 show that there was indeed room for progress.

Bibliography

[1]: D. Abrahams and A. Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond. Addison Wesley, 2004.
[2]: A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The design and analysis of computer algorithms. Addison-Wesley series in computer science and information processing. Addison-Wesley Pub. Co., 1974.
[3]: R. Alverson. Integer division using reciprocals. In Proceedings of the Tenth Symposium on Computer Arithmetic, pages 186–190. IEEE Computer Society Press, 1991.
[4]: H. G. Baker. Computing A*B (mod N) efficiently in ANSI C. SIGPLAN Not., 27(1):95–98, 1992.
[5]: B. Bank, M. Giusti, J. Heintz, G. Lecerf, G. Matera, and P. Solernó. Degeneracy loci and polynomial equation solving. Accepted for publication to Foundations of Computational Mathematics. Preprint available from http://arxiv.org/abs/1306.3390, 2013.
[6]: N. Bardis, A. Drigas, A. Markovskyy, and J. Vrettaros. Accelerated modular multiplication algorithm of large word length numbers with a fixed module. In M. D. Lytras, P. Ordonez de Pablos, A. Ziderman, A. Roulstone, H. Maurer, and J. B. Imber, editors, Organizational, Business, and Technological Aspects of the Knowledge Society, volume 112 of Communications in Computer and Information Science, pages 497–505. Springer Berlin Heidelberg, 2010.
[7]: P. Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In A. Odlyzko, editor, Advances in Cryptology – CRYPTO' 86, volume 263 of Lect. Notes Comput. Sci., pages 311–323. Springer Berlin Heidelberg, 1987.
[8]: D. J. Bernstein, Hsueh-Chung Chen, Ming-Shing Chen, Chen-Mou Cheng, Chun-Hung Hsiao, Tanja Lange, Zong-Cing Lin, and Bo-Yin Yang. The billion-mulmod-per-second PC. In SHARCS'09 Special-purpose Hardware for Attacking Cryptographic Systems: 131, 2009. http://cr.yp.to/djb.html.
[9]: D. J. Bernstein, Tien-Ren Chen, Chen-Mou Cheng, Tanja Lange, and Bo-Yin Yang. ECM on graphics cards. In A. Joux, editor, Advances in Cryptology - EUROCRYPT 2009, volume 5479 of Lect. Notes Comput. Sci., pages 483–501. Springer Berlin Heidelberg, 2009.
[10]: J. Berthomieu, G. Lecerf, and G. Quintin. Polynomial root finding over local rings and application to error correcting codes. Appl. Alg. Eng. Comm. Comp., 24(6):413–443, 2013.
[11]: J. Berthomieu, J. van der Hoeven, and G. Lecerf. Relaxed algorithms for -adic numbers. J. Théor. Nombres Bordeaux, 23(3):541–577, 2011.
[12]: D. Bini and V. Y. Pan. Polynomial and Matrix Computations: Fundamental Algorithms. Progress in Theoretical Computer Science. Birkhauser Verlag GmbH, 2012.
[13]: Boost team. Boost (C++ libraries). Software available at http://www.boost.org, from 1999.
[14]: W. Bosma, J. Cannon, and C. Playoust. The Magma algebra system. I. The user language. J. Symbolic Comput., 24(3-4):235–265, 1997.
[15]: A. Bosselaers, R. Govaerts, and J. Vandewalle. Comparison of three modular reduction functions. In D. R. Stinson, editor, Advances in Cryptology — CRYPTO' 93, volume 773 of Lect. Notes Comput. Sci., pages 175–186. Springer Berlin Heidelberg, 1994.
[16]: British Standards Institution. The C standard: incorporating Technical Corrigendum 1: BS ISO/IEC 9899/1999. John Wiley, 2003.
[17]: CLANG, a C language family frontend for LLVM. Software available at http://clang.llvm.org, from 2007.
[18]: J.-G. Dumas, T. Gautier, C. Pernet, and B. D. Saunders. LinBox founding scope allocation, parallel building blocks, and separate compilation. In K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama, editors, Mathematical Software – ICMS 2010, volume 6327 of Lect. Notes Comput. Sci., pages 77–83. Springer Berlin Heidelberg, 2010.
[19]: J.-G. Dumas, P. Giorgi, and C. Pernet. FFPACK: Finite field linear algebra package. In J. Schicho, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 119–126. ACM Press, 2004.
[20]: J.-G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over word-size prime fields: The FFLAS and FFPACK packages. ACM Trans. Math. Softw., 35(3):19:1–19:42, 2008.
[21]: A. Fog. Instruction tables. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. http://www.agner.org/optimize, Copenhagen University College of Engineering, 2012.
[22]: A. Fog. Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms. http://www.agner.org/optimize, Copenhagen University College of Engineering, 2012.
[23]: A. Fog. Optimizing subroutines in assembly language. An optimization guide for x86 platforms. http://www.agner.org/optimize, Copenhagen University College of Engineering, 2012.
[24]: L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmermann. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Software, 33(2), 2007. Software available at http://www.mpfr.org.
[25]: M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93(2):216–231, 2005.
[26]: J. von zur Gathen and J. Gerhard. Modern computer algebra. Cambridge Univ. Press, 2nd edition, 2003.
[27]: GCC, the GNU Compiler Collection. Software available at http://gcc.gnu.org, from 1987.
[28]: K. Geddes, G. Gonnet, and Maplesoft. Maple (TM). http://www.maplesoft.com/products/maple, from 1980.
[29]: P. Giorgi, Th. Izard, and A. Tisserand. Comparison of modular arithmetic algorithms on GPUs. In B. Chapman, F. Desprez, G. R. Joubert, A. Lichnewsky, F. Peters, and Th. Priol, editors, Parallel Computing: From Multicores and GPU's to Petascale, volume 19 of Advances in Parallel Computing, pages 315–322. IOS Press, 2010.
[30]: P. Giorgi and R. Lebreton. Online order basis algorithm and its impact on block Wiedemann algorithm. In Proceedings of the 2014 International Symposium on Symbolic and Algebraic Computation, ISSAC '14. ACM Press, 2014. To appear.
[31]: T. Granlund et al. GMP, the GNU multiple precision arithmetic library, from 1991. Software available at http://gmplib.org.
[32]: T. Granlund and P. L. Montgomery. Division by invariant integers using multiplication. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, PLDI '94, pages 61–72, New York, NY, USA, 1994. ACM Press.
[33]: Sardar Anisul Haque and Marc Moreno Maza. Plain polynomial arithmetic on GPU. Journal of Physics: Conference Series, 385(1):012014, 2012.
[34]: W. Hart and the FLINT Team. FLINT: Fast Library for Number Theory, from 2008. Software available at http://www.flintlib.org.
[35]: W. Hart and the MPIR Team. MPIR, Multiple Precision Integers and Rationals, from 2010. Software available at http://www.mpir.org.
[36]: D. Harvey. A cache-friendly truncated FFT. Theoret. Comput. Sci., 410(27–29):2649–2658, 2009.
[37]: D. Harvey and D. S. Roche. An in-place truncated Fourier transform and applications to polynomial multiplication. In S. M. Watt, editor, Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation, ISSAC '10, pages 325–329, New York, NY, USA, 2010. ACM Press.
[38]: D. Harvey and A. V. Sutherland. Computing Hasse–Witt matrices of hyperelliptic curves in average polynomial time. Algorithmic Number Theory 11th International Symposium (ANTS XI), 2014. To appear.
[39]: W. Hasenplaugh, G. Gaubatz, and V. Gopal. Fast modular reduction. In P. Kornerup and J.-M. Muller, editors, 18th IEEE Symposium on Computer Arithmetic, ARITH '07, pages 225–229. IEEE Computer Society, 2007.
[40]: J. van der Hoeven. The truncated Fourier transform and applications. In J. Schicho, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 290–296. ACM Press, 2004.
[41]: J. van der Hoeven and G. Lecerf. Interfacing Mathemagix with C++. In M. Monagan, G. Cooperman, and M. Giesbrecht, editors, Proceedings of the 2013 International Symposium on Symbolic and Algebraic Computation, ISSAC '13, pages 363–370. ACM Press, 2013.
[42]: J. van der Hoeven and G. Lecerf. Mathemagix User Guide. HAL, 2013.
http://hal.archives-ouvertes.fr/hal-00785549.
[43]: J. van der Hoeven and G. Lecerf. On the bit-complexity of sparse polynomial and series multiplication. J. Symbolic Comput., 50:227–254, 2013.
[44]: J. van der Hoeven, G. Lecerf, B. Mourain, Ph. Trébuchet, J. Berthomieu, D. Diatta, and A. Mantzaflaris. Mathemagix, the quest of modularity and efficiency for symbolic and certified numeric computation. ACM SIGSAM Communications in Computer Algebra, 177(3), 2011. In Section "ISSAC 2011 Software Demonstrations", edited by M. Stillman, p. 166–188.
[45]: J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002–2014. Software available at http://www.mathemagix.org.
[46]: Intel Corporation. Intel (R) intrinsics guide. Version 3.0.1, released 7/23/2013.
http://software.intel.com/en-us/articles/intel-intrinsics-guide.
[47]: Intel Corporation, 2200 Mission College Blvd., Santa Clara, CA 95052-8119, USA. Intel (R) Architecture Instruction Set Extensions Programming Reference, 2013. Ref. 319433-015.
http://software.intel.com/en-us/intel-isa-extensions.
[48]: Ç. Kaya Koç. Montgomery reduction with even modulus. IEE Proceedings - Computers and Digital Techniques, 141(5):314–316, 1994.
[49]: Ç. Kaya Koç, T. Acar, and Jr. Kaliski, B. S. Analyzing and comparing Montgomery multiplication algorithms. Micro, IEEE, 16(3):26–33, 1996.
[50]: D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Pearson Education, 3rd edition, 1997.
[51]: G. Lecerf. Mathemagix: towards large scale programming for symbolic and certified numeric computations. In K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama, editors, Mathematical Software - ICMS 2010, Third International Congress on Mathematical Software, Kobe, Japan, September 13-17, 2010, volume 6327 of Lect. Notes Comput. Sci., pages 329–332. Springer, 2010.
[52]: P. L. Montgomery. Modular multiplication without trial division. Math. Comp., 44(170):519–521, 1985.
[53]: M. Moreno Maza and Y. Xie. FFT-Based Dense Polynomial Arithmetic on Multi-cores. In D. J. K. Mewhort, N. M. Cann, G. W. Slater, and T. J. Naughton, editors, High Performance Computing Systems and Applications, volume 5976 of Lect. Notes Comput. Sci., pages 378–399. Springer Berlin Heidelberg, 2010.
[54]: N. Nedjah and L. de Macedo Mourelle. A review of modular multiplication methods and respective hardware implementations. Informatica, 30(1):111–129, 2006.
[55]: J. M. Pollard. The fast Fourier transform in a finite field. Math. Comp., 25(114):365–374, 1971.
[56]: G. van Rossum and J. de Boer. Interactively testing remote servers using the Python programming language. CWI Quarterly, 4(4):283–303, 1991. Software available at http://www.python.org.
[57]: M. J. Schulte, J. Omar, and E. E. Jr. Swartzlander. Optimal initial approximations for the Newton-Raphson division algorithm. Computing, 53(3-4):233–242, 1994.
[58]: V. Shoup. NTL: A Library for doing Number Theory, 2014. Software, version 6.1.0.
http://www.shoup.net/ntl.
[59]: W. A. Stein et al. Sage Mathematics Software. The Sage Development Team, from 2004. Software available at http://www.sagemath.org.
[60]: E. Thomé. Théorie algorithmique des nombres et applications à la cryptanalyse de primitives cryptographiques. http://www.loria.fr/~thome/files/hdr.pdf, 2012. Mémoire d'habilitation à diriger des recherches de l'Université de Lorraine, France.

Registered trademarks. Intel, Xeon, Intel Core, and MMX are trademarks of Intel Corporation in the U.S. and/or other countries. AMD is a trademark of Advanced Micro Devices, Inc. Debian is a registered trademark of Software in the Public Interest, Inc. in the U.S., the European Union, China, Japan, and Brazil. Linux is a registered trademark of Linus Torvalds in the U.S. and other countries. Cray is a trademark of Cray Computer Corporation in the U.S. and other countries. Maple is a trademark of Waterloo Maple Inc. Other trademarks used on this article are the property of their respective owners.

1.Introduction

Motivations

Related work

Our contributions

Conventions

Brief survey of SIMD technology

2.Modular operations via integer types

2.1.Modular sum

2.1.1.Unvectorized implementations

2.1.2.Implementations with SSE 4.2 and AVX 2

2.1.3.Timings

2.2.Barrett's product

2.2.1.Reduction

2.2.2.Several products by the same multiplicand

2.2.3.Implementations with SSE 4.2 and AVX 2

2.2.4.Timings

2.3.Montgomery's product

2.3.1.Reduction

2.3.2.Timings

3.Modular operations via numeric types

3.1.Notations

3.2.Reduction in half size

3.3.Larger modular products via FMA

3.4.Vectorized implementations

3.5.Timings

4.Implementation design in C++

4.1.Data structures and top level functions

4.2.Algorithms and implementations

4.3.Modular integers

5.Fast Fourier transform and applications

5.1.Vectorized truncated Fourier transform

5.2.Polynomial matrix product

5.3.Integer matrix product

Conclusion

Bibliography