
Modular integer arithmetic occurs in many algorithms for
computer algebra, cryptography, and error correcting codes.
Although recent microprocessors typically offer a wide range
of highly optimized arithmetic functions, modular integer
operations still require dedicated implementations. In this
article, we survey existing algorithms for modular integer
arithmetic, and present detailed vectorized counterparts. We
also present several applications, such as fast modular
Fourier transforms and multiplication of integer polynomials
and matrices. The vectorized algorithms have been implemented
in C++ inside the free computer algebra and analysis system

Categories and Subject Descriptors: G.4 [Mathematical Software]: Parallel and vector implementations
General Terms: Algorithms, Performance
Additional Key Words and Phrases: modular integer arithmetic, fast Fourier transform, integer product, polynomial product, matrix product, Mathemagix
During the past decade, major manufacturers of microprocessors have
changed their focus from ever increasing clock speeds to putting as many
cores as possible on one chip and to lower power consumption. One
approach followed by leading constructors such as
As powerful as multicore architectures may be, this technology also comes with its drawbacks. Besides the increased development cost of parallel algorithms, the main disadvantage is that the high degree of concurrency allowed by multicore architecture often constitutes an overkill. Indeed, many computationally intensive tasks ultimately boil down to classical mathematical building blocks such as matrix multiplication or fast Fourier transforms (FFTs).
In many cases, the implementation of such building blocks is better
served by simpler parallel architectures, and more particularly by the
Single Instruction, Multiple Data (
More specifically, this paper concerns the efficiency of the
One important mathematical building block for
The aim of this paper is twofold. First of all, we adapt known
algorithms for modular integer arithmetic to the
The descriptions of the algorithms are selfcontained and we provide
implementation details for the
There is large literature on modular integer arithmetic. The main operation to be optimized is multiplication modulo a fixed integer . This requires an efficient reduction of the product modulo . The naive way of doing this reduction would be to compute the remainder of the division of the product by . However, divisions tend to be more expensive than multiplications in current hardware. The classical solution to this problem is to precompute the inverse so that all divisions by can be transformed into multiplications.
In order to compute the product of and modulo ,
it suffices to compute the integer floor part of
the rational number and to deduce . The most obvious algorithm is to compute
using floating point arithmetic. However, this approach suffers from a
few drawbacks. First, in order to ensure portability, the processor
should comply to the
Barrett's algorithm [7] provides an alternative to the floating point approach. The idea is to rescale the floating point inverse of and to truncate it into an integer type. For some early comparisons between integer and floating point operations and a branchfree variant, we refer to [32]. This approach is also discussed in [23, Section 16.9] for particular processors. For multiple precision integers, algorithms were given in [6, 39].
Another alternative approach is to precompute the inverse of the modulus as a adic integer. This technique, which is essentially equivalent to Montgomery's algorithm [52], only uses integer operations, but requires to be odd. Furthermore, modular integers need to be encoded and decoded (with a cost similar to one modular product), which is not always convenient. Implementations have been discussed in [49]. A generalization to even moduli was proposed in [48]. It relies on computing separately modulo and (such that is odd, and ) via Chinese remaindering. Unfortunately, this leads to a significant overhead for small moduli. Comparisons between Barrett's and Montgomery's product were given in [15] for values of corresponding to a few machine words. A recent survey can be found in [54] with a view towards hardware implementation.
Modular arithmetic and several basic mathematical operations on
polynomials and matrices have been implemented before on several types
of parallel hardware (multicore architectures and
Our applications to matrix product over large polynomials or integers are central tasks to many algorithms in computer algebra [2, 12, 26, 50]. For a sample of recent targeted implementations, the reader might consult [30, 38, 53].
One difficulty with modular arithmetic is that the most efficient
algorithms are strongly dependent on the bitsize of the modulus , as well as the availability and efficiency of
specific instructions in the hardware. This is especially so in the case
of
The fast sequential algorithms for modular reduction from the previous
section all involve some branching in order to counterbalance the effect
of rounding. Our first contribution is to eliminate all branching so as
to make vectorization possible and efficient. Our second contribution is
a complete implementation of the various approaches as a function of the
available
High performance libraries such as
A third contribution of this article is the application of our modular
arithmetic to an
Besides low level software design considerations, our main research
goals concern algorithms for solving polynomial systems, for effective
complex analysis, and error correcting codes.
Throughout this article, timings are measured on a platform equipped
with an
For completeness, we conclude this introduction with recalling basic
facts on
The scalar data types which are supported;
The total bitsize of vector registers;
The number of vector registers;
For each scalar data type , the instruction set for vector operations over ;
The instruction set for other operations on
Modern
Currently, the most common
In this paper, our
The types __m128i and __m128d
correspond to packed 128bit integer and floating point vectors.
They are supported by
The instruction _mm_add_epi64 corresponds to the addition of two vectors of 64bit signed integers of type __m128i (so the vectors are of length ).
The predicate _mm_cmpgt_epi64 corresponds to a componentwise test on two vectors of 64bit signed integers of type __m128i. The boolean results true and false are encoded by the signed integers 1 and 0.
The instruction _mm_mul_pd corresponds to the multiplication of two vectors of double precision floating point numbers of type __m128d (so the vectors are of length ).
We notice that all floating point arithmetic conforms to the IEEE754 standard. In particular, results of floating point operations are obtained through correct rounding of their exact mathematical counterparts, where the global rounding mode can be set by the user. Some processors also provide fused multiply add (FMA) and subtract instructions _mm_fmadd_pd and _mm_fmsub_pd which are useful in Section 3. Another less obvious but useful instruction that is provided by some processors is:
_mm_blendv_pd (, , ) returns a vector with whenever the most significant bit of is set and otherwise (for each ). For floating point numbers, it should be noticed that the most significant bit of corresponds to the sign bit.
For more details about
Let us mention that a standard way to benefit from
If and are nonnegative integers, then and represent the quotient and the remainder in the long division of by respectively. We let U denote an unsigned integer type of bitsize assumed to be at least for convenience. This means that all the integers in the range from to can be represented in U by their binary expansion. The corresponding type for signed integers of bitsize is written I: all integers from to can be represented in I. The type represents unsigned integers of size at least . Let be a nonnegative integer of bitsize at most with . For efficiency reasons we consider that and are quantities known at compilation time, whereas the actual value of is only known at execution time. In fact, deciding which elementary modular arithmetic algorithm to use at runtime according to the bitsize of is of course too expensive. For convenience we identify the case to computing modulo . Modulo integers are stored as their representative in .
Although modular sum seems rather easy at first look, we quickly summarize the known techniques and pitfalls.
Given and modulo , in order to compute , we can first compute in U and subtract when an overflow occurs or when , as detailed in the following function:
Function
U = + ;
if ( < ) return  ;
return ( >= ) ?  : ; }
Of course, when , no overflow occurs in the sum of line 1, and line 2 is useless. If branching is more expensive than shifting and if , then one can compute I = +  and return + (( >> (1)) & ), where we recall that implements the right arithmetic shift. It is important to program both versions and determine which approach is the fastest for a given processor. Negation and subtraction can be easily implemented in a similar manner.
Arithmetic operations on packed integers are rather well supported by
Function
__m128i = _mm_add_epi32 (, );
return _mm_min_epu32 (, _mm_sub_epi32 (, )); }
Since the min operation does not exist on packed 64bit integers, we use the following function, where represents the packed bit integer of type __m128i filled with :
Function
__m128i = _mm_sub_epi64 (_mm_add_epi64 (, ), );
__m128i = _mm_cmpgt_epi64 (, );
return _mm_add_epi64 (, _mm_and_si128 (, )); }
If , we can proceed as follows: equals if, and only if, . In the latter case can be obtained as , and otherwise as . If , an overflow only occurs when . Nevertheless, max (,  ) equals when computed in U and  (  ) is the correct value. These calculations can be vectorized for packed integers of bitsize , 16, and as follows:
Function
__m128i = _mm_sub_epi32 (, );
__m128i = _mm_cmpeq_epi32 (_mm_max_epu32 (, ), );
__m128i = _mm_andnot_si128 (, );
return _mm_add_epi32 (_mm_sub_epi32 (, ), ); }
The minimum operator and comparisons do not exist for packed 64bit integers so we declare the function _mm_cmpgt_epu64 (__m128i , __m128i ) as:
_mm_cmpgt_epi64 (_mm_sub_epi64 (, ), _mm_sub_epi64 (, ))
where represents the packed 64bit integer filled with . The modular addition can be realized as follows:
Function
__m128i = _mm_add_epi64 (, );
__m128i = _mm_or_si128 (_mm_cmpgt_epu64 (, ),
_mm_cmpgt_epu64 (,  ));
return _mm_sub_epi64 (, _mm_and_si128 (, )); }
It is straightforward to adapt these functions to the
Table 1 displays timings for arithmetic operations over
integer types of all possible bitsizes
supported by the compiler. Timings are the average number of clock
cycles when applying the addition on two vectors of bytesize 4096
aligned in memory, and writing the result into the first vector. In
particular, timings include load and store instructions. The row
Scalar corresponds to disabling vectorization extensions with
the command line option fnotreevectorize of the
compiler. For better performance, loops are unrolled and the
optimization option O3 of the compiler is used. For
conciseness, we restrict tables to addition; subtraction and negation
behave in a similar way. Operations which are not supported are
indicated by



Table 2 shows timings for modular sums. In absence of vectorization, 8bit and 16bit arithmetic operations are in fact performed by 32bit operations. Indeed, 8bit and 16bit arithmetic is handled in a suboptimal way by current processors and compilers. For the vectorized implementations, we observe significant speedups when . Nevertheless, when , the lack of performance is not dramatic enough to justify the use of larger integers and double memory space.



The first modular product we describe is the one classically attributed to Barrett. It has the advantage to operate on integer types with no assumption on the modulus. For any nonnegative real number , we write for the largest integer less or equal to , and for the smallest integer greater or equal to . We use the following auxiliary quantities:
the nonnegative integer of with ;
nonnegative integers and such that and ;
the integer represents an approximation of a rescaled numerical inverse of .
Since , the integer fits in U. We call the preinverse of . Since , the computation of just requires one division in . In this subsection both integers and are assumed to be of type U. We describe below how to set suitable values for and in terms of (see Table 3).
Let be one more auxiliary quantity with . If is an integer such that , Barrett's algorithm computes as follows:
Function
L = >> ;
L = ( * ) >> ;
L =  * ;
while ( >= ) =  ;
return ; }
Proof. From , it follows that . Therefore, has bitsize at most , and . Since the product fits in L. From
If is an integer modulo , and and are sequences of integers modulo , computing is a central task to matrix and polynomial products. In order to minimize the number of reductions to be done, we wish to take as large as possible. In general, in Barrett's algorithm, we can always take , and , which leads to . When , we can achieve if we restrict to with and . When , it is even better to take and so that when . When , then we let and when to get . These possible settings are summarized in Table 3.



We could consider taking instead of . From and we obtain . Therefore the inequalities imply that fits in U. Using instead of in Function 6 leads to the following inequalities:
If , then the term disappears. If is sufficiently small, then can be bounded by . Therefore line 4 of Function 6 can be discarded. More precisely, if we assume that and letting , , and , then we have the following faster function:
Function
U = ( * ((L) )) >> ;
return  * ; }
Remark
Assume that we wish to compte several modular products, where one of the
multiplicands is a fixed modular integer . A
typical application is the computation of
Function
U = ( * ((L) )) >> ;
L = ((L) ) *  ((L) ) * ;
return ( >= ) ?  : ; }
Notice that Function 8 does not depend on , , or . If , then line 2 can be replaced by U = *  * . If , then fits in U since , and Function 8 can be improved along the same lines as Function 7:
Function
U = ( * ((L) )) >> ;
return *  * ; }
Function 6 could be easily vectorized if all the necessary
elementary operations were available within the
Function
__m128i = _mm_unpacklo_epi16 (, );
__m128i = _mm_unpackhi_epi16 (, );
__m128i = _mm_unpacklo_epi16 (, );
__m128i = _mm_unpackhi_epi16 (, );
__m128i = _mm_mullo_epi32 (, );
__m128i = _mm_mullo_epi32 (, );
__m128i = _mm_srli_epi32 (, );
__m128i = _mm_srli_epi32 (, );
__m128i = _mm_srli_epi32 (_mm_mullo_epi32 (, ), );
__m128i = _mm_srli_epi32 (_mm_mullo_epi32 (, ), );
__m128i = _mm_packus_epi32 (, );
__m128i = _mm_sub_epi16 (_mm_mullo_epi16 (, ),
_mm_mullo_epi16 (, );
return _mm_min_epu16 (, _mm_sub_epi16 (, ));
If , then the computations up to line 10 are the same but the lines after are replaced by:
Under the only assumption that , ones needs to duplicate latter lines 13 and 14.
For packed 8bit integers since no packed product is natively available, we simply perform most of the computations over packed 16bit integers as follows:
Function
__m128i = _mm_unpacklo_epi8 (, );
__m128i = _mm_unpackhi_epi8 (, );
__m128i = _mm_unpacklo_epi8 (, );
__m128i = _mm_unpackhi_epi8 (, );
__m128i = _mm_mullo_epi16 (, );
__m128i = _mm_mullo_epi16 (, );
__m128i = _mm_srli_epi16 (, );
__m128i = _mm_srli_epi16 (, );
__m128i = _mm_srli_epi16 (_mm_mullo_epi16 (, ), );
__m128i = _mm_srli_epi16 (_mm_mullo_epi16 (, ), );
__m128i = _mm_sub_epi16 (, _mm_mullo_epi16 (, ));
__m128i = _mm_sub_epi16 (, _mm_mullo_epi16 (, ));
__m128i = _mm_packus_epi16 (, );
return _mm_min_epu16 (, _mm_sub_epi8 (, ));
If does not hold, then the same modifications as with packed 16bit integers are applied.
For packed bit integers, a similar extension using packing and unpacking instructions is not possible. We take advantage of the _mm_mul_epu32 instruction. If , then we use the following code:
Function
__m128i = _mm_mul_epu32 (, );
__m128i = _mm_srli_epi64 (, );
__m128i = _mm_srli_epi64 (_mm_mullo_epi32 (, ), );
__m128i = _mm_srli_si128 (, 4);
__m128i = _mm_srli_si128 (, 4);
__m128i = _mm_mul_epu32 (, );
__m128i = _mm_srli_epi64 (, );
__m128i = _mm_srli_epi64 (_mm_mullo_epi32 (, ), );
__m128i = _mm_blend_epi16 (, _mm_slli_si128 (, 4),4+8+64+128);
__m128i = _mm_or_si128 (, _mm_slli_si128 (, 4));
__m128i = _mm_sub_epi32 (, _mm_mullo_epi32 (, ));
return _mm_min_epu32 (, _mm_sub_epi32 (, ));
When , the same kind of modifications as before have to be done: must be computed with vectors of unsigned 64bit integers. Since these vectors do not support the computation of the minimum, one has to use _mm_cmpgt_epi64.
In Table 4 we display timings for multiplying machine integers, using the same conventions as in Table 1. Recall that packed bit integers have no dedicated instruction for multiplication: it is thus done through the 16bit multiplication via unpacking/packing.



Table 5 shows the performance of the above algorithms. The
row “Naive” corresponds to the scalar approach using the
In the scalar approach, compiler optimizations and hardware operations are not always well supported for small sizes, so it makes sense to perform computations on a larger size. On our test platform, 32bit integers typically provide the best performance. The resulting timings are given in the row “padded Barrett”. For 8bit integers, the best strategy is in fact to use lookup tables yielding cycles in average, but this strategy cannot be vectorized. Finally the last two rows correspond to the vectorized versions of Barrett's approach.



For larger integers, the performance is shown in Table 6. Let us mention that in the row “Barrett” with we actually make use of __int128 integer arithmetic.



Table 7 shows the average cost of one product with a fixed multiplicand. In comparison with Table 5, we notice a significant speedup, especially for the vectorial algorithms.
Remark
For Montgomery's algorithm [52], one needs to assume that is odd. Let be such that and let . We need the auxiliary quantities and defined by , , and . They can be classically computed with the extended Euclidean algorithm [26, Chapter 3].
For convenience we introduce . The core of Montgomery's algorithm is the following reduction function:
Function
U = ( * ) & ;
L = + * ;
U = >> m;
if ( < ) return  ;
return ( >= ) ?  : ; }
Let be a modulo integer. We say that is in Montgomery's representation if stored as . The product of two modular integers and , of respective Montgomery's representations and , can be obtained in Montgomery's representation by applying Function 13 to since .
If , then the mask in line 1 can be avoided, and the shift in line 3 might be more favorable than a general shift, according to the compiler. In total, if or , Montgomery's approach is expected to be faster than Barrett's one. Otherwise costs should be rather similar. Of course these cost considerations are rather informal and the real cost very much depends on the processor and the compiler.
Remark
Remark
Table 8 contains timings measured in the same manner as in the previous subsection. Compared to Tables 5 and 6 we observe that Montgomery's product is not interesting in the scalar case for 8 and 16bit integers but it outperforms Barrett's approach in larger sizes, especially when . Compared to Table 7, Montgomery's product is only faster for when and .



Instead of integer types, we can use numeric types such as float or double to perform modular operations. Let us write F such a type, and let represent the size of the mantissa of F minus one, i.e. 23 bits for float and 52 bits for double. This number of bits corresponds to the size of the so called trailing significant field of F, which is explicitly stored. The modular product of and begins with the computation of an integer approximation of . Then is an approximation of at distance . The constant hidden behind the latter depends on rounding modes. In this section we conform to IEEE754 standard. We first analyze the case when fills less than half of the mantissa. We next propose a general algorithm using the fused multiplyadd operation.
The following notations are used until the end of the present section. We write for the value of computed in F by applying the division operator on 1.0 and (F) . Still following IEEE754, the trailing significant field of is written and its exponent . These quantities precisely depend on the active rounding mode used to compute . But for all rounding modes, they satisfy the following properties:
(1) 
From the latter inequality we obtain
and thus
(2) 
Let be a real number of type F, and let F = * be the approximation of computed in F. Again, independently of the rounding mode, there exist integers and , for the trailing significant field and the exponent of , such that
(3) 
From we deduce
(4) 
We also use the approximation of 1.0 / ((F) ) computed in F with rounding mode set towards infinity, so that holds.
When fills no more than half of the mantissa,
that is when , it is possible to perform modular
products in F easily. Let the floor
function return the largest integral value less than or equal to its
argument, as defined in
Function
F = * ;
F = floor ();
F =  * ;
if ( >= ) return  ;
if ( < 0) return + ;
return ; }
Proof. Using (1) and (2) we obtain , hence the floor function actually returns in . From the decomposition
(5) 
we deduce
From (2), we have , and from (4) we deduce . It follows that hence that .
In the same way we did for Barrett's product, and under mild assumptions, we can improve the latter function whenever .
Function
F = * ;
F = floor ();
return  * ; }
Proof. We claim that
(6) 
By (2) and (4), the claim is satisfied as soon as , which is itself implied by , that is correct since it rewrites into by expanding the product.
From and (5) we deduce that
Remark
Inequality (6) finally implies .
Until now we have not exploited the whole mantissa of F. To
release the assumption in Function 14,
the value for could be computed over a
sufficiently large integer type. Over double one can use
bit integers, as implemented for instance in [58]. The drawbacks of this approach are the extra conversions
between numeric and integer types, and the fact that the vectorization
is compromised since bit integer products are
not natively available in the
Function
F = * ;
F = fma (, , );
F = * ;
F = floor ();
F = fma (, , );
F = + ;
if ( >= ) return  ;
if ( < 0) return + ;
return ; }
Proof. Let . We have , so that . We also verify that , which implies that . Using (5) again, the following inequalities hold:
From (2) we have , and from (4) we deduce . It follows that . In particular this implies , whence . This proves that and therefore , which finally implies the correctness of Function 16.
Remark
For efficiency, we assume that
Function
__m128d = _mm_add_pd (, );
__m128d = _mm_sub_pd (, );
return _mm_blendv_pd (, , ); }
Assuming that
Function
__m128d = _mm_mul_pd (, );
__m128d = _mm_fmsub_pd (, , );
__m128d = _mm_mul_pd (, );
__m128d = _mm_floor ();
__m128d = _mm_fnmadd_pd (, , );
__m128d = _mm_add_pd (, );
__m128d = _mm_sub_pd (, );
= _mm_blendv_pd (, , );
= _mm_add_pd (, );
return _mm_blendv_pd (, , ); }
Our
Table 9 is obtained under similar conditions as Tables 1 and 4, but for the types float
and double. The row “Scalar” corresponds to
the unvectorized implementation with unrolled loops, and preventing the
compiler from using autovectorization. The next rows concern timings
using



In Table 10 we have shown timings for the modular sum and
product functions from this section. The row “Scalar”
excludes autovectorization and does not use the processor builtin



We notice that the timings in Table 10 are interesting when compared to those in Table 5. However, for multiplications with a fixed multiplicand, the approach from Table 7 becomes more attractive, and this will indeed be used in Section 5 below.
We also notice that efficient hardware quadruple precision arithmetic
would allow us to consider moduli with larger bit sizes. An alternative
to hardware quadruple precision arithmetic would be to provide an
efficient “threesum” instruction
with correct
The
Consider a typical template class in
For example, the class vector<C,V> (defined in basix/vector.hpp) corresponds to dense vectors with entries in C, stored in a contiguous segment of memory. A vector of this type consists of a pointer of type C* to the first element of the vector, the size of the vector, and the size of the allocated memory. For the sake of simplicity we omit that our vectors are endowed with reference counters. At the top level user interface, for instance, the sum of two vectors is defined as follows:
template<typename C, typename V> vector<C,V> operator + (const vector<C,V>& v, const vector<C,V>& w) { typedef implementation<vector_linear, V> Vec; nat n= N(v); nat l= aligned_size<C,V> (n); C* t= mmx_new<C> (l); Vec::add (t, seg (v), seg (w), n); return vector<C,V> (t, n, l); }
In this piece of code N(v) represents the size of , aligned_size<C,V> (n) computes
the length to be allocated in order to store vectors of size n
over C in memory. According to the values of C
and V, we can force the allocated memory segment to
start at an address multiple of a given value, such as 16 bytes when
using
The classes containing the implementations are specializations of the following class:
template<typename F, typename V, typename W=V> struct implementation;
The first template argument F is usually an empty class which names a set of functionalities. In the latter example we introduced vector_linear, which stands for the usual entrywise vector operations, including addition, subtraction, componentwise product, etc. The value of the argument V for naive implementations of vector operations is vector_naive. The role of the third argument W will be explained later. The naive implementation of the addition of two vectors is then declared as a static member of implementation<vector_linear, vector_naive> as follows:
template<typename V> struct implementation<vector_linear, V, vector_naive> { static inline void add (C* dest, const C* s1, const C* s2, nat n) { for (nat i= 0; i < n; i++) dest[i]= s1[i] + s2[i]; } ../..
Four by four loop unrolling can for instance be implemented within another variant, say vector_unrolled_4, as follows:
template<typename V> struct implementation<vector_linear, V, vector_unrolled_4> { static inline void add (C* dest, const C* s1, const C* s2, nat n) { nat i= 0; for (; i + 4 < n; i += 4) { dest[i ]= s1[i ] + s2[i ]; dest[i+1]= s1[i+1] + s2[i+1]; dest[i+2]= s1[i+2] + s2[i+2]; dest[i+3]= s1[i+3] + s2[i+3]; } for (; i < n; i++) dest[i]= s1[i] + s2[i]; } ../..
When defining a new variant we hardly ever want to redefine the whole set of functionalities of other variants. Instead we wish to introduce new algorithms for a subset of functionalities, and to have the remaining implementations inherit from other variants. We say that a variant V inherits from W when the following partial specialization is active:
template<typename F, typename U> struct implementation<F,U,V>: implementation<F,U,W> {};
For instance, if the variant V inherits from vector_naive, then the add function from implementation<vector_linear,V> is inherited from implementation<vector_linear, vector_naive>, unless the following partial template specialization is implemented: template<typename U> implementation<vector_linear,U,V>.
It remains to explain the role of the three parameters of implementation. In fact the second parameter keeps track of the top level variant from which V inherits. Therefore, in a static member of implementation<F,U,V>, when one needs to call a function related to another set of functionalities G, then it is fetched in implementation<G,U>. In order to illustrate the interest of this method, let us consider polynomials in the next paragraphs.
Our class polynomial<C,V> (defined in algebramix/polynomial.hpp) represents polynomials with coefficients in C using implementation variant V. Each instance of a polynomial is represented by a vector, that is a pointer with a reference counter to a structure containing a pointer to the array of coefficients of type C* with its allocated size, and an integer for the length of the considered polynomial (defined as the degree plus one). The set of functionalities includes linear operations, mainly inherited from those of the vectors (since the internal representations are the same), multiplication, division, composition, Euclidean sequences, subresultants, evaluations, Chinese remaindering, etc. All these operations are implemented for the variant polynomial_naive (in the file algebramix/polynomial_naive.hpp) with the most general but naive algorithms (with quadratic cost for multiplication and division).
The variant polynomial_dicho<W> inherits from the parameter variant W and contains implementations of classical divide and conquer algorithms: Karatsuba for the product, Sieveking's polynomial division, halfgcd, divide and conquer evaluation and interpolations [26, Chapters 8–11]. Polynomial product functions belong to the set of functionalities polynomial_multiply, and division functions to the set polynomial_divide. The division functions of polynomial_dicho<W> are implemented as static members of
template<typename U, typename W> struct implementation<polynomial_divide,U,polynomial_dicho<W> >
They make use of the product functions of implementation<polynomial_multiply,U>.
Let us now consider polynomials with modular integer coefficients and let us describe how the Kronecker substitution product is plugged in. In a nutshell, the coefficients of the polynomials are first lifted into integers. The lifted integer polynomials are next evaluated at a sufficiently large power of two, and we compute the product of the resulting integers. The polynomial product is retrieved by splitting the integer product into chunks and reducing them by the modulus. For details we refer the reader for instance to [26, Chapter 8]. As to our implementation, we first create the new variant polynomial_kronecker<W> on top of another variant W (see file algebramix/polynomial_kronecker.hpp), which inherits from W, but which only redefines the implementation of the product in
template<typename U, typename W> struct implementation<polynomial_multiply,U,polynomial_kronecker<W> >
When using the variant K defined by
typedef polynomial_dicho<polynomial_kronecker<polynomiam_naive> > > K;
the product functions in implementation<polynomial_multiply, K> correspond to the Kronecker substitution. The functions in implementation<polynomial_division, K> are inherited from
implementation<polynomial_division, K, polynomial_dicho<polynomial_naive> >
and thus use Sieveking's division algorithm. The divisions rely in their turn on the multiplication from implementation<polynomial_multiply, K>, which benefits from Kronecker substitution. In this way, it is very easy for a user to redefine a new variant and override given functions a posteriori, without modifying existing code.
Finally, for a given mathematical template type, we define a default variant as a function of the remaining template parameters. For instance, the default variant of the parameter V in vector<C,V> is typename vector_variant<C>::type which is intended to provide users with reasonable performance. This default value is set to vector_naive, but can be overwritten for special coefficients. The default variant is also the default value of the variant parameter in the declaration of vector. Users can thus simply write vector<C>. The same mechanism applies to polynomials, series, matrices, etc.
In
Operations on vectors of integer and numeric types are implemented in a hierarchy of variants. One major variant controls the way loops are unrolled. Another important variant is dedicated to memory alignement.
In order to benefit from vectorized modular arithmetic within the fast Fourier transform, we implemented a vectorized variant of the classical inplace algorithm. In this section, we describe this variant and its applications to polynomial and matrix multiplication.
Let be a commutative field, let with , and let be a primitive th root of unity, which means that , and for all . Let be a vector space. The fast Fourier transform (with respect to ) of an tuple is the tuple with
In other words, , where
denotes the element . If
has binary expansion , then we write for the bitwise mirror of
in length . Following the terminology of [40], the truncated Fourrier transform (
The classical FFT and
Algorithm
Let for all .
Compute .
Let for all .
Compute .
Return .
Proof. Let for , so that we have and for all . Let and . A straightforward calculation leads to
Inverting Algorithm 1 is straightforward: it suffices to
invert steps from 4 to 1 and use the inverse of the
A critical point in large sizes becomes the matrix transposition,
necessary to reorganize data in steps 1 and 4 of Algorithm 1.
We designed ad hoc cachefriendly



Table 12 concerns



For the sake of comparison, we also report on the performance of the


Table 13. 
One major application of the


Table 14. Polynomial product for degrees over , user time in milliseconds. 
If is a field with sufficiently many primitive
roots of unity of order a power of two, then two
matrices and with
coefficients in of degrees
can be multiplied by performing


Table 15. Polynomial matrix product over for degrees , user time in seconds. 
Another important application of fast Fourier transforms is integer
multiplication. The method that we have implemented is based on
Kronecker segmentation and the threeprime FFT (see
[55] and [26, Section 8.3]). Let , and be
three prime numbers. The two integers and of at most bits to be
multiplied are split into chunks of a suitable bitsize
and converted into polynomials and of of degrees ,
with , and . The maximum bitsize of the coefficients of the product
is at most . The
parameter is taken such that
is minimal under the constraint that . The
polynomial can then be recovered from its values
computed modulo , and
using


Table 16. Integer product in bitsize , user time in milliseconds. 
Similarly to polynomial matrices, small matrices over large integers can
be multiplied efficiently using modular


Table 17. Integer matrix product in bitsize , user time in seconds. 
Nowadays
The use of
As a final comment we would like to emphasize that
[1] D. Abrahams and A. Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond. Addison Wesley, 2004.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The design and analysis of computer algorithms. AddisonWesley series in computer science and information processing. AddisonWesley Pub. Co., 1974.
[3] R. Alverson. Integer division using reciprocals. In Proceedings of the Tenth Symposium on Computer Arithmetic, pages 186–190. IEEE Computer Society Press, 1991.
[4] H. G. Baker. Computing A*B (mod N) efficiently in ANSI C. SIGPLAN Not., 27(1):95–98, 1992.
[5] B. Bank, M. Giusti, J. Heintz, G. Lecerf, G. Matera, and P. Solernó. Degeneracy loci and polynomial equation solving. Accepted for publication to Foundations of Computational Mathematics. Preprint available from http://arxiv.org/abs/1306.3390, 2013.
[6] N. Bardis, A. Drigas, A. Markovskyy, and J. Vrettaros. Accelerated modular multiplication algorithm of large word length numbers with a fixed module. In M. D. Lytras, P. Ordonez de Pablos, A. Ziderman, A. Roulstone, H. Maurer, and J. B. Imber, editors, Organizational, Business, and Technological Aspects of the Knowledge Society, volume 112 of Communications in Computer and Information Science, pages 497–505. Springer Berlin Heidelberg, 2010.
[7] P. Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In A. Odlyzko, editor, Advances in Cryptology – CRYPTO' 86, volume 263 of Lect. Notes Comput. Sci., pages 311–323. Springer Berlin Heidelberg, 1987.
[8] D. J. Bernstein, HsuehChung Chen, MingShing Chen, ChenMou Cheng, ChunHung Hsiao, Tanja Lange, ZongCing Lin, and BoYin Yang. The billionmulmodpersecond PC. In SHARCS'09 Specialpurpose Hardware for Attacking Cryptographic Systems: 131, 2009. http://cr.yp.to/djb.html.
[9] D. J. Bernstein, TienRen Chen, ChenMou Cheng, Tanja Lange, and BoYin Yang. ECM on graphics cards. In A. Joux, editor, Advances in Cryptology  EUROCRYPT 2009, volume 5479 of Lect. Notes Comput. Sci., pages 483–501. Springer Berlin Heidelberg, 2009.
[10] J. Berthomieu, G. Lecerf, and G. Quintin. Polynomial root finding over local rings and application to error correcting codes. Appl. Alg. Eng. Comm. Comp., 24(6):413–443, 2013.
[11] J. Berthomieu, J. van der Hoeven, and G. Lecerf. Relaxed algorithms for adic numbers. J. Théor. Nombres Bordeaux, 23(3):541–577, 2011.
[12] D. Bini and V. Y. Pan. Polynomial and Matrix Computations: Fundamental Algorithms. Progress in Theoretical Computer Science. Birkhauser Verlag GmbH, 2012.
[13] Boost team. Boost (C++ libraries). Software available at http://www.boost.org, from 1999.
[14] W. Bosma, J. Cannon, and C. Playoust. The Magma algebra system. I. The user language. J. Symbolic Comput., 24(34):235–265, 1997.
[15] A. Bosselaers, R. Govaerts, and J. Vandewalle. Comparison of three modular reduction functions. In D. R. Stinson, editor, Advances in Cryptology — CRYPTO' 93, volume 773 of Lect. Notes Comput. Sci., pages 175–186. Springer Berlin Heidelberg, 1994.
[16] British Standards Institution. The C standard: incorporating Technical Corrigendum 1: BS ISO/IEC 9899/1999. John Wiley, 2003.
[17] CLANG, a C language family frontend for LLVM. Software available at http://clang.llvm.org, from 2007.
[18] J.G. Dumas, T. Gautier, C. Pernet, and B. D. Saunders. LinBox founding scope allocation, parallel building blocks, and separate compilation. In K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama, editors, Mathematical Software – ICMS 2010, volume 6327 of Lect. Notes Comput. Sci., pages 77–83. Springer Berlin Heidelberg, 2010.
[19] J.G. Dumas, P. Giorgi, and C. Pernet. FFPACK: Finite field linear algebra package. In J. Schicho, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 119–126. ACM Press, 2004.
[20] J.G. Dumas, P. Giorgi, and C. Pernet. Dense linear algebra over wordsize prime fields: The FFLAS and FFPACK packages. ACM Trans. Math. Softw., 35(3):19:1–19:42, 2008.
[21] A. Fog. Instruction tables. Lists of instruction latencies, throughputs and microoperation breakdowns for Intel, AMD and VIA CPUs. http://www.agner.org/optimize, Copenhagen University College of Engineering, 2012.
[22] A. Fog. Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms. http://www.agner.org/optimize, Copenhagen University College of Engineering, 2012.
[23] A. Fog. Optimizing subroutines in assembly language. An optimization guide for x86 platforms. http://www.agner.org/optimize, Copenhagen University College of Engineering, 2012.
[24] L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmermann. MPFR: A multipleprecision binary floatingpoint library with correct rounding. ACM Trans. Math. Software, 33(2), 2007. Software available at http://www.mpfr.org.
[25] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proc. IEEE, 93(2):216–231, 2005.
[26] J. von zur Gathen and J. Gerhard. Modern computer algebra. Cambridge Univ. Press, 2nd edition, 2003.
[27] GCC, the GNU Compiler Collection. Software available at http://gcc.gnu.org, from 1987.
[28] K. Geddes, G. Gonnet, and Maplesoft. Maple (TM). http://www.maplesoft.com/products/maple, from 1980.
[29] P. Giorgi, Th. Izard, and A. Tisserand. Comparison of modular arithmetic algorithms on GPUs. In B. Chapman, F. Desprez, G. R. Joubert, A. Lichnewsky, F. Peters, and Th. Priol, editors, Parallel Computing: From Multicores and GPU's to Petascale, volume 19 of Advances in Parallel Computing, pages 315–322. IOS Press, 2010.
[30] P. Giorgi and R. Lebreton. Online order basis algorithm and its impact on block Wiedemann algorithm. In Proceedings of the 2014 International Symposium on Symbolic and Algebraic Computation, ISSAC '14. ACM Press, 2014. To appear.
[31] T. Granlund et al. GMP, the GNU multiple precision arithmetic library, from 1991. Software available at http://gmplib.org.
[32] T. Granlund and P. L. Montgomery. Division by invariant integers using multiplication. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, PLDI '94, pages 61–72, New York, NY, USA, 1994. ACM Press.
[33] Sardar Anisul Haque and Marc Moreno Maza. Plain polynomial arithmetic on GPU. Journal of Physics: Conference Series, 385(1):012014, 2012.
[34] W. Hart and the FLINT Team. FLINT: Fast Library for Number Theory, from 2008. Software available at http://www.flintlib.org.
[35] W. Hart and the MPIR Team. MPIR, Multiple Precision Integers and Rationals, from 2010. Software available at http://www.mpir.org.
[36] D. Harvey. A cachefriendly truncated FFT. Theoret. Comput. Sci., 410(27–29):2649–2658, 2009.
[37] D. Harvey and D. S. Roche. An inplace truncated Fourier transform and applications to polynomial multiplication. In S. M. Watt, editor, Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation, ISSAC '10, pages 325–329, New York, NY, USA, 2010. ACM Press.
[38] D. Harvey and A. V. Sutherland. Computing Hasse–Witt matrices of hyperelliptic curves in average polynomial time. Algorithmic Number Theory 11th International Symposium (ANTS XI), 2014. To appear.
[39] W. Hasenplaugh, G. Gaubatz, and V. Gopal. Fast modular reduction. In P. Kornerup and J.M. Muller, editors, 18th IEEE Symposium on Computer Arithmetic, ARITH '07, pages 225–229. IEEE Computer Society, 2007.
[40] J. van der Hoeven. The truncated Fourier transform and applications. In J. Schicho, editor, Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, ISSAC '04, pages 290–296. ACM Press, 2004.
[41] J. van der Hoeven and G. Lecerf. Interfacing Mathemagix with C++. In M. Monagan, G. Cooperman, and M. Giesbrecht, editors, Proceedings of the 2013 International Symposium on Symbolic and Algebraic Computation, ISSAC '13, pages 363–370. ACM Press, 2013.
[42] J. van der Hoeven and G.
Lecerf. Mathemagix User Guide. HAL, 2013.
http://hal.archivesouvertes.fr/hal00785549.
[43] J. van der Hoeven and G. Lecerf. On the bitcomplexity of sparse polynomial and series multiplication. J. Symbolic Comput., 50:227–254, 2013.
[44] J. van der Hoeven, G. Lecerf, B. Mourain, Ph. Trébuchet, J. Berthomieu, D. Diatta, and A. Mantzaflaris. Mathemagix, the quest of modularity and efficiency for symbolic and certified numeric computation. ACM SIGSAM Communications in Computer Algebra, 177(3), 2011. In Section "ISSAC 2011 Software Demonstrations", edited by M. Stillman, p. 166–188.
[45] J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix, 2002–2014. Software available at http://www.mathemagix.org.
[46] Intel Corporation. Intel (R)
intrinsics guide. Version 3.0.1, released 7/23/2013.
http://software.intel.com/enus/articles/intelintrinsicsguide.
[47] Intel Corporation, 2200
Mission College Blvd., Santa Clara, CA 950528119, USA. Intel
(R) Architecture Instruction Set Extensions Programming
Reference, 2013. Ref. 319433015.
http://software.intel.com/enus/intelisaextensions.
[48] Ç. Kaya Koç. Montgomery reduction with even modulus. IEE Proceedings  Computers and Digital Techniques, 141(5):314–316, 1994.
[49] Ç. Kaya Koç, T. Acar, and Jr. Kaliski, B. S. Analyzing and comparing Montgomery multiplication algorithms. Micro, IEEE, 16(3):26–33, 1996.
[50] D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Pearson Education, 3rd edition, 1997.
[51] G. Lecerf. Mathemagix: towards large scale programming for symbolic and certified numeric computations. In K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama, editors, Mathematical Software  ICMS 2010, Third International Congress on Mathematical Software, Kobe, Japan, September 1317, 2010, volume 6327 of Lect. Notes Comput. Sci., pages 329–332. Springer, 2010.
[52] P. L. Montgomery. Modular multiplication without trial division. Math. Comp., 44(170):519–521, 1985.
[53] M. Moreno Maza and Y. Xie. FFTBased Dense Polynomial Arithmetic on Multicores. In D. J. K. Mewhort, N. M. Cann, G. W. Slater, and T. J. Naughton, editors, High Performance Computing Systems and Applications, volume 5976 of Lect. Notes Comput. Sci., pages 378–399. Springer Berlin Heidelberg, 2010.
[54] N. Nedjah and L. de Macedo Mourelle. A review of modular multiplication methods and respective hardware implementations. Informatica, 30(1):111–129, 2006.
[55] J. M. Pollard. The fast Fourier transform in a finite field. Math. Comp., 25(114):365–374, 1971.
[56] G. van Rossum and J. de Boer. Interactively testing remote servers using the Python programming language. CWI Quarterly, 4(4):283–303, 1991. Software available at http://www.python.org.
[57] M. J. Schulte, J. Omar, and E. E. Jr. Swartzlander. Optimal initial approximations for the NewtonRaphson division algorithm. Computing, 53(34):233–242, 1994.
[58] V. Shoup. NTL: A Library
for doing Number Theory, 2014. Software, version
6.1.0.
http://www.shoup.net/ntl.
[59] W. A. Stein et al. Sage Mathematics Software. The Sage Development Team, from 2004. Software available at http://www.sagemath.org.
[60] E. Thomé. Théorie algorithmique des nombres et applications à la cryptanalyse de primitives cryptographiques. http://www.loria.fr/~thome/files/hdr.pdf, 2012. Mémoire d'habilitation à diriger des recherches de l'Université de Lorraine, France.
Registered trademarks.