> <\body> <\hide-preamble> >>> >>> \; |<\doc-note> This work has been supported by an ERC-2023-ADG grant for the ODELIX project (number 101142171). <\wide-tabular> ||||||| ||LOGO_ERC-FLAG_FP.png>|61.62pt|28.51pt||>>>>> ||<\author-affiliation> Laboratoire d'informatique de l'École polytechnique LIX, UMR 7161 CNRS Bâtiment Alan Turing, CS35003 1, rue Honoré d'Estienne d'Orves 91120 Palaiseau, France |>>||<\author-affiliation> Laboratoire d'informatique de l'École polytechnique LIX, UMR 7161 CNRS Bâtiment Alan Turing, CS35003 1, rue Honoré d'Estienne d'Orves 91120 Palaiseau, France |>>|||>|<\doc-note> > Straight-line programs have proved to be an extremely useful framework both for theoretical work on algebraic complexity and for practical implementations. In this paper, we expose ideas for the development of high performance libraries dedicated to straight-line programs, with the hope that they will allow to fully leverage the theoretical advantages of this framework for practical applications. > A straight-line program (or SLP) is a program that takes a finite number of input arguments, then executes a sequence of simple arithmetic instructions, and finally returns afinite number of output values. For instance, the function \+7*y|)>>> can be computed using the following SLP: <\equation> >>|x\x>>|7\y>>|a+b>>|r\r>>|>>>>> The instructions operate on a finite list of variables (here ) and constants (here). There are no restrictions on how variables are allowed to be reused: they can be reassigned several times (like ) and they may occur multiple times in the input and output of an operation (like in the operations x\x> and r\r>). SLPs have a profoundly dual nature as programs and data. As programs, SLPs are eligible for arange of operations and techniques from the theory of compilation: constant folding, common subexpression elimination, register allocation, etc. Although the expressive power of SLPs is very limited, their simplicity can be turned into an advantage: many program optimizations that are hard to implement for general purpose programs become simpler and more efficient when restricted to SLPs. In particular, most classical program transformations can be done in linear time. Traditional general purpose compilation techniques also rarely exploit mathematical properties (beyond basic algebraic rules that are useful during simplifications). In the SLP world, algebraic transformations such as automatic differentiation and transposition are easy to implement; more examples will be discussed in this paper. Let us now turn to the data side. The example SLP() can be regarded as a polynomial +7*y|)>>. In this sense, SLPs constitute a data structure for the representation of multivariate polynomials. Note that this representation is more compact than the(expanded) sparse representation +14*x*y+49*y>> and way more compact than the dense \Ptotal degree\Q representation 4>P*x*y>. If our main goal is to a given polynomial many times, then the SLP representation tends to be particularly efficient. However, for explicit computations polynomials, other representations may be preferred: for instance, the dense representation allows for FFT multiplication8.2> and the sparse representation also allows for special evaluation-interpolation techniques. The above efficiency considerations explain why SLPs have become a central framework in the area of algebraic complexity. Joos Heintz has been an early proponent of polynomial SLPs in computer algebra. 
Together with Schnorr, he designed a polynomial time algorithm to test if an SLP represents the zero function or not, up to precomputations that only depend on the size of the SLP. Later, Giusti and Heintz proposed an efficient algorithm to compute \ the dimension of the solution set of a system of homogeneous polynomials given by SLPs. Then, Heintz and his collaborators showed that the polynomials involved in the Nullstellensatz had good evaluation properties, and could therefore be better represented by SLPs than by dense polynomials. These results led to a new method, called , which allows polynomial systems to be solved in a time that is polynomial in intrinsic geometric quantities>. In order to implement the geometric resolution algorithm, it was necessary to begin with programming efficient evaluation data structures. The first steps were presented at the TERA'1996 conference held in Santander (Spain) by Aldaz, and by Castaño, Llovet, and Martìnez. Later on, a implementation of SLPs was done by Hägele (a former PhD student of Heintz). Then another library was written in the language. Independently, other experiments were conducted to implement the algorithm of in the computer algebra system, that was readily offering an evaluation data structure. Further extensions of these approaches finally led to sharp complexity bounds and successful software implementations. More references can be found in. There are a few other concepts that are very similar to SLPs. Trees and directed acyclic graphs (DAGs) are other popular structures for the representation of mathematical expressions. Conversions to and from SLPs are rather straightforward, but SLPs have the advantage that they expose more low-level execution details. Since performance is one of our primary concerns, we therefore prefer the framework of SLPs in this paper. Another related concept is the blackbox representation. Blackbox programs can be evaluated, but their \Psource code\Q is not available. This representation is popular in numerical analysis and the complexity of various numerical methods4, 9, 10, and17> can be expressed in terms of the required number of function evaluations (the length of an SLP corresponds to the cost of one function evaluation in this context). The blackbox representation is also popular in computer algebra, especially in the area of sparse interpolation. However, for programs that we write ourselves, it is unnatural to disallow opening the blackboxes. This restriction makes it artificially hard to perform many algebraic transformations such as modular reduction, automatic differentiation, etc. On the other hand, computing the determinant of an n>> matrix can easily be done in time |)>> by a blackbox using Gaussian elimination. This task is less trivial for SLPs, since branching to find appropriate pivots is forbidden. Despite the importance of SLPs for theoretical and practical purposes, they have not yet been the central target of adedicated high performance library, up to our knowledge. (We refer to for early packages for polynomial SLPs.) In contrast, such libraries exist for integer and rational numbers(), multiple precision floating point numbers (), linear algebra(BLAS, ), ball arithmetic(), etc. As a consequence, we feel that a lot of potential advantages ofSLPs are underexploited in current computer algebra systems and beyond. In 2015, we started the development of such a dedicated library for SLPs, called . This library is part of the system and publicly available SVN since revision 10218. 
More recently, we have started a new reimplementation in C++ which aims at even higher performance. This new stand-alone library, called, can be used with or without . Several of our past publications relied implicitly, but quite essentially, on the library. This paper focuses for the first time on the SLP libraries themselves. We will mainly present the high level motivations, goals, and design decisions. Future papers shall provide more details on the ongoing implementations and specific applications. Performance is a primary motivation behind our work. Both operations on SLPs and their execution should be as efficient as possible. At the same time, we wish to be able to fully exploit the generality of the SLP framework. This requires a suitable representation of SLPs that is both versatile and efficient to manipulate. As will be detailed in section, we will represent SLPs using a suitable low-level encoding of pseudo-code as in(), while allowing its full semantic richness to be expressed. For instance, the ordering of the instructions in() is significant, as well as the ability to assign values multiple times to the same variables. One may also specify a precise memory layout. One typical workflow for SLP libraries is the following: <\equation> |>> > |optimize>> |)>> |backend>> machine code> For instance, we might start with an SLP that computes the determinant of a4> matrix multiplication, then compute its gradient using automatic differentiation, then optimize this code, and finally generate executable machine code. The initial SLP can be specified directly using low-level pseudo-code. But this is usually not most convenient, except maybe for very simple SLPs. In section, we describe a higher level recording mechanism. Given an existing, say, C++template implementation for the computation of determinants of matrices, we may specialize this routine for a special \Precorder\Q type and run it on a 4> matrix with generic entries. The recording mechanism will then be able to recover a trace of this computation and produce an SLP for determinants of 4> matrices. In section, we turn to common program transformations like simplification, register allocation, etc. Although this is part of the classical theory of compilation, the SLP framework gives rise to several twists. Since our libraries are required to be efficient, they allow us to efficiently generate SLPs on the fly. Such SLPs become disposable as soon as they are no longer needed. Consequently, our SLPs can be large, which makes it crucial that transformations on SLPs run in linear time. The philosophy is to sacrifice expensive optimizations that require more time, but to make up for this by generating our SLPs in such a way that cheaper optimizations already produce almost optimal code. That being said, there remains a trade-off between speed of compilation and speed of execution: when a short SLP needs to be executed really often, it could still make sense to implement more expensive optimizations. But the primary focus of our libraries is on transformations that run in linear time. When considering SLPs over specific algebraic domains such as integers, modular integers, floating point numbers, polynomials, matrices, balls, , it may be possible to exploit special mathematical properties of these domains in order to generate more efficient SLPs. This type of transformations, which can be problematic to implement in general purpose compilers, are the subject of section. 
Popular examples from computer algebra are automatic differentiation and transposition. Another example was described in: ball arithmetic allows us to lift any numerical algorithm into a reliable one that also produces guaranteed error bounds for all results. However, rounding errors need to be taken into account when implementing a generic ball arithmetic, which leads to a non-trivial overhead. When the algorithm is actually an SLP, this overhead can be eliminated through a static analysis. Similarly, assume given an SLP over arbitrary precision integers, together with a bound for the inputs. Then one may statically compute bounds for all intermediate results and generate a dedicated SLP that eliminates the need to examine the precisions during runtime. This may avoid many branches and subroutine calls. Several applications that motivated our work are discussed in section. From the perspective of a high level application, the SLP library is a mixture of a compiler and atraditional computer algebra library. Instead of directly calling a high level function like amatrix product or a discrete Fourier transform (DFT), the idea is rather to compute afunction that is able to efficiently compute matrix products or DFTs of a specified kind( over a particular coefficient ring and for particular dimensions). In this scenario, the library is used as a tool to accelerate critical operations. High level master routines for matrix products and DFTs may automatically compile and cache efficient machine code for matrices and DFTs of small or moderate sizes. Products and DFTs of large sizes can be reduced to these base cases. It is an interesting question whether these master routines should also be generated by the SLP library. In any case, the SLP framework allows for additional flexibility: instead of statically instantiating C++ templates for a given type, we may generate these instantiations dynamically. Moreover, using the techniques from section, this can be done in a way that exploits mathematical properties of the type, which are invisible for the C++ compiler. More specifically, we will discuss applications to numerical homotopy continuation, discrete Fourier transforms, power series computations, differential equations, integer arithmetic, and geometric resolution of polynomial systems. There are many other applications, starting with linear algebra, but our selection will allow us to highlight various particular advantages and challenges for the SLP framework. The last section is dedicated to benchmarks. Some timings for the library were already presented in our previous papers. For this reason, we mainly focus on some first timings for the library, while noting that the development of this library is still at an early and experimental stage. Nonetheless, we were able to appreciate asignificant speed-up for operations on SLPs with respect to the library; the generated machine code is of comparable quality. Throughout this paper, we denote \|}>>. For every \>, we also define \,n-1|}>>. <\acknowledgments*> We wish to thank Albin , Alexander , and Arnaud for useful discussions and suggestions. We start this section with an abstract formalization of SLPs. Our presentation is slightly different from or, in order to be somewhat closer to the actual machine representation that we will describe next. Let > be aset of operation symbols together with a function |\|>\\\\>. We call > a and|\|>> the of >, for any operation \\>. A with signature > is aset> together with afunction >\\|\|>>\\>> for every \\>. 
We often write > instead of>> if no confusion can arise. The domain > is said to be if we have data structures for representing the elements of> and > on a computer and if we have programs for |\|>> and>>, for all \\>. <\example> For the introductory example(), we may take \|}>> to be the signature of rings with >|\|>=|\|>=|\|>=2> and > to be the ring > or >. <\remark> In model theory and mathematical logic, > is also called a (without relation symbols) and > a structure for this language; see, , B>. For simplicity, we restricted ourselves to one-sorted languages, but extensions to the multi-sorted case are straightforward. Let > be an effective domain with signature >. An is a tuple <\equation*> \=>,a,0>,\,a,|\|>>|)> with >\\> and ,0>,\,a,|\|>>\\>, where |\|>\|\|>>. We call ,0>> the or and ,1>,\,a,|\|>>> the or . We denote by >> the set of such instructions. Given \>, we also let ,n>\\\>\a,1>,\,a,|\|>>\\|}>>. A () over > is a quadruple >, where <\itemize> ,\,D-1>|)>\\>> is a tuple of data fields, ,\,I-1>|)>\\>>> is a tuple of pairwise distinct input locations, ,\,O-1>|)>\\>>> is a tuple of output locations, ,\,P-1>|)>\\,>> is a tuple of instructions. We regard >> as the working space for our SLP. Each location in this working space can be represented using an integer in >> and instructions directly operate on the working space as we will describe now. Given a \\>> of our working space, the of an instruction \\,>> gives rise to a new state \\|)>> where \\> if a,0>> and \\>,1>>,\,\,|\|>>>|)>> if ,0>>. Given ,\,x-1>|)>\\>>, the corresponding begin state > of our SLP is given by \D> if ,\,I-1>|}>> and >\x> for \>>. Execution of the SLP next gives rise to a sequence of states \P|)>,\,\;>\P>-1;>|)>>. The ,\,y-1>|)>\\>> of our SLP are now given by \\;O>> for \>>. In this way our SLP gives rise to a function >\\>;x\y> that we will also denote by. Two SLPs will be said to be if they compute the same function. <\example> With > and > as in Example, our introductory example() can be represented as follows: >, >, >, and ,P,P,P|)>> with <\eqnarray*> >||>,3,0,0|)>>>|>||>,4,5,1|)>>>|>||,2,3,4|)>>>|>||>,2,2,2|)>.>>>> One may check that this SLP indeed computes the intended function when using the above semantics. Note that we may chose arbitrary values for the input fields,D>>, the output field >, and the auxiliary data fields >, >. So we may just as well take them to be zero, as we did. For the sake of readability, we will continue to use the notation() instead(), the necessary rewritings being clear. If we wish to make the numbering of registers precise in(), then we will replace by ,x,x,x,x,x>: <\equation> ,x|)>>>|\7>>|\x\x>>|\x\x>>|\x+x>>|\x\x>>||)>.>>>>> Note that \7> is not really an instruction, but rather a way to indicate that the data field> contains the constant . It will be useful to call elements of >> and denote them by more intelligible symbols like ,\,x-1>> whenever convenient. The entries of and are called and , respectively. Non-input variables that also do not occur as destination operands of instructions in are called . An is avariable that is neither an input variable, nor an output variable, nor a constant. Given avariable \>>> the corresponding is >. Input variables, output variables, and constants naturally correspond to , , and . <\remark> One might prefer to use mutually distinct input variables, output variables, auxiliary variables, and constants. One may also wish to impose that output variables can only be assigned once. 
In our formalism none of this is mandatory, but any SLP can easily be rewritten into an equivalent SLP that satisfies these properties. The ability to share input and output variables or different output variables can be useful in order to keep the length> of theSLP as small as possible. <\remark> Testing the equivalence of two polynomial SLPs is hard in general. But there is an easy probabilistic algorithm to do this: simply evaluate the SLP at a random point. For SLPs of bounded size, and showed how to derandomize this method, by using a suitable deterministic sequence of points instead. Of course, this only works in theory, since such sequences are necessarily hard to compute! One crucial design decision concerns the choice of the signature > and the way how operations are represented. On the one hand, we do not want to commit ourselves to one specific signature, so it is important to plan future extensions and even allow users to specify their own dynamic signatures. This kind of flexibility requires no extra implementation effort for many \Psignature-agnostic\Q operations on SLPs, such as common subexpression elimination, register allocation, etc. On the other hand, the precise signature does matter for operations like simplification, automatic differentiation, etc. We need our code to be as efficient as possible for the most common operations in >, while providing the possibility to smoothly extend our implementations to more exotic operations whenever needed. In the library, we chose to represent the operations in> by abstract expressions. Exploiting pattern matching facilities inside , this allowed us to implement operations on SLPs in an elegant and naturally extensible way. However, this design choice resulted in a non-trivial efficiency overhead. In the library, we therefore decided to represent operations more directly by machine integers. Some of the bits of these integers are reserved for interesting properties of the operation, such as its arity. Lookup tables are used to retrieve more extensive information. What about the choice of a subsignature > of operations that we deem to be\Pmost common\Q and that should benefit from particularly efficient support in the library? At the end of the day, we wish to turn our SLPs into efficient machine code. It is therefore useful to draw inspiration from the instruction sets of modern CPUs and GPUs. For example, this motivated our decision to include the (FMA) operation \a\b+c>> in >, together with its signed variants \a\b-c>, \c-a\b>, and \-a\b-c>. Although this adds to the complexity of > (why not emulate FMA using multiplication and addition?), this decision tends to result in better machine code (more about this below; see also section). However, the instructions sets of modern processors are very large. In order to keep our SLP library as simple as possible, it would be unreasonable to directly support all these instructions. Much of the complexity of modern instruction sets is due to multiple instances of the same instruction for different data types, SIMD vector widths, immediate arguments, addressing modes, guard conditions, etc. In the SLP setting it is more meaningful to implement general mechanisms for managing such variants. Modern processors also tend to provide a large number of exotic instructions, which are only needed at rare occasions, and which can typically be emulated using simpler instructions. 
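Before discussing which operations deserve privileged support, let us make the machine-integer encoding of operations mentioned above slightly more concrete. The sketch below packs the arity of an operation into a few low bits of its code and interprets the resulting byte code for the domain of double precision numbers. The encoding, the opcode names, and the byte code layout are purely illustrative; they are not the actual layout used by our libraries.

<\cpp-code>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative opcode encoding: the low 4 bits store the arity, the
// remaining bits identify the operation.
using opcode = uint32_t;
constexpr opcode make_op (uint32_t id, uint32_t arity) {
  return (id << 4) | arity;
}
constexpr uint32_t arity_of (opcode op) { return op & 0xf; }

constexpr opcode OP_ADD = make_op (1, 2);
constexpr opcode OP_MUL = make_op (2, 2);
constexpr opcode OP_FMA = make_op (3, 3);   // d <- a*b + c

// One interpretation step over the domain D = double.  The instruction
// starts at offset o in the byte code B and reads/writes the data vector D.
// Assumed layout per instruction: opcode, destination, then arity sources.
inline size_t step (const std::vector<uint32_t>& B, size_t o,
                    std::vector<double>& D) {
  opcode op = B[o];
  uint32_t dest = B[o+1];
  switch (op) {
    case OP_ADD: D[dest] = D[B[o+2]] + D[B[o+3]]; break;
    case OP_MUL: D[dest] = D[B[o+2]] * D[B[o+3]]; break;
    case OP_FMA: D[dest] = D[B[o+2]] * D[B[o+3]] + D[B[o+4]]; break;
    default: ;   // a real implementation dispatches through a lookup table
  }
  return o + arity_of (op) + 2;   // offset of the next instruction
}
</cpp-code>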
One useful criterion for deciding when and where to support particular operations is to ask the question how easily the operation can be emulated, while resulting in final machine code of the same quality. Let us illustrate this idea on a few examples. Imagine that our hardware supports FMA, but that we did not include this operation in >. In order to exploit this hardware feature, we will need to implement a special post-processing transformation on SLPs for locating and replacing pairs of instructions like b\c> and a+e> by FMAs, even when they are separated by several independent instructions. This is doable, but not so easy. Why bother, if it suffices to add FMA to our signature and generate high quality code with FMAs from the outset? Maybe it will require more implementation efforts to systematically support FMAs inside our routines on SLPs, but we won't need a separate and potentially expensive post-processing step. In fact, for floating point numbers that follow the IEEE 754 standard, the FMA instruction cannot be emulated using one multiplication and one addition. This is due to the particular semantics of correct rounding: if >> denotes the double precision number that is nearest to a real number , then we do not always have >b+c|)>=>>b|)>+c|)>>. The IEEE754 rounding semantics can be exploited for the implementation of extended precision arithmetic and modular arithmetic. This provides additional motivation for including FMA in our basic signature >. We do not only have flexibility regarding the signature >, but also when it comes to the domains > that we wish to support. For the selection of operations in >, it is also important to consider their implementations for higher level domains, such as modular integers, matrices, truncated power series, etc. Now the implementation of >> is often simpler than the combined implementation of >> and >>. For instance, if =\n>> is the domain of n> matrices over a field, then >> can be implemented using > FMAs over>, whereas its emulation via >> and >> would require +n> instructions. One may argue that the better code should be recoverable using a suitable post-processing step as described above. This is true for matrix multiplication, but turns out to become more problematic for other domains, such as modular integers that are implemented using floating point FMAs. Once more, the inclusion ofFMA into our signature > tends to facilitate the generation of high quality machinecode. Let us finish with an example of some operations that we do wish to include in >. Instruction sets of CPUs allow us to work with integers and floating point numbers of various precisions. How can we do computations with mixed types in our framework? Instead of mimicking the CPUs instruction set within >, the idea is rather to provide a general mechanism to dynamically enrich >. Here we may exploit the fact that our framework made very few assumptions on > and >. Given domains ,\,\>> of signatures ,\,\\\>, the idea is to dynamically create a signature \|)>\k\s>,\\\|}>> and adomain \\\\\\\> such that |)>\\> operates on > via |)>,\,a|\|>>|)>=\,\,a|\|>>|)>> if ,\,a|\|>>\\> and |)>,\,a|)>=error> otherwise. We regard > and > as being disjoint, which allows us to dynamically add > to >. \ In this way, it is easy to emulate computations with mixed types and signatures in our abstract SLP framework (and thereby support multi-sorted structures as in Remark). This kind of emulation comes with some overhead though. 
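As a toy illustration of this union mechanism, the following sketch combines just two components (double precision numbers and 64 bit integers) into a single tagged domain; an operation of the combined signature checks that all arguments belong to the same component and raises an error otherwise. The use of std::variant is only for the sake of the example and is not how the actual libraries represent such domains.

<\cpp-code>
#include <cstdint>
#include <stdexcept>
#include <variant>
#include <vector>

// A value of the disjoint union of two domains.
using value = std::variant<double, int64_t>;

// Addition on the combined domain: defined componentwise, error otherwise.
value add (const std::vector<value>& args) {
  if (std::holds_alternative<double> (args[0]) &&
      std::holds_alternative<double> (args[1]))
    return std::get<double> (args[0]) + std::get<double> (args[1]);
  if (std::holds_alternative<int64_t> (args[0]) &&
      std::holds_alternative<int64_t> (args[1]))
    return std::get<int64_t> (args[0]) + std::get<int64_t> (args[1]);
  throw std::runtime_error ("mixed domains");   // the 'error' case
}
</cpp-code>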
But we consider this to be a small price to pay for the simplicity that we gain by avoiding the management of multiple types and signatures in the framework itself. The SLP framework is more general than it may seem at first sight due to the fact that we are completely free to chose the signature > and the underlying domain >. In principle, this allows for all data types found in computer algebra systems. Nonetheless, for our library it is wise to include privileged support for a more restricted set of domains, similarly to what we did with > in the case of signatures. First of all, we clearly wish to be able to have domains for all the machines types that can be found in hardware. In traditional CPUs, these types were limited to floating point numbers and (signed and unsigned) integers of various bit-sizes. In modern CPUs and GPUs, one should add SIMD vector versions of these types for various widths and possibly a few more special types like matrices and polynomials over > of several sizes and degrees. On top of the machine types, we next require support for basic arithmetical domains such as vectors, matrices, polynomials, jets, modular integers and polynomials, extended precision numbers, intervals, balls, etc. Such functionality is commonplace in computer algebra systems or specialized HPC libraries such as . However, in the and libraries, special attention is paid to domains for which the corresponding signature can itself be implemented using SLPs over lower level types. We call such domains and we will describe them in more detail in section below. The domain /|2*\|\>|)>3>> of 3> matrices with 64 bit integer coefficients in a typical example of an SLP algebra. One advantage is that instances can always be represented as dense vectors of a fixed size. In the library, we implemented SLPs as a type that is templated by the domain. Here we take advantage of the fact that supports both static templates (that are instantiated at compile time, like in C++) and dynamic templates (for which the template parameters are passed as extra parameters); see. Template parameters are themselves typed, which means that the signature > of our domain must be specified. In our implementation, we use > as the signature, while agreeing that operations in > that are not implemented for a given domain simply raise an error. We also provide default implementations for operations like FMA that can be emulated using other operations. In order to achieve full extensibility, it suffices to add a generic operation to our signature for all operations in \\>; the arguments of this generic operation are the concrete operation in \\> and the (variable number of) arguments of this operation. In our lower level C++ implementation in , we introduced a special abstract data type for domains. Each operation in > gives rise to a virtual method for this type which can carry out the operation for the concrete domain at hand. An element of adomain is represented as one consecutive block of data inside memory. The size of this block is fixed and can be retrieved using another virtual method of the type. This mechanism is essentially the same as the one from the library, except that static instantiation for a particular domain is no longer possible (of course, this leads to some overhead, but we will see in section below that this will usually not be a problem, because our real aim will be to record execution traces). Full extensibility can again be achieved by adding one more virtual function for generic operations. 
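The following schematic interface illustrates this kind of abstract domain type, with one virtual method per basic operation, elements stored as contiguous blocks of fixed size, and an escape hatch for generic operations. The names are hypothetical and do not reproduce the actual C++ interface of our libraries.

<\cpp-code>
#include <cstddef>
#include <cstdint>

// Schematic abstract domain type.
class domain {
public:
  virtual ~domain () {}
  // Size in bytes of one element; elements live in contiguous memory.
  virtual size_t element_size () const = 0;
  // A few operations from the privileged signature.
  virtual void add (void* dest, const void* a, const void* b) const = 0;
  virtual void mul (void* dest, const void* a, const void* b) const = 0;
  virtual void fma (void* dest, const void* a, const void* b,
                    const void* c) const = 0;
  // Escape hatch for generic operations outside the privileged signature.
  virtual void apply (uint32_t op, void* dest,
                      const void* const* args, size_t nargs) const = 0;
};

// Example instance: double precision numbers.
class double_domain: public domain {
public:
  size_t element_size () const override { return sizeof (double); }
  void add (void* d, const void* a, const void* b) const override {
    *(double*) d = *(const double*) a + *(const double*) b; }
  void mul (void* d, const void* a, const void* b) const override {
    *(double*) d = *(const double*) a * *(const double*) b; }
  void fma (void* d, const void* a, const void* b, const void* c) const override {
    *(double*) d = *(const double*) a * *(const double*) b + *(const double*) c; }
  void apply (uint32_t, void*, const void* const*, size_t) const override {}
};
</cpp-code>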
Given an abstract SLP >, it is natural to represent , , and as actual vectors in contiguous memory. In the cases of and , these vectors have machine integer coefficients. If elements of > can themselves be represented by vectors of a fixed size, then one may use the additional optimization to concatenate all entries of into asingle vector, which we do in the library. For the representation of programs , we have several options. In the library, we use a separate data type for instructions in >>, so that is simply represented as a vector of instructions. In , both elements of > and the arguments of instructions are represented by machine integers. Consequently, instructions in >> can be represented by machine integer vectors. This again makes it possible to concatenate all instructions into a single machine integer vector, which can be thought of as\Pbytecode\Q. The systematic concatenation of vector entries of vectors in greatly reduces the number of memory allocations. Note however that instructions may have different lengths as vectors. If we wish to quickly address specific instructions, one linear pass through is required in order to build a table with the offsets of the individual instructions. Alternatively, we could have imposed a maximal arity for the operations in >, which would have allowed us to represent all instructions by vectors of the same length. This alternative representation is less compact but has the advantage that individual instructions can be addressed more easily. It is also not so clear what would be a good maximal arity. The library contains one more low-level optimization. Instructions and data fields of SLPs are typically built one by one. During their construction, this means that \ and should really be considered as output streams rather than vectors. Moreover, it is almost always possible to give a simple and often tight bound for the lengths of and at the end of this streaming process. This makes it possible to allocate these vectors once and for all and directly stream the entries without being preoccupied by range checks or reallocations of larger memory blocks. We have considered but discarded a few alternative design options. All operations in> have a single output value in our implementation. Since SLPs are allowed to have multiple output values, it might be interesting to allow the same thing for operations in>. Some instructions in CPUs are indeed of this kind, like long multiplication(x86) or loading several registers from memory (ARM). However, this elegant feature requires an additional implementation effort for almost all operations on SLPs. This also incurs aperformance penalty, even for the vast majority of SLPs that do not need this feature. Note finally that it is still possible to emulate instructions with multiple outputs by considering single vector outputs and then introducing extra instructions for extracting entries of these vectors. Unfortunately however, this is really inelegant, because it forces us to use more convoluted \ non-fixed-sized vector domains. It is also non-trivial to recombine these multiple instructions into single machines instruction when building the final machine code. Similar considerations apply for in-place semantics like in the C instruction. Here again, we note that our SLP formalism allows for this kind of semantics, since entries of and are not required to be disjoint. But it is also easy to emulate an instruction like by . 
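For concreteness, the index table of instruction offsets mentioned above can be rebuilt in one linear pass over the concatenated byte code. The sketch below assumes the same illustrative encoding as in the earlier sketch (arity in the low bits of the opcode, followed by one destination field and the source fields); it is not the actual representation of our libraries.

<\cpp-code>
#include <cstddef>
#include <cstdint>
#include <vector>

// One linear pass to recover the offset of every instruction in the
// concatenated byte code B.
std::vector<size_t> instruction_offsets (const std::vector<uint32_t>& B) {
  std::vector<size_t> offsets;
  size_t o = 0;
  while (o < B.size ()) {
    offsets.push_back (o);
    uint32_t arity = B[o] & 0xf;   // illustrative encoding, as before
    o += arity + 2;                // opcode + destination + arity sources
  }
  return offsets;
}
</cpp-code>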
Yet another interesting type of alternative design philosophy would be to systematically rule out redundant computations by carrying out common subexpression elimination during the very construction of SLPs. However, for cache efficiency, it is sometimes more efficient to recompute avalue rather than retrieving it from memory. Our representation provides greater control over what is stored where and when. The systematic use of common subexpression elimination through hash tables also has a price in performance. Recall that two SLPs are if they compute the same mathematical function. Among two equivalent SLPs, the shorter one is typically most efficient. But the length of an SLP is not the only factor for its practical performance. It is also important to keep an eye on dependency chains, data locality, and cache performance. Besides performance, \Preadability\Q of an SLP may also be an issue: we often change the numbering of data fields in a way that makes the SLP more readable. We catch all these considerations under the concept of . This concept is deliberately vague, because the practical benefits of SLPs of \Phigher quality\Q depend greatly on our specific hardware. Operations on SLPs should be designed carefully in order to avoid needless quality deteriorations. This can often be achieved by touching the input SLPs \Pas least as possible\Q and only in \Pstraightforward ways\Q. Let us briefly review how dependency chains and data layout may affect the performance. Consider the following two SLPs for complex multiplication: <\equation> ,y,x,y|)>>|>|,y,x,y|)>>>|\x\x>||\x\x>>|\fms,y,x|)>>||\x\y>>|\x\y>||\fms,y,x|)>>>|\fma,y,y|)>>||\fma,y,y|)>>>|,y|)>>||,y|)>>>>>> Here ,y,x|)>\x-y\y> stands for \Pfused multiply subtract\Q and ,y,y|)>\y+x\y> for \Pfused multiply add\Q. Modern hardware tends to contain multiple for additions, multiplications, fused multiplications, etc., which allow for . In the right SLP, this means that the instruction \x\y> can be to a second multiplication unit while \x\x> is still being executed by afirst unit. However, in the left SLP, the instruction \fms,y,x|)>> has to wait the for the outcome of \x\x>. Whereas the fused multiplications in both SLPs on the result of one earlier multiplication, these dependencies are better spread out in time in the right SLP, which typically leads to better performance. Let us now turn to the numbering of our data fields in . When generating the final machine code for our SLP, we typically cut into a small number of contiguous blocks, which are then mapped into contiguous chunks in memory. For instance, we might put the input, output, auxiliary, and constant fields in successive blocks of and map them to separate chunks in memory. Now memory accesses are most efficient when done per cache line. For instance, if cache lines are 1024 bits long and if we are working over 64 bit double precision numbers, then fetching > from memory will also fetch ,\,D>. The cache performance of an SLP will therefore be best when large contiguous portions of the program access only a proportionally small number of data fields >, preferably by chunks of 16 contiguous fields. The theoretical notion of partially captures these ideas. Divide and conquer algorithms tend to be cache oblivious and thereby more efficient. Computing the product of two n> matrices using a naive triple loop is not cache oblivious (for large ) and can therefore be very slow. In the terminology of this subsection, high quality SLP should in particular be cache oblivious. 
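The difference between the two orderings in (3) can also be written down as plain straight-line C++ code. The sketch below spells out its own conventions for the fused operations (via std::fma with a negated argument) and is only meant to illustrate the dependency chains; it does not claim to reproduce the exact argument conventions of (3).

<\cpp-code>
#include <cmath>

// Two straight-line versions of (x0 + i*y0) * (x1 + i*y1).  Both compute
// re = x0*x1 - y0*y1 and im = x0*y1 + y0*x1; only the instruction order
// differs.

// Version 1: each fused operation waits for the product just before it.
void cmul_chained (double x0, double y0, double x1, double y1,
                   double& re, double& im) {
  double p = x0 * x1;
  re = std::fma (-y0, y1, p);     // p - y0*y1
  double q = x0 * y1;
  im = std::fma (y0, x1, q);      // q + y0*x1
}

// Version 2: the two plain products are issued first, so they can run on
// two multiplication units before the fused operations consume them.
void cmul_interleaved (double x0, double y0, double x1, double y1,
                       double& re, double& im) {
  double p = x0 * x1;
  double q = x0 * y1;
  re = std::fma (-y0, y1, p);
  im = std::fma (y0, x1, q);
}
</cpp-code>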
Where does the \Pinitial SLP\Q come from in the workflow()? It is not very convenient to oblige users to specify using pseudo-code like(), (), or(). Now we often already have some existing implementation of , albeit not in our SLP framework. In this section we explain how to derive an SLP for from an existing implementation, using a mechanism that records the trace of a generic execution of . Modulo the creation of appropriate wrappers for recording, this mechanism applies for any existing code in any language, as long as the code is either generic or templated by a domain > that implements at most the signature >. By \Pat most\Q, we mean that the only allowed operations on elements from > are the ones from >. Recording can be thought of as a more powerful version of inlining, albeit restricted to the SLP setting: instead of using complex compilation techniques for transforming the existing code, it suffices to implement a cheap wrapper that records the trace of an execution. Hence the names and . Let us describe how to implement the recording mechanism for existing C or C++ code over a domain >. For this we introduce a type , which could either be a template with > as a parameter, or a generic but slower recorder type that works over all domains. In order to record a function, say \\>, we proceed as follows: <\cpp-code> slp record_f () { \ \ start_recorder (); \ \ recorder in1= input (); \ \ recorder in2= input (); \ \ recorder out= f (in1, in2); \ \ output (out); \ \ return stop_recorder (); } In this code we use global variables for the state of the recorder. This state mainly consists of the current quadruple >, where are thought of as output streams (rather than vectors). When starting the recorder, we reset to being empty. Instances of the type represent indices of data fields in . The instruction adds a new zero element to, adds the corresponding index ( in this case) to , and next stores it in . Similarly, the code adds a second zero element to , adds the corresponding index to , and then stores it in . We next execute the actual function on our input data. For this, the data type should implement a default constructor, a constructor from >, and the operations from> (as well as copy and assignment operators to comply with C++). The default constructor adds a new auxiliary data field to , whereas the constructor from> adds a new constant field. An operation like multiplication of and creates a new auxiliary destination field in and streams the instruction > into . At the end, the instruction adds to and returns the SLP represented by the quadruple >. <\remark> The above implementation introduces a new auxiliary data field for every operation. This results in SLPs that use more data fields than necessary. Let us describe how to improve this. First of all, we now represent instances of by an index, as before, together with a flag that indicates whether the instance corresponds to an auxiliary field or to an input, output, or constant field. We next create a destructor for : whenever we destroy an auxiliary field, we add its index to a stack. This stack of destroyed data fields is maintained as part of the global state and emptied when calling . We next adapt the default constructor of auxiliary fields to reclaim them from the stack and only introduce anew data field when the stack isempty. <\remark> When the same constant is constructed multiple times during the execution of, the above recording algorithm will put them in separate data fields. 
Another useful optimization is to maintain a table of constants (as part of the global state) and directly reuse constants that where allocated before. 2> matrix multiplication> Let us investigate in detail what kind of SLPs are obtained through recording on the example of 2> matrix multiplication. Assume that matrices are implemented using aC++ template type C\> with the expected constructors and accessors. Assume that matrix multiplication is implemented in the following way: <\cpp-code> template\typename C\ matrix\C\ operator * (const matrix\C\& M, const matrix\C\& N) { \ \ int r= rows (M), l= columns (M), c= columns (N); \ \ if (rows (N) != l) throw "error"; \ \ matrix\C\ R (r, c); \ \ for (int i=0; i\r; i++) \ \ \ \ for (int (j=0; j\c; j++) { \ \ \ \ \ \ C sum= 0; \ \ \ \ \ \ for (int k=0; k\l; k++) \ \ \ \ \ \ \ \ sum= sum + M(i,k) * N(k,j); \ \ \ \ \ \ R(i,j)= sum; \ \ \ \ } \ \ return R; } When applying this code for two generic 2> input matrices in a similar way as in the previous subsection, without the optimizations from Remarks and, we obtain the following SLP: <\equation> |||||||||||||||,\,x|)>>|>|>|>>|\0>|\0>|\0>|\0>>|\x\x>|\x\x>|\x\x>|\x\x>>|\x+x>|\x+x>|\x+x>|\x+x>>|\x\x>|\x\x>|\x\x>|\x\x>>|\x+x>|\x+x>|\x+x>|\x+x>>|>|>|>|,x,x,x|)>>>>>> (The order of this code is column by column from left to right.) In order to understand the curious numbering here, one should remind how constructors work: the declaration of gives rise to four default constructors of instances of . These are put in ,x,x,x> but never used. The constructor of next creates the constant , which is put in >. We then record the successive instructions, the results of which are all put in separate data fields. Note that \0,\,x\0> are not really instructions, but rather aconvenient way to indicate the constant fields ,\,D> all contain zero. When applying the optimizations from Remarks and, the generated code becomes as follows: <\equation> |||||||||||||||||,\,x|)>>|||>|\0>|>|>|>>|\x\x>|\x\x>|\x\x>|\x\x>>|\x+x>|\x+x>|\x+x>|\x+x>>|\x\x>|\x\x>|\x\x>|\x\x>>|\x+x>|\x+x>|\x+x>|\x+x>>|>|>|>|,x,x,x|)>>>>>> The numbering may look even odder (although we managed to decrease the length of). This is due to the particular order in which auxiliary data fields are freed and reclaimed via the stack. The example from the previous subsection shows that the recording mechanism does not generate particularly clean SLPs. This is not a problem, because right after recording a function we typically simplify the resulting SLP, as will be detailed in section below. However, the simplifier may not be able to correct all problems that we introduced when doing the recording in a naive manner. Simplification also tends to be faster for shorter and cleaner SLPs. Simplification might not even be needed at all. The advantage of the mechanism from the previous section is that it can be applied to an external template C++ library over which we have no control. One reason for the somewhat messy results() and() is that C++ incites us to create classes with nice high level mathematical interfaces. Behind the scene, this gives rise to unnecessary temporary variables and copying of data. We may avoid these problems by using a lower level in-place interface for and C\>. 
For instance: <\cpp-code> template\typename C\ void mul (matrix\C\& R, const matrix\C\& M, const matrix\C\& N) { \ \ int r= rows (M), l= columns (M), c= columns (N); \ \ if (rows (N) != l \|\| rows (R) != r \|\| columns (R) != c) \ \ \ \ throw "error"; \ \ for (int i=0; i\r; i++) \ \ \ \ for (int (j=0; j\c; j++) { \ \ \ \ \ \ mul (R(i,j), M(i,0), N(0,j)); \ \ \ \ \ \ for (int k=1; k\l; k++) \ \ \ \ \ \ \ \ fma (R(i,j), M(i,k), N(k,j), R(i,j)); \ \ \ \ } } When recording our SLP using this kind of code, we obtain a much cleaner result: <\equation> |||||||||||,\,x|)>>|>>|\x\x>|\x\x>>|\fma,x,x|)>>|\fma,x,x|)>>>|\x\x>|\x\x>>|\fma,x,x|)>>|\fma,x,x|)>>>|>|,x,x,x|)>>>>>> For this reason, the library implements various common domains (vectors, matrices, polynomials, modular numbers, balls, etc.) using an in-place style. As explained in section, this implementation is generic, which induces a performance penalty with respect to the above template implementation. But this does not really matter: the focus of our generic implementation is not on high execution speed, but on the generation of high quality SLPs through recording. At a second stage, we will be able to generate even more efficient machine code from these SLPs. <\remark> The in-place calling convention comes with the subtlety that destination arguments might collide with source arguments. This phenomenon, called , is both apain and a blessing. It is a pain because the above multiplication algorithm becomes incorrect in this situation. Fortunately there is the easy remedy to copy colliding arguments into auxiliary variables. Testing for collisions induces additional overhead when recording the code, but the generated SLP is not impacted if no collisions occur. What is more, if collisions do occur, then we actually have the opportunity to generate an even better SLP. This is for instance interesting for discrete Fourier transforms(DFTs), which can either be done in-place or out-of-place. In our framework, we may implement a single routine which tests in which case we are and then uses the best code in each case. The shifted perspective from efficiency of execution to quality of recorded code cannot be highlighted enough. One consequence is that the overhead of control structures only concerns the cost of recording the SLP, but not the cost of executing the SLP itself. For example, in Remark, this holds for the cost of checking whether there are collisions between input and output arguments. But the same holds for the overhead of recursive function calls. For instance, one may implement matrix multiplication recursively, by reducing an\\c>> product into an \\\c> and an \\\c> product with =||\>> and +r=r> (or using a similar decomposition for > or if these are larger). The advantage of such a recursive strategy is that it tends to have a better cache efficiency. Now traditional implementations typically recurse down only until a certain threshold, like ,c\64>, because of the overhead of the recursive calls. But this overhead disappears when the multiplication is recorded as an SLP, so we may just as well recurse down to =c=1>. An even more extreme example is relaxed power series arithmetic. Consider basic arithmetic operations on truncated power series such as inversion, exponentiation, etc. If the series are truncated at a fixed order that is not too large (say 256>), then lazy or relaxed algorithms are particularly efficient (see and section below). 
However, they require highly non-conventional control structures, which explains that they are not as widely used in computer algebra systems as they might be. This disadvantage completely vanishes when these operations are recorded as SLPs: there is still a little overhead while recording an SLP, but not when executing it. The shifted perspective means that we should focus our attention on the SLP that we wish to generate, not on the way that we generate it. For instance, the SLP() is still not satisfactory, because it contains several annoying dependency chains: the execution of the instruction ,x,x,x|)>> cannot start before completing the previous instruction ,x,x|)>> and similarly for the other FMAs. For CPUs with at least 12 registers, the following SLP would behave better: <\equation> |||||||||||,\,x|)>>|>>|\x\x>|\fma,x,x|)>>>|\x\x>|\fma,x,x|)>>>|\x\x>|\fma,x,x|)>>>|\x\x>|\fma,x,x|)>>>|>|,x,x,x|)>>>>>> Again, the SLP() could have been rewritten into this form automatically, using an appropriate optimization that detects dependency chains and reschedules instructions. Another observation is that the dependency chains were not present in(), so they were introduced by the optimizations from Remark; this problem can be mitigated by replacing the first-in-first-out \Pstack allocator\Q by one that disallows the last > freed registers to be immediately reused. But then again, it is even better to design the routines that we wish to record in such a way that none of these remedies are needed. For instance, when doing \\c> matrix products in the above recursive fashion, it suffices to recurse on or when \2> and \\>. The library comes with additional tools to ease programming according to this \Pshifted design perspective\Q: an abstract type for maps (and intended to be optimized for recording), abstract types and for data objects and references (that make it easy to access data structures in a strided way, for instance), and more. We plan to detail these more advanced features in future work. This section is dedicated to classical program transformations like common subexpression elimination, simplification, register allocation, etc. As explained in the introduction, this is part of the traditional theory of compilation, but the SLP framework gives rise to several twists. SLPs can become very large and should therefore be disposable as soon as they are no longer needed. Our resulting constraint is that all algorithms should run in linear time (or in expected linear time if hash tables are required). We recall that another important general design goal is to preserve the \Pquality\Q of an SLP as much as possible during transformations. The example SLPs (),(),(), and() from the previous subsection are all equivalent, but their quality greatly differs in terms of code length, data locality, cache efficiency, dependency chains, and general elegance. It is the responsibility of the user to provide initial SLPs of high quality in(). But it is on the SLP library to not deteriorate this quality during transformations, or, at least, to not needlessly do so. An extreme case of this principle is when the input SLP already accomplishes the task that a transformation is meant to achieve. Ideally speaking, the transformation should then simply return the input SLP. One typical example is the simplification of an already simplified SLP. For the sake of completeness, we first describe with some very basic and straightforward operations on SLPs that are essentially related to their representation. 
These operations are often used as building blocks inside more complex transformations. We start with the extraction of useful information about a given SLP. Concerning the variables in >>, we may build boolean lookup tables of length > for the following predicates: <\itemize> Is an input variable? Is an output variables? Is a constant? Is an auxiliary variable? Is modified ( does it occur as the destination of an instruction)? Is used ( does it occur as the argument of an instruction)? Does occur in the SLP at all ( as an input/output or left/right hand side)? We may also build lookup tables to collect information about the program itself. Let us focus on and recall that we use \Pbyte code\Q ,\,B|\|>-1>|)>> for the internal low level representation of . Every instruction =\,a,\,a|\|>>|)>> thus corresponds to a part of the byte code: for some offset > (with =0> and =o+|\|>+2> otherwise), we have >=\>, +1>=a>, +2>=a,\,B+|\|>+1>=a|\|>>>. For this representation, the following information is typically useful: <\itemize> An index table for the offsets ,\,o-1>>. For every destination or source argument >, the next location in the byte code where it is being used. Here we use the special values and |\|>> to indicate when it is not reused at all or when it is only used for the output. We also understand that an argument is never \Preused\Q as soon as it gets modified. For every destination or source argument >, the next location in the byte code where it is being assigned a new value or if this never happens. For every instruction, the maximal length of a dependency chain ending with this instruction. Clearly, all these types of information can easily be obtained using one linear pass through the program . We discussed several design options for the representations of SLPs in section. The options that we selected impose few restrictions, except for the one that all instructions have asingle destination operand. This freedom allows us to express low level details that matter for cache efficiency and instruction scheduling. However, it is sometimes useful to impose additional constraints on the representation. We will now describe several basic operations on SLPs for this purpose. The function computed by an SLP clearly does not depend on the precise numbering of the data fields in . If the input and output variables are pairwise distinct, then we may always impose them to be numbered as =0,\,I-1>=-1> and =,\,O-1>=+-1>: for this it suffices to permute the data fields. Similarly, we may require that all variables actually occur in the input, the output, or in the program itself: if \\\i> are the variables that occur in the SLP, then it suffices to renumber > into, for ,r-1>. When applied to the SLP from(), we have =k> for 8> and =k+4> for 8>, so we suppress useless data fields. A stronger natural condition that we may impose on an SLP is that every instruction participates in the computation of the output. This can be achieved using classical dead code optimization, which does one linear pass through in reverse direction. When combined with the above \Pdead data\Q optimization, we thus make sure that our SLPs contain no obviously superfluous instructions or data fields. Our representation of SLPs allows the same variables to be reassigned several times. This is useful in order to reduce memory consumption and improve cache efficiency, but it may give rise to unwanted dependency chains. 
For instance, the SLP() contains more dependency chains than(): the instruction \x+x> must be executed before \x\x>, but the instructions \x+x> and \x\x> can be swapped. In order to remove all dependency chains due to reassignments, we may impose the strong condition that all destination operands in our SLP are pairwise distinct. This can again easily be achieved through renumbering. Note that the non-optimized recording strategy from section has this property. In Remark, we discussed the in-place calling convention that is allowed by our formalism and we saw that aliasing may negatively impact the correctness of certain algorithms. For some applications, one may therefore wish to forbid aliasing, which can easily be achieved through numbering. This can be strengthened further by imposing all input and output variables ,\,I-1>,O,\,O-1>> to be pairwise distinct. A powerful optimization in order to avoid superfluous computations is common subexpression elimination (CSE). The result of each instruction in can be regarded as a mathematical expression in the input variables of our SLP. We may simply discard instructions that recompute the same expression as an earlier instruction. In order to do this, it suffices to build a hash table with all expressions that we encountered so far, using one linear pass through . In our more algebraic setting, this mechanism can actually be improved, provided that our domain > admits an algebraic hash function. Assume for instance that =\>> and let be a large prime number. Let ,\,x> be the arguments of the function computed by our SLP and let ,\,\> be uniformly random elements of >, regarded as asubset of the finite field =\/|p*\|\>>. Given any polynomial expression ,\,x|)>>, we may use ,\,\|)>\\> as an algebraic hash value for ,\,x|)>>. The probability that two distinct expressions have the same hash value is proportional to >, and can therefore be made exponentially small in the bit-size of and the cost of hashing. When using this algebraic hash function for common subexpression elimination (both for hashing and equality testing), mathematically equivalent expressions (like and or +2*x*y+y|)>*> and *-y|)>>) will be considered as equal, which may result in the elimination of more redundant computations. However, this algorithm is only \PMonte Carlo probabilistic\Q in the sense that it may fail, albeit with a probability that can be made arbitrarily (and exponentially) small. Due to its probabilistic nature, we do not use this method by default in our libraries. The approach also applies when=\>, in which case divisions by zero (in > but not in >) only arise with exponentially small probability. When implementing CSE in a blunt way, it is natural to use a separate data field for every distinct expression that we encounter. However, this heavily deteriorates the cache efficiency and reuse of data in registers. In order to preserve the quality of SLPs, our implementation first generates a blunt SLP with separate fields for every expression (but without removing dead code). This SLP has exactly the same shape as the original one (same inputs, outputs, and instructions, modulo renumbering). During asecond pass, we renumber the new SLP using numbers of the original one, and remove all superfluous instructions. Simplification is an optimization that is even more powerful than common subexpression elimination. Mathematically speaking, an SLP always computes a function, but differentSLPs may compute the same function. 
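For reference, the structural variant of this elimination pass can be sketched as follows, under the simplifying assumption that every destination field is assigned only once (as in the blunt recording mode); the probabilistic variant would replace the structural key by an evaluation at random points modulo p. This is an illustration on a toy instruction format, not the implementation used in our libraries.

<\cpp-code>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Simplified structural CSE on instructions (opcode, dest, src0, src1),
// assuming single assignment of destination fields.
struct instr { uint32_t op, dest, src0, src1; };

void cse (std::vector<instr>& P, uint32_t ndata) {
  std::vector<uint32_t> repr (ndata);           // field -> representative field
  for (uint32_t i = 0; i < ndata; i++) repr[i] = i;
  std::map<std::tuple<uint32_t,uint32_t,uint32_t>, uint32_t> seen;
  std::vector<instr> out;
  for (instr ins: P) {
    ins.src0 = repr[ins.src0];                  // redirect to representatives
    ins.src1 = repr[ins.src1];
    auto key = std::make_tuple (ins.op, ins.src0, ins.src1);
    auto it = seen.find (key);
    if (it != seen.end ())
      repr[ins.dest] = it->second;              // duplicate: drop and redirect
    else {
      seen[key] = ins.dest;
      out.push_back (ins);
    }
  }
  // (The output locations of the SLP would be redirected through repr too.)
  P.swap (out);
}
</cpp-code>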
In particular, the output of CSE computes the same function as the input. But there might be even shorter and efficient SLPs with this property. In full generality, finding most efficient SLP for a given task is a difficult problem. This is mainly due to the fact that an expression may first need to be expanded before an actual simplification occurs: <\eqnarray*> ||+x*y+y|)>*++y+x|)>*-x|)>+x**>>||>|-x|)>++y-x*y-x|)>+*y-x*y+x-x*y|)>>>||>|-x*y+y-x*y+y-x>>||>|-x|)>*+y+1|)>.>>>> In our implementation, we therefore limited ourselves to simple algebraic rules that never increase the size of expressions and that can easily be applied in linear time using a single pass through the input SLP: <\itemize> Constant folding: whenever an expression has only constant arguments, then we replace it by its evaluation. Basic algebraic simplifications like x>, \0>, x\0>, x\x>, \x\-x>>, \x>>, \x*y>, \fms>, etc. Imposing an ordering on the arguments of commutative operations: x+y>, \fma>, etc. In addition to this rewriting rule, we also use ahash function for CSE that returns the same value for expressions like and . Simplification is a particularly central operation of the and libraries: we typically apply simplification directly after recording, automatic differentiation, and various other high level routines on SLPs. Efficiency is therefore a firm constraint. Note that our simplification procedures also comprise CSE. In the library, we also \Pharmonized\Q the set > of basic operations with the set of simplification rules by introducing all possible versions of signed multiplication: besides the four versions x\y\z> of signed FMAs, our signature > also contains the operation \-x\y>. The resulting uniform treatment of signs is particularly beneficial for complex arithmetic, linear algebra (when computing determinants), polynomial and series division, automatic differentiation, and various other operations that make intensive use of signs. Just before generating machine code for our SLP in the workflow(), it is useful to further \Pmassage\Q our SLP while taking into account specific properties of the target hardware. Register allocation is a particularly crucial preparatory operation. Consider a target CPU or GPU with registers. In our SLP formalism, the goal of register allocation is to transform an input SLP into an equivalent SLP with the following properties: <\itemize> \r> and the first data fields of are reserved for \Pregisters\Q. For all operations in > except \Pmove\Q, all arguments are registers. Moving registers to other data fields \Pin memory\Q or are thought of as expensive operations that should be minimized in number. In our implementations, we also assume that r> for all input/output variables and constants, but this constraint can easily be relaxed if needed. There is a rich literature on register allocation. The problem of finding an optimal solution is NP-complete, but practically efficient solutions exist, based on graph coloring and linear scanning. The second approach is more suitable for us, because it is guaranteed to run in linear time, so we implemented a variant of it. More precisely, using one linear pass in backward direction, for each operand of an instruction, we first determine the next location where it is used (if anywhere). We next use a greedy algorithm that does one more forward pass through. During this linear pass, we maintain lookup tables with the correspondence between data fields and the registers they got allocated to. 
Whenever we need to access a data field that is not in one of the registers, we search for a register that is not mapped to a data field or whose contents are no longer needed. If such a register exists, then we take it. Otherwise, we determine the register whose contents are needed farthest in the future. We then save its contents in the corresponding data field, after which the register becomes available to hold new data. This simple strategy efficiently produces reasonably good (and sometimes very good) code on modern hardware. In the future, we might still implement some of the alternative strategies based on graph coloring, for cases when it is reasonable to spend more time on optimization. The complexity of our current implementation is bounded by |)>>, because we loop over all registers whenever we need to allocate a new one. The dependence on could be lowered to by using heaps in order to find the registers that are needed farthest in the future. This optimization would in fact be worth implementing: for very large SLPs, one may regard different L1 and L2 cache levels as generalized register files. It thus makes sense to successively apply register allocation for, say, 16384>, 256>, and 32>. The efficiency of this optimization might be disappointing though if our SLP does not entirely fit into the instruction cache: in that case the mere execution of the SLP will become a major source of cache misses. Unfortunately, the above simple allocation strategy is not always satisfactory from the dependency chain point of view. For instance, consider the evaluation of a constant dense polynomial of high degree r> using Horner's rule. The greedy algorithm might systematically reallocate the last constant coefficients to same register. This unnecessarily puts a \Phigh pressure\Q on this register, which results in unnecessary dependencies. The remedy is to pick a small number \r>, say =4>, and blacklist the last -1> allocated registers from being reallocated. As a result, the pressure will be spread out over >registers instead of a single one. <\remark> ARM CPUs only support register arguments for arithmetic instructions. OnX86 processors, many instructions allow for one indirect memory argument. In that case, we can release our requirement that all instructions except \Pmove\Q only accept register arguments. Consider the following two SLPs: <\equation> ||||,b,\,a,b|)>>||,b,\,a,b|)>>>|a+a>|>|a+a>>|s+a>||b+b>>|\>||s+a>>|s+a>||t+b>>|b+b>||\>>|t+b>||\>>|\>||s+a>>|t+b>||t+b>>|>||>>>>> Both SLPs are equivalent and compute the sums a+\+a> and b+\+b>. However, the first SLP contains two long dependency chains of length , whereas these dependency chains got interleaved in the second SLP. Modern CPUs contain multiple execution units for addition. This allows the processor to execute the instructions s+a> and t+a> in parallel for ,n>, as well as the first instructions a+a> and b+b>. Modern hardware has also become particularly good at scheduling ( dispatching instruction to the various execution units): if is not so large, then the execution of \b+b> in the first SLP may start while the execution of a+a> or s+a> is still in progress. However, processors have a limited \Plook-ahead capacity\Q, so this fails to work when exceeds acertain threshold. In that case, the second SLP may be executed almost twice as fast, and it may be desirable to find a way to automatically rewrite the first SLP into the second one. 
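Returning for a moment to the allocation strategy described at the beginning of this subsection, the following C++ sketch shows the two passes: a backward pass that records the next use of every operand, followed by a forward greedy pass that evicts the register whose contents are needed farthest in the future. The encoding is ours and several subtleties are ignored, such as the destination and the sources of a single instruction competing for registers, or the blacklisting of recently allocated registers.

#include <algorithm>
#include <climits>
#include <vector>

// Toy instruction: fields[0] is the destination, the remaining entries are sources.
struct Instr { std::vector<int> fields; };

// Backward pass to record next uses, then forward greedy pass with farthest-future
// eviction.  Returns a rough count of the moves between memory fields and registers.
int allocate(const std::vector<Instr>& prog, int n_fields, int n_regs) {
  // Pass 1: for every operand occurrence, the position of the next use of that field.
  std::vector<std::vector<int>> next(prog.size());
  std::vector<int> last_seen(n_fields, INT_MAX);
  for (int i = (int)prog.size() - 1; i >= 0; --i) {
    next[i].resize(prog[i].fields.size());
    for (size_t j = 0; j < prog[i].fields.size(); ++j) {
      int f = prog[i].fields[j];
      next[i][j] = last_seen[f];
      last_seen[f] = i;
    }
  }
  // Pass 2: greedy allocation.
  std::vector<int> reg_of(n_fields, -1), field_of(n_regs, -1), reg_next(n_regs, INT_MAX);
  int spills = 0;
  for (size_t i = 0; i < prog.size(); ++i)
    for (size_t j = 0; j < prog[i].fields.size(); ++j) {
      int f = prog[i].fields[j];
      if (reg_of[f] < 0) {                      // field not currently held in a register
        int r = -1;
        for (int k = 0; k < n_regs; ++k)        // free register, or one holding dead data?
          if (field_of[k] < 0 || reg_next[k] == INT_MAX) { r = k; break; }
        if (r < 0) {                            // none: evict the farthest-needed register
          r = (int)(std::max_element(reg_next.begin(), reg_next.end()) - reg_next.begin());
          ++spills;                             // store the evicted contents back to memory
        }
        if (field_of[r] >= 0) reg_of[field_of[r]] = -1;
        if (j > 0) ++spills;                    // load this source operand from memory
        reg_of[f] = r; field_of[r] = f;
      }
      reg_next[reg_of[f]] = next[i][j];         // remember when this register is needed next
    }
  return spills;
}

Let us now return to the problem of interleaved dependency chains raised in the previous paragraph.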
A nice way to accomplish this task is to emulate what a good hardware scheduler does, but without being limited by look-ahead thresholds. For this, we first need to model our hardware approximately in our SLP framework: we assume that we are given > execution units ,\,E>>, where each > is able to execute a subset \\> of operations with prescribed latencies and reciprocal throughputs. We next emulate the execution of using these execution units. At every point in time, we have executed a subset of the instructions in and we maintain a list of instructions that are ready to be executed next (an instruction being ready if the values of all its arguments have already been computed). Whenever this list contains an instruction for which there exists an available execution unit, we take the first such instruction and dispatch it to the available unit. We implemented this approach in the library, but the outcomes of our experiments so far have been erratic and inconclusive. For some medium-sized SLPs, we observed an increased performance, but for large SLPs the performance often deteriorates sharply. The reason might be that rescheduling may radically modify the structure of our SLP, by permuting instructions that are very far away. This typically destroys data locality and cache performance. We plan to remedy this problem in future work. Note finally that it is often possible to interleave instructions from the outset (as in the right-hand SLP of()) when generating initial SLPs in the workflow(). For instance, if we just need to execute the same SLP twice on different data, then we may work over the domain > instead of > (see sections and below). More generally, good SLP libraries should provide a range of tools that help with the design of well-scheduled initial SLPs. Several operations in our signature > may not be supported by our target hardware. Before transforming our SLP into executable machine code, we have to emulate such operations using other operations that are supported. Within a chain of transformations of the initial SLP, there are actually several moments when we may do this. One option is to directly let the backend for our target hardware take care of emulating missing instructions. In that case, we typically have to reserve one or more scratch registers for emulation purposes. For instance, if we have an operation but no operation, then the instruction fms> can be emulated using two instructions \-c> and fma|)>>, where > stands for the scratch register. However, this strategy has two disadvantages: we diminish the number of available registers, even in the case of an SLP that does not require the emulation of any instructions. For SLPs that contain many instructions in a row, we also create long dependency chains, by putting a lot of pressure on the scratch register. Another option is to perform the emulation using a separate transformation on SLPs. This has the disadvantage of adding an additional step before the backend (even for SLPs that contain no instructions that need to be emulated), but it generally results in better code. For instance, for a long sequence of instructions as above, we may emulate them using > scratch variables ,\,\>> that are used in turn (very similar to what we did at the end of section). When doing the emulation before register allocation, the scratch variables need not be mapped to fixed registers, which allows for an even better use of the available hardware registers. 
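As a concrete instance of this second option, the emulation of a missing fms instruction before register allocation might look as follows; the encoding and names are again ours, and the scratch fields circulate among a few fresh data fields so as to avoid a single long dependency chain.

#include <vector>

enum class Op { Add, Mul, Neg, Fma /* a*b + c */, Fms /* a*b - c */ };
struct Instr { Op op; int dest, a, b, c; };

// Replace every  d := fms(a, b, c)  by  s := -c;  d := fma(a, b, s), where the scratch
// field s circulates among n_scratch fresh data fields in order to spread the pressure.
std::vector<Instr> emulate_fms(const std::vector<Instr>& prog, int n_fields, int n_scratch = 4) {
  std::vector<Instr> out;
  int turn = 0;
  for (const Instr& ins : prog) {
    if (ins.op == Op::Fms) {
      int s = n_fields + (turn++ % n_scratch);             // next scratch field
      out.push_back({Op::Neg, s, ins.c, 0, 0});            // s := -c
      out.push_back({Op::Fma, ins.dest, ins.a, ins.b, s}); // dest := a*b + s
    } else {
      out.push_back(ins);
    }
  }
  return out;
}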
It is finally an interesting challenge how to support certain types of instructions inside our SLP framework, such as carry handling, SIMD instructions, guarded instructions, etc. We plan to investigate the systematic incorporation and emulation of such features in future work. There exist several general purpose libraries for building machine code on the fly inside memory. This avoids the overhead of writing any code to disk, which is auseful feature for JIT compilers. In our SLP framework, where we can further restrict our attention to a specific signature > and a specific domain >, it is possible to reduce the overhead of this building process even further. Especially in our new library, we carefully used the C++ template mechanism for the construction of highly efficient code builders for various types and various targetCPUs. These builders can directly stream instructions to a sufficiently large chunk of memory (using similar conventions as the streaming mechanisms that we implemented for SLPs; see section). They can also directly build an entireSLP, provided that we already took care of register allocation and the emulation of any potentially missing operations. Our SLP library ultimately aims at providing a uniform interface for a wide variety of target architectures, including GPUs. Our current implementation in supports both and , but these high level C/C++-like languages rely on slow compilers to generate the actual machine code. Unfortunately, vendors of GPUs provide surprisingly little information about their hardware and the binary instruction layout. Fortunately, we only need a tiny subset of the available operations, which might allow us to retro-engineer the layouts for this subset on some of the most popular GPUs. We should then be able to generalize our low-level streaming builders to also support these GPUs. Building efficient machine code for our SLPs is only interesting if we wish to evaluate our SLPs many times. When all input data for these evaluations are available beforehand, then this evaluation task becomes embarrassingly parallel. The SLP library should easily allow us to exploit the available parallelism on the target architecture. If our hardware supports SIMD parallelism of width for our domain >, then we systematically support builders for the SIMD type > and provide a wrapper for doing multiple evaluations of our SLP approximately times faster. In the case of GPUs, may become as large as the number of available threads. For modern CPUs with multiple cores, we also provide a multi-threaded interface for evaluating SLPs many times. The implementation of an efficient and versatile SLP library requires a lot of efforts. Someof this work might be useful beyond the strict framework of SLPs. In this , we briefly mention some low hanging fruit within hand's reach. The SLP framework can be overkill for some applications. The typical workflow() is indeed relatively long: we first construct anSLP, then transform and optimize it, then prepare it for the backend (register allocation, emulation, etc.), and finally generate the machine code. The full building cost is often about thousand times higher than the cost of a single execution of the final machine code. For particularly simple programs, it is tempting to bypass the entire SLP layer and directly build the machine code. This might for instance be useful for the repeated evaluation of dense or sparse polynomials. We also plan to implement special shortcuts for bilinear and trilinear SLPs. 
Such SLPs can often be evaluated more efficiently for transformed representations of the inputs and outputs ( modulo several primes or in a suitable FFT model). One may even considering rewriting general SLPs to benefit from such accelerations4.4.2 and4.4.3>. In other circumstances, we may miss basic control structures. One common application is the repeated evaluation of an SLP on different inputs, as discussed at the end of the previous subsection. Instead of generating a function for our SLP and calling this function inside a loop, one may eliminate the overhead of function calls by putting this loop directly around the code of our SLP. This is also useful for the generation of efficient codelets for discrete Fourier transforms(see section below). Another common design pattern is the repeated evaluation of an SLP until some condition is met. This is for instance useful for numerical homotopy continuation(see section). The recording strategy from section also reaches its limits when the resulting SLP is so large that the corresponding machine code does not fit into the instruction cache. In that case, we may wish to automatically factor out large identical sub-SLPs (up to renumbering) and automatically replace them by function calls. This could be achieved by tagging function calls during recording or by using a dedicated post-processing step. We plan to extend our builders to generate light-weight for the above purposes and gracefully intertwine it with the machine code for the SLPs themselves. It is not yet clear to us yet how far these extensions should go. For particularly useful design patterns, it will be faster to develop dedicated implementations and optimizations. Butif we need too many types of control sugar, we may end up with a general purpose compiler. As a general rule, there is clearly a trade-off between the speed of generating machine code and the quality of the generated code (and its speed of execution). The best choice essentially depends on the number of times that we wish to execute our code. An ideal library should be able to efficiently support the largest possible range of \Pnumbers of executions\Q. The full strength of the SLP framework becomes apparent for algebraic transformations. Traditional compilers may exploit some algebraic properties during the optimization and code generation phases, but this typically remains limited to the kind of constant folding and basic rewriting rules that we have discussed in section. One reason is that general purpose languages possess no deeper knowledge about mathematical types that a user might introduce. Since there is typically no way to easily compute with programs themselves, it is also not straightforward to extend the compiler with such knowledge. Another reason is that more sophisticated algebraic optimizations are typically easier to design for SLPs than for full blown programs in a high level language. In this section we discuss operations on SLPs that exploit mathematical properties of the domain >. The algorithms still use rewriting and static analysis techniques from the traditional theory of compilation, but also heavily rely on mathematical manipulations that can commonly be found in computer algebra systems. We will first recall a few classical techniques like vectorization and automatic differentiation. We next introduce the concept of SLP algebras which provides a general framework for the design of algebraic transformations. 
We finish with several less classical transformations and examples how to use SLP algebras. A simple but useful transformation is SIMD-style vectorization. In a traditional language that supports templates, like C++, , or , we can implement atemplate type> for vectors of length with entries in >. Given a function \\> that is templated by>, the instantiation of this function for > then allows for the -fold evaluation of , provided that this instantiation is legit. In the SLP setting, vectorization is also straightforward to implement: it essentially suffices to duplicate all inputs, outputs, data fields, and instructions times. For instance, consider the SLP <\equation> ,x|)>>>|\1>>|\fma,x,x|)>>>|\fma,x,x|)>>>||)>.>>>>> Vectorizing this SLP for yields the following SLP: <\equation*> ,x,x,x|)>>>|\1>>|\1>>|\fma,x,x|)>>>|\fma,x,x|)>>>|\fma,x,x|)>>>|\fma,x,x|)>>>|,x|)>.>>>>> Note that this vectorized SLP has a high quality in the sense of section, because dependency chains are well interleaved. Whenever an SLP needs to be executed several times, it may therefore be beneficial to vectorize it this way (for a small like ) and then perform the execution by groups of . Although most compilers do not directly integrate support for automatic differentiation, dedicated tools for this task are available for many languages; see . Automatic differentiation can be done in two modes: the forward mode (differentiating with respect avariable) and the backward mode (computing gradients). The forward mode is the easiest to implement. Mathematically speaking, in order to differentiate \\;,\,x|)>\f,\,x|)>> with respect to, say, ,\,x> with n>, itsuffices to evaluate in the algebra =\,\,\|]>/*\\1\i\j\k|)>>: <\equation*> f+\,\,x+\,x,\,x|)>=f,\,x|)>+ f|\ x>,\,x|)>*\+\+ f|\ x>,\,x|)>*\. The forward mode can therefore be implemented directly in the language itself, provided that the language allows us to implement as a template and instantiate it for >. Forward differentiation can be implemented for SLPs in a similar way as vectorization: with the notation from the section, we now take k+1> and replace every instruction by one or more instructions according to the action of> in >. For instance, for , the forward derivative of the SLP() is as follows: <\equation*> ,x,x,x|)>>>|\1>>|\0>>|\fma,x,x|)>>>|\fma,x,x|)>>>|\fma,x,x|)>>>|\fma,x,x|)>>>|\fma,x,x|)>>>|\fma,x,x|)>>>|,x|)>.>>>>> The value of ,x|)>\x+x+1> and its derivative with respect to > can be computed simultaneously by evaluating this SLP at,1,x,0|)>>>. Baur and Strassen also proposed an efficient algorithm for automatic differentiation in the backward mode. Their method relies on applying the chain rule in backward direction. As a consequence, it can not be implemented by evaluating in the usual forward direction over some suitable >-algebra. Instead, specific compiler support or adedicated program is required in order to use this method. In the SLP setting, where programs are treated as data, the forward and backward modes can be implemented in a similar way; we only need to reverse the direction in which we process and apply the chain rule. Automatic differentiation in both directions typically generates a lot of trivial instructions of the kind b+0> or fma>. It is therefore recommended to always simplify the resulting SLP as explained in section. In fact, automatic differentiation is an excellent test case for simplification routines. 
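For comparison, the template-based forward mode in a general purpose language amounts to the following minimal C++ sketch (which is not the interface of our libraries): instantiating a templated function with a dual number type computes a value together with one derivative.

#include <iostream>

// Minimal dual numbers a + b*eps with eps^2 = 0: the eps-part carries the derivative.
template <typename C>
struct dual {
  C val, der;
  dual(C v = C(0), C d = C(0)) : val(v), der(d) {}
};
template <typename C> dual<C> operator+(dual<C> x, dual<C> y) { return {x.val + y.val, x.der + y.der}; }
template <typename C> dual<C> operator*(dual<C> x, dual<C> y) { return {x.val * y.val, x.val * y.der + x.der * y.val}; }

// The function f(x0, x1) = x0^2 + x1^2 + 1 from the running example, written as a template.
template <typename C>
C f(C x0, C x1) {
  C one(1);
  return x0 * x0 + x1 * x1 + one;
}

int main() {
  // Value of f and derivative with respect to x0 at (2, 3): seed x0 with derivative 1.
  dual<double> y = f(dual<double>(2, 1), dual<double>(3, 0));
  std::cout << y.val << " " << y.der << "\n";   // prints 14 4
}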
Note also that it is less straightforward to benefit from simplification when implementing forward differentiation using templates in a general purpose language, as described above. Assume that > is a ring and consider a linear map \\> that can be computed by an SLP of length >. Then the states that the transposed map >:\\\>> can be computed by another SLP of length +O>. The principle can be traced back to, in a different terminology of electrical networks. An early statement in terms of computational complexity can be found in4 and5>. See for some applications in computer algebra. Like in the case of backward differentiation, the transposition principle can be implemented using one backward linear pass. Sometimes, the input SLP is only linear in asubset of the input variables, in which case we may apply the transposition principle with respect to these variables only, while considering the other input variables as parameters. Backward differentiation can then be regarded as a special case of the transposition principle, the SLP together with its Jacobian being linear with respect to the variables that correspond to infinitesimal perturbations. <\example> Consider the following SLP: <\equation> ,a,b,b|)>>>|\a+a>>|\b+b>>|\a\b>>|\a\b>>|\fms,b,c|)>>>|\c-c>>|,c,c|)>>>>>> This SLP \\> computes the product +c*X+c*X=+a*X|)>*+b*X|)>> using Karatsuba's algorithm8.1>. For fixed ,b>, the map ,b>:\\\;,a|)>>\f,a,b,b|)>> is linear in ,a>. The transposed map ,b>>:\\\;>,>,>|)>\>,>|)>> computes the so-called of >+>*X+>*X> and +b*X>. We may obtain an SLP for the map \\;>,>,>,b,b|)>\f,b>>>,>,>|)>> in three steps: \; <\equation*> ||||||||,x>,,x>|)>>>|\x+x>>|\x+x>>>|\x\x>>|\x\x>>|\fms,x,x|)>>>|\x-x>>|\x>>|\x>>|\x>>|,x,x>|)>>>>>>\|||||||||||||,x,x>,,x>|)>\>|>|\0>|\x-x>>|\x+x>>|\fma,x,x|)>>>|\x>|\x-x>>|\x>|\x>>|\x>|\fma,x,x|)>>>|\x>|\x>>|\x>|\fma,x,x|)>>>|\x>|\x>>|\x+x>|\x+x>>|\x+x>|\x+x>>|\x+x>|\x>>|>|,x>|)>>>>>>\,x,x>,,x>|)>>>|\x+x>>>|\x-x>>|\x\x>>|\x-x>>|\fma,x,x|)>>>|\fma,x,x|)>>>|,x>|)>>>>>>\ The left-hand SLP is a renumbered version of() to which we appended three instructions to copy the output (without this, the transposed SLP would override the input arguments). The middle SLP shows its raw unsimplified transposition. The background colors indicate the images of the individual instructions. For instance, the instruction \x-x> is equivalent to <\equation*> >>|>>>>>\|>||>>>>*>>|>>>>>, which leads to the transposed code \x-x>. The instruction \x+x> is treated apart, because it only depends on the parameters ,x> and not on the main input variables ,x>: all instructions of this kind are extracted from the input SLP and prepended to the transposed SLP. For the transposition itself, the parameters >, > and variables like > are considered as constants. The right-hand SLP shows the simplification of the raw transposed SLP. We both applied usual simplification as in section and an optimization that combines multiplications and additions into FMA instructions. Consider a rational map \\>. Its is the rational map :\\\> such that ,\,x,t|)>> is homogeneous with ,\,x,1|)>=f,\,x|)>> for ,m>>. The homogenization of an SLP can be constructed using a single linear pass, during which we keep track of the degrees of all intermediate results that were \Pcomputed\Q so far. 
The algorithm is best illustrated on an example: <\equation*> >|>|deg y\1>|>|>>|x+y>||1>||x+y>>|a\a>||2>||a\a>>|b-y>||||\y\t>>|||2>||b-y>>|fma>||||t\t>>|||||\y\u>>|||3>||fma|)>>>|x/a>||-2>||x/a>>|b+7>||||\b\u>>|||0>||b+7>>|>||||>>>>> The left-hand side shows the input SLP and the right-hand side its homogenization. We also indicate the degrees of all intermediate results. During additions, subtractions, andFMAs, we multiply with appropriate powers of to keep things homogeneous. We also maintain a table of powers of that were computed so far, in order to avoid needless recomputations. This procedure actually extends to other operations, as long as the notion of degree extends in a reasonable way. For instance, for the operations , >>, and |\|>>, we may take \max >, \*deg f>, and \deg f>. Let > be a fixed signature and let >, > be domains. We say that > is a domain > if we have afunction :\\\> in order to reinterpret elements in > as elements in >. comes with various fundamental template types that allow us to construct domains over >, like the domain > from section or the domain > from section. As we saw in section, the library uses an abstract data type and also provides constructors for many basic domains > over >. Now consider the case when elements of > are represented as vectors in > and that > is implemented as an SLP >:\\\>. Assume also that every operation \\> is implemented on> as an SLP >:\|\|>>\\>, where inputs |)>d>|)>|\|>>\|)>|\|>>> are represented as ,\,x,\,x|\|>-1,0>,\,x|\|>-1,d-1>|)>\\|\|>>>. Then we call > an and call its over >. The recording technique from section yields a convenient way to turn any domain> over > into an SLP algebra: we obtain >> and every>> by recording :\\\> and :\|\|>>\\>> over >. Conversely, > and > can be obtained from>> and>> through evaluation. <\remark> The examples of > and > from sections and are indeed >-algebras in the mathematical sense. In general, the signature > typically contains the algebra operations, but we do not insist on > to verify all mathematical axioms of an >-algebra. Let > be a domain over >, whose elements are represented as vectors in >. Given an SLP over >, we may naturally execute over >. Using the recording technique from section, we may then obtain a new SLP, which we call the of as an SLP over >. Similarly, since every operation in \\> can be applied over both > and >, and constants in > can be lifted to a constant in >, any SLP over > can be evaluated over>. Recording the trace of such an evaluation, we obtain a new SLP that we call the of to>. Note that this is still an SLP over >. If > is actually an SLP algebra over >, then SLPs can be reinterpreted or lifted more efficiently using >> and the >>, while bypassing the recording technique. In sections and, this is what we did in an way for vectorization and forward differentiation. But something similar can be done to lift SLPs to general SLP algebras >. Given an input SLP >, we first renumber all data fields in >> and the >> such that they carry numbers *d> and such that each operations acts on separate data fields (some optimizations are actually possible here, but this is not crucial for the general idea). Essentially, it then suffices to replace every instruction ,a,\,a|\|>>|)>>> in the input SLP by >>, while substituting *d,\,a*d+d-1,\,a|\|>>*d,\,a|\|>>*d+d-1> for the input arguments and *d,\,a*d+d-1> for the output arguments. The constant fields are lifted using >>. However, additional adjustments need to be made for instructions that involve aliasing. 
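Ignoring aliasing for a moment (this is precisely the issue addressed next), the lifting loop for the special case of complex numbers stored as pairs of real data fields might look as follows; the encoding is once more our own, and only addition and multiplication are treated.

#include <vector>

enum class Op { Add, Mul, Fma /* a*b + c */, Fms /* a*b - c */ };
struct Instr { Op op; int dest, a, b, c; };

// Lift an SLP over R to an SLP over C: real field k becomes the pair of real fields
// (2k, 2k+1) holding the real and imaginary parts.  We assume that the destination of
// every instruction is distinct from its sources (no aliasing).
std::vector<Instr> lift_to_complex(const std::vector<Instr>& prog, int n_fields) {
  int scratch = 2 * n_fields;                      // one extra real field for products
  std::vector<Instr> out;
  for (const Instr& ins : prog) {
    int dr = 2 * ins.dest, di = dr + 1;
    int ar = 2 * ins.a, ai = ar + 1, br = 2 * ins.b, bi = br + 1;
    switch (ins.op) {
      case Op::Add:                                // componentwise addition
        out.push_back({Op::Add, dr, ar, br, 0});
        out.push_back({Op::Add, di, ai, bi, 0});
        break;
      case Op::Mul:                                // (ar + i*ai) * (br + i*bi)
        out.push_back({Op::Mul, scratch, ai, bi, 0});         // s  := ai*bi
        out.push_back({Op::Fms, dr, ar, br, scratch});        // dr := ar*br - s
        out.push_back({Op::Mul, scratch, ai, br, 0});         // s  := ai*br
        out.push_back({Op::Fma, di, ar, bi, scratch});        // di := ar*bi + s
        break;
      default: break;                              // fma and fms over C: omitted here
    }
  }
  return out;
}

We now turn to the adjustments that are needed in the presence of aliasing.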
For instance, assume that =\|]>> is the algebra of complex numbers over>, where>> is implemented using the right-hand SLP from(). Assume that ,z> are variables that get lifted into ,y,x,y>>. Then the instruction \z\z> would be lifted into <\equation> \x\x>>|\x\y>>|\fms,y,x|)>>>|\fma,y,y|)>,>>>>> which is incorrect due to the aliasing. We thus have to do some additional renaming and/or copying in order to restore the correctness. In our example, it suffices to replace() by <\equation*> \x\x>>|\x\y>>|\fms,y,\|)>>>|\fma,y,\|)>,>>>>> where ,\> are separate scratch variables. This corresponds to replacing our SLP() for complex multiplication by the following one in presence of aliasing: <\equation> ,y,x,y|)>>>|\x\x>>|\x\y>>|\fms,y,\|)>>>|\fma,y,\|)>>>|,y|)>>>>>> In a similar way as at the end of sections and, we should also avoid putting too much register pressure on the scratch variables ,\>. This can be done by circulating the scratch variables among > possibilities, as usual. Note that it can be detected automatically whether an SLP like the right-hand of() needs to be corrected when some of the arguments are aliased. Moreover, the appropriate correction may depend on the precise arguments that are aliased. For instance, consider the following implementation of complex FMAs: <\equation*> ,y,x,y,x,y|)>>>|\fma,x,x|)>>>|\fma,y,y|)>>>|\fms,y,x|)>>>|\fma,y,y|)>>>|,y|)>>>>>> A similar correction as in() is needed when lifting fma>, but no correction is needed when lifting fma>. In the library, we dynamically construct and maintain a table of corrections for each kind of aliasing. An interesting direction for future research would be to automatically combine certain instructions and lift them jointly. This would allow for heavier optimizations of the\Pjoint lift\Q of the combined instructions. For instance, assume that we wish to lift an SLP that repeatedly contains the following pattern: <\eqnarray*> |>|>|>|>|\c>>|>|>|\c>>||>|>>> where ,b,c,a,b,c> are pairwise distinct. We may view this pair of instructions as ajoint instruction <\eqnarray*> ,a|)>>||\c,b\c|)>.>>>> This instruction might be lifted as a single instruction over the SIMD type >. As highlighted in section, this typically yields a \Pjoint lift\Q of better quality than sequentially lifting the instructions \b\c> and \b\c>. More generally, we expect rescheduling as in section to be an interesting optimization on this kind of joined lifts. Ball arithmetic is a variant of interval arithmetic. Given a numerical algorithm and an input, the aim is to automatically compute a rigorous bound for the error of the output. In our case, the algorithm will be given by an SLP. The idea behind ball arithmetic is to systematically replace approximate numbers by balls that enclose the true mathematical values. According to this semantics, a of an operation > (which we will still denote by >) takes balls ,\,\|\|>>> as inputs and produces a ball ,\,\|\|>>|)>> that satisfies the : <\equation*> x\\,\,x|\|>>\\|\|>>\\,\,x|\|>>|)>\\,\,\|\|>>|)>. In order to construct such ball lifts, we also have to take into account rounding errors, when working with finite precision. For instance, let > be the set of standard IEEE-754 double precision floating point numbers. Given an operation > on real numbers, let >> denote its correctly rounded variant on >. Given \> and \>>, let > be the ball with center and radius. 
Then the standard arithmetic operations can be lifted as follows: <\eqnarray*> \\>|>|>b,>s|)>+>\>b|)>|)>>>|\\>|>|>b,>++>r|)>\>s|)>+>\>b|)>|)>,>>>> where \>, \>>, and for all \>, <\eqnarray*> >|>|\>2+>2.>>>> Other operations can be lifted similarly. Taking \\,\|)>\\x\\,r\>>|}>>, this allows us to implement > as an SLP algebra (note that > is not an algebra in the mathematical sense though: remind Remark). Lifting an SLP over > to >, we obtain a new SLP > that computes the usual numerical evaluation of , together with arigorous error bound for the obtained approximate value. The SLP > obtained through direct lifting has the disadvantage that a lot of time is spent on evaluations of >, on bounding rounding errors. It turns out that this can entirely be avoided by slightly increasing the radii of input balls. The precise amount of inflation can be computed as a function of the input SLP: see5> for a formula. The fact our program is an SLP allows us to do this computation , while building the output program > and not merely while executing it. This is not possible for general purpose programs that may involve loops of unpredictablelengths or recursions of unpredictable depths. Ball arithmetic can be generalized to balls with centers in more general algebras such as the complexification |]>> of >. However, the formula \+y>>|0.5ex>> for the absolute value of a complex number > involves a square root, which is fairly expensive to compute. In, we proposed to compute these absolute values using the formula =\> whenever we need to lift a product . In we implemented this as an optimization. An alternative approach is to implement aseparate SLP algebra for \Pcomplex balls with absolute values\Q. Such balls are of the form ,r;a|)>>, where \>, \>>, and |\|>>. Ignoring rounding errors, we may implement the standard arithmetic operations as follows: <\eqnarray*> \\>|>|>v,r+>s;>v|\|>>|)>>>|\\>|>|>v,>b+>r|)>\>s|)>;a\>b|)>.>>>> After lifting an SLP over > to |]>,\;\|)>>, we may again \Pforget\Q about the absolute values and project the output down to |]>,\|)>>. Removing all dead code from the resulting SLP (see section), we essentially obtain the same SLP as using the implementation from . The advantage of the alternative approach is that its correctness is more straightforward. It also directly extends to other operations such as division or square root. On the downside, it is a bit slower to first insert instructions that compute unneeded absolute values, only to remove them later during the dead code elimination. There are several ways to implement modular arithmetic in /|p*\|\>> when 2>. We refer to for a survey of efficient methods that can benefit from SIMD acceleration. In this section, we will use a variant of modular arithmetic that is based on machine double precision arithmetic, but the material can easily be adapted to other variants. As in the previous subsection, we assume that our hardware implements the IEEE754 standard for floating point arithmetic and we use rounding to the nearest. Recall that integers in >,>|]>> can be represented exactly in >. We assume that our modulus is odd with p\>> and also assume that \|)>> has been precomputed. We represent a residue modulo as an integer in > between > and>. The normal form >\r\>> of an integer \a\2> can be computed efficiently using the operation as follows: <\equation> |||>|u>>||>||q|\>>>||>|.>>>>> Here |q|\>> stands for the closest integer near . 
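In C++, this reduction, together with the modular product that will be built on top of it below, may be modelled as follows, assuming round-to-nearest and a modulus small enough for all intermediate quantities to remain exactly representable; the class and its names are purely illustrative.

#include <cmath>

// Modular arithmetic on top of double precision arithmetic: residues are stored as
// integers in (-p/2, p/2], u = 1/p is precomputed, and the reduction is exact as long
// as all intermediate quantities stay well below 2^53 in absolute value.
struct fp_mod {
  double p, u;
  explicit fp_mod(double p_) : p(p_), u(1.0 / p_) {}
  double reduce(double a) const {
    double q = std::nearbyint(a * u);      // closest integer to a/p (round-to-nearest)
    return std::fma(-q, p, a);             // a - q*p, computed with a single rounding
  }
  double mul(double a, double b) const { return reduce(a * b); }   // a*b must stay exact
};

For an odd modulus just below 2^20, say, the product of two reduced residues stays far below 2^53, so the multiplication above never loses information.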
Abbreviating this code by reduce>, this naturally leads to the following algorithms to compute the modular sum, difference, and product of and : <\equation> |||>|>||>|>>>>>\|||>|>||>|>>>>>\|||>|b>>||>|>>||>|>>||>|>||>|>>>>> Let > be the SLP algebra over >, where the arithmetic operations are implemented in this way. In the algorithms(), we note that the bulk of the computation time is spent on modular reductions. A common technique to avoid reductions is to delay them as much as possible>. This can be achieved by switching to a or representation of residues modulo by integers in the range >,>|]>>. In the above algorithms(), we may then suppress the final reduction reduce>, whenever we are able to guarantee that \>>. In the contrary case, we will say that the operation for the corresponding inputs and . Let > be the SLP algebra that corresponds to the above kind of non-normalized modular arithmetic: <\equation> |||>|>>>>\|||>|>>>>\|||>|b>>||>|>>||>|>>||>|>>>> We also extend > with an operation \P\Q that is implemented as the identity operation by default and as modular reduction() on >>. Now consider an SLP over > with inputs ,\,a-1>\\/|p*\|\>>. In absence of overflows, its execution over> produces the same output modulo as its execution over >. The correctness of the optimized execution over > can be restored in presence of overflows by inserting a reduction after every instruction that may cause an overflow. How to determine and correct the instructions that may cause overflows? In4.1>, we described a simple greedy algorithm for doing this. Roughly speaking, we start with the bounds |\|>,\,-1>|\|>\||\>> for the absolute values of our inputs. Using alinear pass through our SLP, we next compute bounds for the results of every instruction: for additions and subtractions a\>b>, we have \+>; for multiplications a\>b>, we have \||\>+**p/2>>. Whenever the bound for > exceeds >, we insert the instruction reduce>, and replace the bound by ||\>>. Optionally, we also insert instructions to reduce all output values at the end of the SLP. The redundant representation for the residues modulo allows for some additional optimizations. For instance, we may use the following algorithm for the fused operation fma> modulo : <\equation*> |||||>|>>||>|>>||>|>||>|>>||>|>>>> Although this implementation is not faster than the default implementation (a modular multiplication followed by a modular addition), it yields a better bound for the result: we obtain \||\>+**p/2>> instead of \||\>+**p/2>+>. This optimization may therefore help to lower the number of reductions that need to be inserted in order to counter potential overflows. Note that this subtle improvement is hard to find automatically and also requires us to include the \Pfma\Q operation into > (remember the discussion in section). Let us finally investigate the role of the modulus . In a typical C++ implementation, this is either a template parameter (in which case a recompilation is necessary in order to use a new ) or we pass it as an additional function argument (in which case, we cannot exploit special properties of at compile time). In our SLP setting, we dynamically generate specific code for any required modulus , while relying on the ability to \Pefficiently generate disposable machine code\Q. The specific value of was for instance used during the above bound computations. Certain simplifications may also exploit it, if a constant happens to be a multiple of plus zero, one, or minus one. 
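Returning to the bound computations mentioned above, a slight variant of this greedy pass may be sketched as follows in C++; the bounds are deliberately crude and the encoding is ours (in particular, the sharper bound for the fused operation discussed above is not reflected here).

#include <vector>

enum class Op { Add, Sub, Mul, Reduce };
struct Instr { Op op; int dest, a, b; };

// Simplified greedy pass for delayed modular reduction.  We track a bound on the
// absolute value of every data field; additions and subtractions accumulate bounds,
// a modular product needs arguments small enough for a*b to remain exact and then
// returns an (almost) reduced result.  Reductions are only inserted where needed.
std::vector<Instr> insert_reductions(const std::vector<Instr>& prog, int n_fields,
                                     double p, double safe /* e.g. 2^52 */) {
  std::vector<double> bound(n_fields, p / 2);        // inputs are assumed to be reduced
  std::vector<Instr> out;
  auto reduce = [&](int f) { out.push_back({Op::Reduce, f, f, 0}); bound[f] = p / 2; };
  for (const Instr& ins : prog) {
    switch (ins.op) {
      case Op::Add:
      case Op::Sub:
        if (bound[ins.a] + bound[ins.b] > safe) { reduce(ins.a); reduce(ins.b); }
        out.push_back(ins);
        bound[ins.dest] = bound[ins.a] + bound[ins.b];
        break;
      case Op::Mul:
        if (bound[ins.a] * bound[ins.b] > safe) { reduce(ins.a); reduce(ins.b); }
        out.push_back(ins);
        bound[ins.dest] = p;                         // output of the internal partial reduction
        break;
      default:
        out.push_back(ins);
        bound[ins.dest] = p / 2;
    }
  }
  return out;
}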
We may sometimes prefer to create SLPs for generic moduli . This remains possible, by treating as an input variable when recording. Of course, special restrictions will apply in this case, since it is no longer possible to do static optimizations that take into account the specific value of . For instance, it is necessary to tweak the above procedure to counter overflows: the bound \||\>+**p/2>> for multiplication must be replaced by \|>|2>|\>+**>/2>>, where >> is a fixed upper bound for the target moduli. There are various ways to conduct numerical computations with a precision that exceeds the hardware precision. Popular libraries like and represent unsigned multiple precision integers as -1>a*2>, where ,\,a-1>\\>> are machine integers that are often called . The efficient implementation of basic arithmetic then relies on hardware support for carry handling. For instance, the addition of two numbers of \3> limbs is implemented as follows on traditional hardware: <\equation> ,a,a,b,b,b|)>>|>|>|,a,b|)>>||>>|,a,b|)>>||>>|,a,b|)>>||>>|,d,d|)>>||>>>> Unfortunately, this code is not an SLP according to our definition, because the semantics of the carry handling is implicit: if this were an SLP, then the three additions would be independent, so permuting them in any order would yield an equivalent SLP. Various approaches to reconcile carry handling with our SLP framework will be presented in aforthcoming paper. Of course, one may also directly generate assembly code for programs like(), as a function of the desired extended precision. Another disadvantage of the reliance on carries is that SIMD units do not always support them. For instance, neither , , nor provide a vector version of \Padd with carry\Q. GPUs do support carries, but only through. One remedy to this problem is to use a redundant representation of multiple precision integers, like -1>a*2> with ,\,a-1>\\>>, and reserve the highest bits of ,\,a-1>> for carry handling. This representation is particularly interesting on recent and processors with support for the IFMA instructions: the unnormalized SIMD multiplication of two such integers can then be done in only > cycles. On CPUs and GPUs without efficient carry handling or IFMA instructions, one may rely on floating point arithmetic for efficient SIMD implementations of extended precision arithmetic. This is particularly interesting if the hardware favors floating point over integer arithmetic. In, we showed how to efficiently implement medium precision fixed point arithmetic in this way. Fixed point numbers can be represented redundantly as -1>a*2|)>*k>>, with ,\,a-1>\\\>*2|)>>>, where \\> is a small number thatcorresponds to the number of \Pcarry bits\Q (one typically has \\6>). The unnormalized SIMD multiplication of two such numbers takes approximately *\-*\> cycles. The techniques from can be adapted to multiple precision integer arithmetic. The above fixed point numbers form an SLP algebra, when > and > are fixed. As in the case of modular arithmetic, there is an operation \P\Q to put numbers in an almost normal form with |\|>\> for ,\-1>. Using a similar greedy algorithm as in the previous subsection, we may detect potential overflows and deal with them by inserting appropriate reductions. In the case of fixed point arithmetic, this gives rise to a trade-off: if we increase > by one, then we may lower the number of reductions, but also lose > bits of precision. 
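Coming back to the carry handling issue raised at the beginning of this subsection, one obvious, if naive, way to make the three-limb addition expressible as a genuine SLP is to turn the carries into ordinary zero-one values, as in the following portable C++ sketch (at the price of roughly twice as many instructions).

#include <cstdint>

using u64 = std::uint64_t;

// Three-limb addition with all carries made explicit as ordinary 0/1 values.  Every
// instruction now depends only on its explicit arguments, so the code is straight-line.
// At each position, at most one of the two partial carries can be set, so they may be
// combined using a plain addition.
void add3(const u64 a[3], const u64 b[3], u64 c[4]) {
  u64 s0 = a[0] + b[0];
  u64 c0 = (s0 < a[0]);                // carry out of limb 0
  u64 s1 = a[1] + b[1];
  u64 c1 = (s1 < a[1]);                // carry out of the limb-1 sum itself
  u64 t1 = s1 + c0;
  u64 c1b = (t1 < s1);                 // carry caused by adding the incoming carry
  u64 s2 = a[2] + b[2];
  u64 c2 = (s2 < a[2]);
  u64 t2 = s2 + (c1 + c1b);
  u64 c2b = (t2 < s2);
  c[0] = s0; c[1] = t1; c[2] = t2;
  c[3] = c2 + c2b;                     // top limb holds the final carry
}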
We may next develop floating point arithmetic on top of integer or fixed point arithmetic by managing all exponents ourselves. This approach is for instance used in and , but it is less suited for SIMD parallelization. Shift operations on mantissas are indeed non-trivial to implement in an SIMD fashion: see for some experiments in that direction. For the evaluation of an SLP on a bounded region, an alternative idea is to statically compute bounds for all exponents encountered during the computation. Ifwe allow mantissas to be non-normalized ( the leading bits could be zero), then we may use these bounds (possibly rounded up to a multiple of the bit-size of limbs) as our exponents and greatly reduce the cost ofshifting. Another way to extend the precision of floating point arithmetic is to use compensated algorithms such as 2Sum. In that case, a floating point number with >fold precision is represented as +\+x-1>>, where ,\,x-1>> are hardware floating point numbers that do not overlap (for double precision, this typically means that |\|>\2*|\|>> for ,\-1>). Algorithms for basic arithmetic operations in doubled precision (=2>) go back to. We refer to for generalizations to higher precisions. For a fixed value of >, this type of arithmetic again gives rise to an SLP algebra. Since the representation is normalized, no special reductions are needed to counter carry overflows. (However, additional efforts are required in order to fully treat floating point errors, overflows, and underflows.) Unfortunately, the overhead of the compensated algorithms approach tends to increase rapidly as a function of >, which limits its practical usefulness to the case when \4>. In this section, we discuss several applications for which we started using or plan to use our SLP libraries. We will highlight the added value of the SLP framework, and also investigate limitations and ideas to work around them. The list of selected applications is by no means exhaustive, but it will allow us to cover a variety of issues that are relevant for the development of SLP libraries. Numerical homotopy continuation is the archetype application of SLPs. Assume that we wish to solve a polynomial system <\equation> P,\,z|)>=\=P,\,z|)>=0, where ,\,P\\,\,z|]>>. Assume also that we can construct a \Psimilar\Q system <\equation> Q,\,z|)>=\=Q,\,z|)>=0 that is easy to solve. This typically means that both systems have the same number of solutions and the same total degrees: =deg Q,\,deg P=deg Q>. Now consider the following system that depends a \Ptime\Q parameter : <\eqnarray*> ,\,z,t|)>>|>|*P,\,z|)>+t*Q,\,z|)>=0>>||>|>|,\,z,t|)>>|>|*P,\,z|)>+t*Q,\,z|)>=0.>>>> For and this system reduces to() and(), respectively. The idea is then to follow the solutions of() from to . This can be achieved using various \Ppath tracking\Q algorithms. The most straightforward method uses Euler\UNewton iterations: given an approximate solution > at time , we get an approximate solution |)>> at time > using <\equation> |)>\- H|\ z>,t|)>|]>,t|)>+ H|\ t>,t|)>*\|)>. This formula can be used both as a predictor (with \0>) and as a corrector (with =0>>). One typically alternates prediction and correction steps. In order to control the step-size>, one uses a predicate for deciding whether a step needs to be accepted: in case of acceptance, we increase the step-size (say \1.2*\>); otherwise, we decrease it(say \\/2>). 
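In code, the outer tracking loop is little more than the following C++ skeleton; the stepper and the acceptance predicate (one such predicate is given next) would themselves be machine code generated from SLPs, and all names here are purely illustrative.

#include <functional>
#include <vector>

using Vec = std::vector<double>;   // real and imaginary parts of an approximate solution

// Skeleton of a path tracker from t = 1 down to t = 0.  `step` performs one
// Euler-Newton update z -> z(t - dt) (used with dt = 0 as a corrector) and
// `accept` implements an acceptance predicate.  A real implementation would
// also bound the number of consecutive rejections.
Vec track_path(Vec z, std::function<Vec(const Vec&, double, double)> step,
               std::function<bool(const Vec&, double, double)> accept) {
  double t = 1.0, dt = 1e-2;
  while (t > 0) {
    if (dt > t) dt = t;                       // do not step past t = 0
    Vec z_new = step(z, t, dt);               // predictor
    z_new = step(z_new, t - dt, 0.0);         // corrector (Newton iteration at fixed time)
    if (accept(z_new, t, dt)) {
      z = z_new; t -= dt; dt *= 1.2;          // accept and increase the step-size
    } else {
      dt /= 2;                                // reject and decrease the step-size
    }
  }
  return z;
}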
A simple predicate that one may use is <\equation> | H|\ z>,t|)>|)>* H|\ z>|)>,t+\|)>|\<\|\|\>>\, but other predicates may be suitable for certain systems. The quality of the first order approximation() can also be improved: if we have approximations |)>> and |)>> of the solution > at two successive times >, >, together with approximations |)>> and|)>> of the gradient \- H|\ z>,t|)>|]>* H|\ t>,t|)>>, then we may fit > using apolynomial of degree or , and use that to obtain a higher order approximation at anew time >. In the context of this paper, it is convenient to regard our systems as zeros of polynomial mappings \\>, \\>, and \\> that are given through SLPs. Using automatic differentiation, we may then construct SLPs for H|\ z>> and H|\ t>>. We may also record linear algebra routines for matrix multiplication, inversion, and norm computation. This enables us to construct SLPs for steppers() and verifiers(). \ The SLP framework has several advantages here. If our polynomials are sparse, then evaluation in the SLP representation may be significantly faster. Moreover, the sparsity is conserved by automatic differentiation. Even in the dense representation, the monomials are shared subexpressions between ,\,P>, so the evaluation of is typically faster than the separate evaluation of ,\,P>. Furthermore, we generate machine code for the evaluation, which is faster then using a loop over the terms of an expanded representation of . Full fledged SLP libraries also provide a lot of flexibility concerning the domain>. Usually, we take \\|]>> to be the domain of machine complex floating point numbers at double precision. But we may easily lift our SLPs to extended precision, as we saw in section. The techniques from section also allow us to efficiently evaluate SLPs over balls and then use this to construct certified path trackers. Generating machine code for evaluating a given SLP over > is fast, provided that we have efficient implementations for the methods from sections, and. In the SLP framework such codelets can thus be constructed on-the-fly whenever needed for the resolution of our specific input system. Similarly, if a system has multiple solutions, then automatic differentiation can be used to \Pdeflate\Q it on-the-fly into a system which turns these solutions into isolated ones. Numerical homotopy methods lend themselves very well to parallelization, since every separate path can be tracked using a different thread. SIMD acceleration of width may also be used to simultaneously track paths using a single thread. However, the required number of steps may differ from one path to another, so an mechanism is needed to detect complete paths and to feed the corresponding threads (or SIMD lanes) with new paths. The strict SLP framework is slightly insufficient for implementing such a mechanism. , when generating machine code for our SLP, we wish to be able to surround it by a loop that allows for multiple executions until some condition is met. As briefly discussed in section, this kind of control sugar is a welcome extension of the framework from this paper. Another major application of SLP libraries is the automatic generation of codelets for discrete Fourier transforms (DFTs) and variants. The codelet approach was pioneered in FFTW3 and later applied in other implementations. Given any desired order, the idea is to automatically generate a dedicated and highly optimized codelet to perform DFTs of this order. 
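To make the codelet idea concrete, here is a toy C++ generator (ours, not the one of any existing library) that unrolls a radix-2 DFT into straight-line source text; an actual codelet generator would build an SLP instead of text and then apply the optimizations discussed earlier.

#include <cstdio>
#include <string>
#include <vector>

// Emit a fully unrolled radix-2 decimation-in-time DFT of size n = in.size() (a power
// of two) as straight-line C++ statements on std::complex<double> values.  The returned
// vector holds the names of the variables containing the outputs, in frequency order.
std::vector<std::string> emit_dft(const std::vector<std::string>& in,
                                  std::string& code, int& counter) {
  size_t n = in.size();
  if (n == 1) return in;
  std::vector<std::string> even, odd;
  for (size_t i = 0; i < n; ++i) (i % 2 ? odd : even).push_back(in[i]);
  auto E = emit_dft(even, code, counter);            // recursive decimation in time
  auto O = emit_dft(odd, code, counter);
  std::vector<std::string> out(n);
  for (size_t k = 0; k < n / 2; ++k) {               // butterflies with twiddle factors
    double angle = -6.283185307179586 * double(k) / double(n);
    std::string t = "t" + std::to_string(counter++);
    std::string lo = "t" + std::to_string(counter++);
    std::string hi = "t" + std::to_string(counter++);
    char buf[512];
    std::snprintf(buf, sizeof buf,
      "auto %s = std::polar(1.0, %.17g) * %s;\n"
      "auto %s = %s + %s;\nauto %s = %s - %s;\n",
      t.c_str(), angle, O[k].c_str(),
      lo.c_str(), E[k].c_str(), t.c_str(),
      hi.c_str(), E[k].c_str(), t.c_str());
    code += buf;
    out[k] = lo; out[k + n / 2] = hi;
  }
  return out;
}

int main() {
  std::string code; int counter = 0;
  auto out = emit_dft({"x0", "x1", "x2", "x3"}, code, counter);
  std::printf("%s// outputs: %s %s %s %s\n", code.c_str(),
              out[0].c_str(), out[1].c_str(), out[2].c_str(), out[3].c_str());
}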
For small orders, the appropriate codelet is typically unrolled as an SLP and pre-compiled into the library. For larger orders, the idea is to decompose the transform into DFTs of smaller orders using classical techniques due to Good, Cooley\UTukey, Rader, etc. There are often many competing algorithms to compute aDFT, in which case we automatically compare running times and select the best strategy for the target length. This tuning phase may be expensive, but the best strategies can be cached. In order to achieve really high performance, it is actually profitable to unroll DFTs up till\Pnot so small\Q orders like . When pre-compiling all these \Pnot so small\Q DFTs into the library, this may considerably blow up its size. This is all the more true if one also wishes to include variants of DFTs such as the truncated Fourier transform>. In the SLP framework, one may efficiently build and compile these codelets on-the-fly, while preserving or even improving the efficiency at execution time. The SLP approach is also more versatile, because it allows us to dynamically choose our ground ring >. Traditional DFTs over> may therefore be generated on-the-fly for various working precisions. But one may also generate number theoretic transforms over =\> for various prime numbers , and Nussbaumer polynomial transforms over =\/\1|)>> for various degrees . In addition, we may apply algebraic optimization techniques from section, while exploiting particular mathematical properties of the ground ring>. This idea has been applied successfully in for DFTs in medium precision and number theoretic transforms over >, >>, and > for so-called smooth FFT primes . A few technical issues need to be addressed though in order to use an SLP library for this application. First of all, the design of DFTs of large orders heavily relies on tensor products: given an SLP \\> that computes a DFT (or another linear map) of order, we often need highly optimized code for \f:\\\> andId:\\\>>. If is small, then we simply generate SLPs for these tensor products. However, if 64>> and are both \Pnot so small\Q, then it is best to automatically put the code of inside a loop of length . This is another instance where control sugar is useful (see also sections and). Moreover, we may wish to pass as a parameter to the functions \f> and Id>. Another issue is to find the best strategy for computing a particular DFT. Our SLP library must therefore contain support for benchmarking competing strategies on-the-fly and selecting the best ones. Such a mechanism can be expensive, so this should be complemented by a caching system that remembers the best strategies. This is a non-trivial problem that we have not yet addressed in our present implementations. For instance, for DFTs of a fixed length over >, should we use the same strategy for different values of? We might also attempt to cleverly bench a reduced number of DFTs and try to build a model that can predict the running time of other DFTs. Consider the multiplication of two formal power series \|]>>. From an algorithmic perspective, we may deal with the infinite nature of power series in several ways. One idea is to work at a finite truncation order: given =f+\+f*z> and =g+\+g*z>, we then wish to compute =h+\+h*z>. Another approach, the (or ) one, is to regard and as streams of coefficients and to insist that <\equation> h=f*g+\+f*g needs to be output as soon as ,\,f> and ,\,g> are known. The naive way to compute the product in a relaxed manner is to directly use(). 
At order three, this yields <\eqnarray*> >||*g>>|>||*g+f*g>>|>||*g+f*g+f*g.>>>> But we may save one multiplication using Karatsuba's trick: <\eqnarray*> >||*g>>|>||+f|)>*+g|)>-f*g-f*g>>|>||*g+f*g+f*g.>>>> In fact, Karatsuba's algorithm is essentially relaxed, modulo reordering the instructions in an appropriate way. However, this reordering is non-trivial to program and the appropriate control structures induce a significant overhead. Our SLP framework has abig advantage here, since, no matter how complex these control structures are, theoverhead vanishes when the multiplication. <\example> Relaxed multiplication has the property that the coefficients > and > of the inputs are allowed to depend on earlier coefficients ,\,h> of the output. This yields a convenient way to solve power series equations. For instance, the inverse of a power series of the form satisfies the equation <\equation*> f*u=1+z*v-z*g-z*v*g=1, whence <\eqnarray*> ||>|>||+=g+v*g+\+v*g\1|)>.>>>> Now assume that we compute using the formula and Karatsuba's algorithm. When recording the trace of an expansion at order , we obtain the following SLP, after simplification: <\equation*> |,g,g,g|)>>|>|>>|\g>||\fma,v,g|)>>>|\v\g>||\v-h>>|\g+h>||\v-h>>|\v\g>||\fma,g,g|)>>>|\g+g>||\v+h>>|\v+v>||\fma,g,g|)>>>|>||,v,v,v|)>>>>>> Trading one multiplication against several additions and subtractions may not seem worth the effort, except maybe for larger orders. However, when lifting the SLP to adomain like > from section, multiplications may be far more expensive than additions and subtractions. Even for small orders, the relaxed approach may then lead to significant savings. If \\>, then the relaxed approach can in particular be used in order to solve asystem of ordinary differential equations (ODEs) of the form <\eqnarray*> >||,\,y|)>>>||>|>|>||,\,y|)>,>>>> where ,\,P\\,\,y|]>>. Indeed, it suffices to rewrite the system as <\eqnarray*> >||+P,\,y|)>>>||>|>|>||+P,\,y|)>,>>>> after which one may check that the coefficients ,\,y> only depend on earlier coefficients when evaluating the right-hand sides. For any given truncation order , we may record the resolution process as an SLP. The power series solution naturally gives rise to a\Pstepper\Q for the numerical integration of the system. However, the size of the SLP increases with : the factor is > for the naive method, 3>|)>> for Karatsuba's method, and > for large (see for asymptotically fast methods). It is an interesting problem to rewrite this SLP as multiple applications of a few SLPs of significantly smaller sizes (about > SLPs of the size of the original SLP might do, by associating one type of SLP to each size of squares in3> and doing the required transforms separately). This also raises the practical challenge of generating SLPs that call other SLPs: another example of control sugar. Traditional Runge-Kutta methods for the numerical integration of ODEs are also very suitable for the SLP framework. They have the advantage that they just require an SLP for the evaluation of the polynomials ,\,P>. However, efficient Runge\UKutta schemes are only known up till order 14>: for larger orders, we need many evaluations, so Taylor series methods are currently more efficient. In section we already discussed the generation of SLPs for arithmetic operations with extended precision. The typical application that we had in mind there was to allow for SLPs over numerical domains with a fixed higher precision. 
In order to support integers and rational numbers as domains >, we also need to consider mixed precisions. The same holds when solving some equation over high precision floating point numbers using Newton's method: in this case, it is best to double the working precision at every iteration. The most straightforward way to compute with mixed precisions is to use asimilar interface as or : the precision is either part of the data structure or passed as aparameter. Consider for instance the multiplication of two integers of limbs. For small values of , we may automatically generate a dedicated routine, similarly as in section. We may still do this for \Pnot so small values\Q of , in which case we may unroll Karatsuba's and Toom's more sophisticated multiplication algorithms. For larger , we may recurse down to these base cases for which we generated dedicated implementations. For very high precisions, we may resort to FFTmultiplication, where the appropriate number theoretic transforms are automatically generated using the techniques from section. A similar strategy can be applied for the multiplication of two integers of different sizes in limbs. It is a good question whether the general master routine for integer multiplication should also be generated automatically by the SLP library. It is not unreasonable to require that the implementation of such routines should be done in a separate higher level library. But it might also be interesting if the SLP library could automatically generate afull library like itself. However, this will require further control sugar, since most master routines for arithmetic operations will involve branching on the limb size, potentially using a lookup table. In fact, we may even build and cache the machine code for a certain size only when it is needed. Another mixed precision scenario is the evaluation of an SLP over the integers =\>, where we have bounds for the input arguments. In that case, we can statically determine bounds for the results of all instructions and their sizes in limbs. We may next replace each instruction by a dedicated SLP for the corresponding number of limbs. Consider for instance the computation of the product *\*x> of integers ,\,x> thatall fit into one limb. This can efficiently be done using the technique of : for , we compute \x\x>, \x\x>, \x\x>, \x\x> using an SLP for 1> limb multiplication, \x\x> and \x\x> using an SLP for 2>limb multiplication, and \x\x> using an SLP for 4> limb multiplication. A major shortcoming of existing multiple precision libraries like and is that they do not fully exploit SIMD parallelism. In order to do so, one should start with an extension of the user interface, by providing vector versions of all basic routines. E.g., multiplying integers at a time: \y,\,x\y|)>\,\,x|)>\,\,y|)>>. This is not entirely convenient, because an SIMD routine for doing this requires all > to be of the same limb size and similarly for the >. But there exist many applications for which the sizes of integers are indeed predictable and uniform. The SLP framework naturally allows us to speed up such applications by generating SIMD versions of operations onintegers. Last but not least, we hope that our new SLP libraries will be useful in order to speed up polynomial system solving using the method of . This method can be thought of as a purely algebraic counterpart of homotopy continuation from section, although the technical details are very different. 
It was first developed by Joos and his collaborators from a theoretical complexity standpoint. The second author and collaborators further improved the method and also implemented it. The aim is again to solve a system of polynomial equations(), but we now work over an effective field > instead of >. The solutions typically lie in an extension field > of >. A primitive representation =\/|)>> of > is therefore computed along with the solutions itself, where us a monic irreducible polynomial. For sufficiently generic systems, the degree of coincides with the number of solutions and all solutions can be represented simultaneously a single point in > (each root of giving rise to an actual solution). Unlike numerical homotopy continuation, the geometric resolution method proceeds by an induction over the dimension. Assume that we have a solution of the >dimensional system <\equation> P,\,z,0|)>=\=P,\,z,0|)>=0 over =\/|)>>. Then the idea is to first lift it into a solution of <\equation> P,\,z,t|)>=\=P,\,z,t|)>=0, over |]>>, where is a time parameter as in section. Expansions of order =deg q> in will generically suffice. We next intersect the \Psolution curve\Q ,\,z|)>> of the system() with the hypersurface of zeros of >. It turns out that the resulting equation <\equation*> P,\,z,t|)>=0 can be solved using one big resultant computation. If =\>, then it is also profitable to first reduce the system modulo a small and suitable \Pmachine\Q prime , solve the resulting system over >, Hensel lift the solution to /> \|)>>, and finally obtain the desired solution using rational number reconstruction5.10>. In order to implement this strategy efficiently, the following basic building blocks are particularly crucial: fast univariate polynomial arithmetic over >, fast arithmetic modulo, efficient evaluations of SLPs over \\/> and |]>/|)>>, efficient algorithms for resultants and subresultants, and variants of these building blocks when working over />*\|)>>> instead of >. Now univariate polynomial arithmetic over > or />*\|)>> is very similar to integer arithmetic. Using a similar approach as in the previous subsection, we may benefit from the SLP framework to automatically generate the required routines. If the degree of gets large, then multiplications in =\/> are done using fast Fourier methods. When evaluating an SLP over >, a lot of the required DFTs can often be shared, which significantly reduces the cost. Similarly, reductions modulo can often be postponed to the end. For instance, the 2> matrix product() can be lifted from =\> to =\> as follows: we first convert the input arguments ,\,x> into the FFT representation using DFTs, then perform the 2> matrix multiplication while using the FFT representation, next convert the result back into the usual polynomial representation, and finally reduce modulo . In order to implement such optimizations within our SLP framework, it is convenient to extend the signature> with converters between the polynomial and FFT representations and with operations on the FFT representations. This will allow us to lift() as follows from > to >: <\equation*> ||||||||||||||||||||||||,\,x|)>>||>||>>|\DFT|)>>||\|^>>||\DFT|)>>>|\DFT|)>>||\|^>>||\DFT|)>>>|\DFT|)>>||\|^>>||\DFT|)>>>|\DFT|)>>||\|^>>||\DFT|)>>>|\DFT|)>>||\,,|)>>||\reduce|)>>>|\DFT|)>>||\,,|)>>||\reduce|)>>>|\DFT|)>>||\,,|)>>||\reduce|)>>>|\DFT|)>>||\,,|)>>||\reduce|)>>>|>||>||,x,x,x|)>>>>>> Here elements of > are represented redundantly by polynomials of degree |2*d|\>>, where deg q>. 
The operation reduce computes the remainder of the division by $q$; the input arguments are assumed to be reduced. The map $\mathrm{DFT}\colon \mathbb{A}[u]_{<2d} \to \mathbb{A}^{2d}$ computes a discrete Fourier transform of length $2d$ (assuming that $\mathbb{A}$ has a primitive $2d$-th root of unity), where $\mathbb{A}[u]_{<2d} := \{A \in \mathbb{A}[u] : \deg A < 2d\}$. This map is $\mathbb{A}$-linear and we have $\mathrm{DFT}(A B) = \mathrm{DFT}(A) \mathbin{\hat{*}} \mathrm{DFT}(B)$ for polynomials $A, B$ of degree $< d$. Here $\hat{*}$ and $\hat{+}$ stand for the componentwise lifts of multiplication and addition to $\mathbb{A}^{2d}$.

In fact, the cost of the reductions (which also require FFT multiplications) can be further reduced by switching to a polynomial analogue of the Montgomery representation for elements of $\mathbb{B}$. One may also consider Fourier representations of larger sizes: this increases the cost of individual transforms, but may dramatically reduce the number of transforms that need to be done. We plan to work out the details of these ideas in a forthcoming paper.

As stated in the introduction, our previous papers internally rely on the library. In particular, the benchmarks in these papers provide early evidence that the SLP approach can greatly boost performance. In this section, we mainly report on preliminary timings for the library, for an experimental version that has just been released.

We tested our libraries on the following platforms:

- A desktop computer with an Intel Xeon W-2140B processor. This processor has 8 cores that run at a clock speed of 3.2 GHz (4.2 GHz turbo boost). It supports 512 bit wide SIMD AVX-512 instructions. The machine ships with an AMD Vega 56 graphics card. This GPU has 3584 cores and runs at a clock speed of 1.138 GHz (1.25 GHz boost frequency). Its theoretical peak performance at double precision is 560 GFlops.

- A workstation with an AMD Ryzen 9 7950X3D processor. This processor has 16 cores that run at a clock speed of 4.2 GHz (5.7 GHz turbo boost). It supports 512 bit wide SIMD AVX-512 instructions, although multiplications are "double-pumped". This machine also incorporates an RTX 4070 graphics card. This GPU has 7168 cores and runs at a clock speed of 1.98 GHz (2.475 GHz boost frequency). Its theoretical peak performance at double precision is 554.4 GFlops.

- A laptop with an Apple M3 processor. This processor has 6 performance cores and 6 efficiency cores. It runs at a clock speed of 4.05 GHz and supports 128 bit wide SIMD instructions via ARM Neon.

Our first benchmark concerns the computation of determinants of $n \times n$ matrices. We consider two types of input SLPs for computing this determinant:

- The naive SLP, which adds up all terms when expanding the determinant; a plain C++ sketch of this expansion is shown below. Note that this method requires no divisions, contrary to the asymptotically faster method based on Gaussian elimination.

- The simplification of the naive SLP. The size of this SLP is still exponential as a function of $n$, but much smaller than the naive SLP.

Tables 7.1, 7.2, and 7.3 have been subdivided into two main left-hand and right-hand parts, which correspond to the naive and simplified SLPs, respectively. The first "len" column shows the number of instructions of the input SLP. Each of the other columns reports the timing of a specific operation on the input SLP, in cycles per instruction:

- simplification of the SLP;
- computing the gradient of the SLP;
- lifting the SLP to the complexification $\mathbb{A}[\mathrm{i}]$ of the domain $\mathbb{A}$;
- register allocation, using a fixed number of registers;
- complete machine code generation, including the register allocation;
- executing the generated machine code over (scalar) double precision numbers.
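For concreteness, the following plain C++ routine (not the library's API) computes the determinant in exactly this naive way, by adding up all $n!$ signed products of the Leibniz expansion; recording these operations with an SLP recorder would yield an input SLP of the naive kind, and simplification then shares common subproducts.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Division-free determinant that literally adds up all n! signed products
// of the Leibniz expansion.  Exponential in n, but branch-free and
// division-free, which is what makes it suitable as an SLP.
double det_naive(const std::vector<std::vector<double>>& a) {
  std::size_t n = a.size();
  std::vector<std::size_t> perm(n);
  std::iota(perm.begin(), perm.end(), 0);
  double det = 0.0;
  do {
    int inversions = 0;                        // parity of the permutation
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = i + 1; j < n; ++j)
        if (perm[i] > perm[j]) ++inversions;
    double term = (inversions % 2 == 0) ? 1.0 : -1.0;
    for (std::size_t i = 0; i < n; ++i) term *= a[i][perm[i]];
    det += term;
  } while (std::next_permutation(perm.begin(), perm.end()));
  return det;
}

The length of the recorded SLP grows roughly like $n \cdot n!$, which makes it a convenient stress test for the operations listed above.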
Tables 7.1, 7.2, and 7.3 report timings for our three test platforms Intel, AMD, and Apple, respectively.

[Table 7.1. Timings in cycles per instruction on the Intel test platform for various operations on two types of input SLPs to compute determinants of $n \times n$ matrices.]

[Table 7.2. Timings in cycles per instruction on the AMD test platform for various operations on two types of input SLPs to compute determinants of $n \times n$ matrices.]

[Table 7.3. Timings in cycles per instruction on the Apple test platform for various operations on two types of input SLPs to compute determinants of $n \times n$ matrices.]

One major observation concerning these tables is that all operations on SLPs require between 100 and 1000 cycles per instruction, approximately. The older Intel processor is a bit slower than the more recent AMD and Apple M3 processors. Machine code generation is significantly faster for the M3, which is due to the simpler instruction encoding of this ARM RISC processor.

For small values of $n$, the timings involve a large overhead. This overhead tends to disappear for SLPs with more than 100 instructions. The timings remain relatively stable in the midrange, for SLPs of lengths between 100 and 100000. Very large SLPs cause cache problems and we observe a significant overhead for such SLPs. For input SLPs of midrange lengths, we note that instructions tend to execute in less than one cycle. This is due to out of order execution by multiple execution units, which works particularly well on M3 processors. We also note that all cycle times are reported with respect to the base clock speed.

We recall that the domain for all timings in Tables 7.1, 7.2, and 7.3 is the scalar domain $\mathbb{A} = \mathbb{R}$ of double precision floating point numbers. For many applications, it is possible to work over the SIMD domain $\mathbb{A} = \mathbb{R}^k$ for some $k > 1$, in which case multiple executions are accelerated by a factor of approximately $k$ with respect to $\mathbb{A} = \mathbb{R}$. This allows the Intel and AMD platforms (on which $k = 8$) to make up for their smaller number of execution units with respect to the Apple platform (on which $k = 2$). See section 5.1 for more details on this point and further benchmarks.

Let us finally examine the kind of code that is generated by our SLP simplifier. In Figure 7.1, we show the simplified SLP for computing a $3 \times 3$ determinant, as well as the corresponding machine code on X86 platforms. We observe that the code makes efficient use of fused multiply add instructions. The generated machine code also tends to exploit the indirect addressing mode for arithmetic instructions whose arguments are only required once. For small values of $n$, the simplified SLPs are quite competitive with Gaussian elimination. They have the advantage that they require no divisions and no conditional code to find appropriate pivots.
(x0, x1, x2, x3, x4, x5, x6, x7, x8) := input
x9  := x4 * x8
x9  := fnma(x9, x5, x7)
x10 := x0 * x9
x11 := x3 * x8
x11 := fnma(x11, x5, x6)
x10 := fnma(x10, x1, x11)
x12 := x3 * x7
x12 := fnma(x12, x4, x6)
x10 := fma(x10, x2, x12)
return x10

(Here fma(a, b, c) = a + b*c and fnma(a, b, c) = a - b*c.)

vmovsd       xmm1, qword ptr [rsi + 0x20]
vmovsd       xmm2, qword ptr [rsi + 0x40]
vmulsd       xmm0, xmm1, xmm2
vmovsd       xmm3, qword ptr [rsi + 0x28]
vmovsd       xmm4, qword ptr [rsi + 0x38]
vfnmadd231sd xmm0, xmm3, xmm4
vmulsd       xmm5, xmm0, qword ptr [rsi]
vmovsd       xmm8, qword ptr [rsi + 0x18]
vmulsd       xmm7, xmm8, xmm2
vmovsd       xmm9, qword ptr [rsi + 0x30]
vfnmadd231sd xmm7, xmm3, xmm9
vfnmadd231sd xmm5, xmm7, qword ptr [rsi + 8]
vmulsd       xmm11, xmm8, xmm4
vfnmadd231sd xmm11, xmm1, xmm9
vfmadd231sd  xmm5, xmm11, qword ptr [rsi + 0x10]
vmovsd       qword ptr [rdi], xmm5
ret

Figure 7.1. A simplified SLP for computing $3 \times 3$ determinants and the generated X86 machine code.

One major motivation for our work on SLP libraries is the development of a fast tool for numerical homotopy continuation at double precision. We did a first prototype implementation, based on the library. This experimental code uses it in a complex way and is not suitable for public distribution. The SIMD width is limited to 256 bits (four double precision numbers), but we implemented a second order predictor-corrector scheme, with carefully fine-tuned parameters. We recently started a C++ reimplementation inside the library. This code supports the full 512 bit SIMD width (eight double precision numbers), but, so far, we only implemented the most straightforward first order stepper.

In Table 7.4, we report timings for the classical Katsura-$n$ benchmark. We also consider a few dense random systems Posso of dimension $n$ and total degree $d$. The first "sols" column shows the total number of solutions. The columns labeled "jit" indicate the time spent on building the machine code. The columns labeled "exe$_T$" show the actual time that was needed to solve the system using $T$ threads.

[Table 7.4. Solving polynomial systems using homotopy continuation at double precision.]

The main benefit of switching to our lower level C++ implementation is a dramatic speed-up (between 10 and 100, approximately) for the building time of the machine code: the major problem for our first implementation is indeed that the building time exceeds the parallel execution time for systems with up to around 100000 solutions. Nonetheless, despite the halved SIMD width, the running times for the solver itself are very competitive. This shows that there is still room for algorithmic improvements in our present implementation. More extensive benchmarking reveals that the performance scales well with the number of threads and the SIMD width on our test platforms.
The last two columns of Table 7.4 show that this good scaling behaviour even holds when passing from 6 to 12 threads on the M3 processor: for the particular application of path tracking, efficiency cores perform almost as well as performance cores. Curiously, on one of the other platforms, our implementation suffers from a large overhead for systems with fewer than 1000 solutions; we are still investigating the reason for this.

One major objective of the library is to also support GPUs besides CPUs. Work in this direction has only started recently with the implementation of a rudimentary interface. In Table 7.5, we report some first timings. We note that the overhead of the compiler is very large, typically between 100 and 1000. As a consequence, the building time dominates the cost except for very large systems. We also observed disappointing execution times, despite the "embarrassingly parallel" nature of our application. This is partly due to the fact that we work with double instead of single precision. But we are still far from the theoretical peak performance, even for double precision. The GPU interface that we used also seems to be a suboptimal choice; on some cards, a more native interface would probably be more efficient.

[Table 7.5. Comparative timings for resolution on CPUs and GPUs. The subscript $T$ of exe$_T$ indicates the number of threads that we are using.]

Bibliography

A.Ahlbäck, J.vander HoevenG.Lecerf. , 2025. A.AhlbäckF.Johansson. Fast basecases for arbitrary-size multiplication. , 53\U60. 2025. A.V.Aho, M.S.Lam, R.SethiJ.D.Ullman. . Pearson Education, 2nd, 2007. Georg,K.Allgower, E. L. . Springer-Verlag, 1990. ANSI/IEEE. IEEE standard for binary floating-point arithmetic. , ANSI/IEEE, New York, 2008. ANSI-IEEE Standard 754-2008. Revision of IEEE 754-1985, approved on June 12, 2008 by IEEE Standards Board. ARM corporation. Fast models user guide. . Version 11.29. M.Aschenbrenner, L.vanden DriesJ.vander Hoeven. . 195Annals of Mathematics studies. Princeton University Press, 2017. AsmJit: low-latency machine code generation. . Autodiff: automatic differentiation in C++ couldn't be simpler. . Torch.autograd: automatic differentiation package in pytorch. . Awesome-JIT. . Ressources for JIT compilation. D.H.Bailey, Y.HidaX.S.Li. Algorithms for quad-double precision floating point arithmetic. , 155\U162. IEEE, 2001. H.G.Baker. Computing A*B (mod N) efficiently in ANSI C. , 27(1):95\U98, 1992. W.BaurV.Strassen. The complexity of partial derivatives. , 22:317\U330, 1983. Blas (basic linear algebra subprograms). . J.L.Bordewijk. Inter-reciprocity applied to electrical networks. , 6:1\U74, 1956. A.Bostan, G.LecerfÉ.Schost. Tellegen's principle into practice. , ISSAC '03, 37\U44. New York, NY, USA, 2003. ACM Press. J.Bradbury, N.DruckerM.Hillenbrand. NTT software optimization using an extended Harvey butterfly. , 2021. . P.Briggs, K.D.CooperL.Torczon. Rematerialization. , PLDI '92, 311\U321. New York, NY, USA, 1992. ACM. N.Bruno, J.Heintz, G.MateraR.Wachenchauzer.
Functional programming concepts and straight-line programs in computer algebra. , 60(6):423\U473, 2002. P.Bürgisser, M.ClausenM.A.Shokrollahi. . Springer-Verlag, Berlin, 1997. B.Castaño, J.Heintz, J.LlovetR.Martìnez. On the data structure straight-line program and its implementation in symbolic computation. , 51(5):497\U528, 2000. G.J.Chaitin. Register allocation & spilling via graph coloring. , 17(6):98\U101, 1982. G.J.Chaitin, M.A.Auslander, A.K.Chandra, J.Cocke, M.E.HopkinsP.W.Markstein. Register allocation via coloring. Comput. Lang.>, 6(1):47\U57, 1981. T.J.Dekker. A floating-point technique for extending the available precision. , 18(3):224\U242, 1971. A.DíazE.Kaltofen. : a system for manipulating symbolic objects in black box representation. , ISSAC '98, 30\U37. New York, NY, USA, 1998. ACM. J.-G.Dumas, T.Gautier, M.Giesbrecht, P.Giorgi, B.Hovinen, E.Kaltofen, B.D.Saunders, W.J.TurnerG.Villard. LinBox: a generic library for exact linear algebra. A.M.Cohen, X.S.GaoN.Takayama, , 40\U50. World Scientific, 2002. . J.-G.Dumas, P.GiorgiC.Pernet. FFPACK: finite field linear algebra package. J.Schicho, , ISSAC '04, 119\U126. New York, NY, USA, 2004. ACM. J.-G.Dumas, P.GiorgiC.Pernet. Dense linear algebra over word-size prime fields: the FFLAS and FFPACK packages. , 35(3):19\U1, 2008. C.DurvyeG.Lecerf. A concise proof of the Kronecker polynomial system solver from scratch. , 26(2):101\U139, 2008. T.EdamatsuD.Takahashi. Accelerating large integer multiplication using intel AVX-512IFMA. , 60\U74. Springer-Verlag, 2019. Ch.M.Fiduccia. . , Brown University, 1973. N.Fitchas, M.GiustiF.Smietanski. Sur la complexité du théorème des zéros. M.FlorenzanoJ.Guddat, , 8, 274\U329. Frankfurt am Main, 1995. Lang. A.Fog. Software optimization resources. . L.Fousse, G.Hanrot, V.Lefèvre, P.PélissierP.Zimmermann. MPFR: a multiple-precision binary floating-point library with correct rounding. , 33(2), 2007. Software available at . T.S.Freeman, G.M.Imirzian, E.KaltofenY.Lakshman. DAGWOOD: a system for manipulating polynomials given by straight-line programs. , 14:218\U240, 1988. M.Frigo. A fast Fourier transform compiler. , 34, 169\U180. ACM, May 1999. M.FrigoS.G.Johnson. The design and implementation of FFTW3. , 93(2):216\U231, 2005. M.Frigo, C.E.Leiserson, H.ProkopS.Ramachandran. Cache-oblivious algorithms. , 285\U297. IEEE, 1999. J.vonzur GathenJ.Gerhard. . Cambridge University Press, 3rd, 2013. M.Giusti, K.Hägele, J.Heintz, J.L.Montaña, J.E.MoraisL.M.Pardo. Lower bounds for diophantine approximations. , 117/118:277\U317, 1997. M.Giusti, K.Hägele, G.Lecerf, J.MarchandB.Salvy. The projective Noether Maple package: computing the dimension of a projective variety. , 30(3):291\U307, 2000. M.GiustiJ.Heintz. La détermination des points isolés et de la dimension d'une variété algébrique peut se faire en temps polynomial. D.EisenbudL.Robbiano, , Sympos. Math., XXXIV, 216\U256. Cambridge Univ. Press, 1993. M.Giusti, J.Heintz, J.E.Morais, J.MorgensternL.M.Pardo. Straight-line programs in geometric elimination theory. , 124(1-3):101\U146, 1998. M.Giusti, J.Heintz, J.E.MoraisL.M.Pardo. When polynomial equation systems can be ``solved'' fast? G.Cohen, M.GiustiT.Mora, , 948, 205\U231. Springer-Verlag, 1995. M.Giusti, J.Heintz, J.E.MoraisL.M.Pardo. Le rôle des structures de données dans les problèmes d'élimination. , 325(11):1223\U1228, 1997. M.Giusti, J.HeintzJ.Sabia. On the efficiency of effective Nullstellensätze. , 3(1):56\U95, 1993. M.Giusti, G.LecerfB.Salvy. A Gröbner free alternative for polynomial system solving. 
, 17(1):154\U211, 2001. T.Granlund etal. GMP, the GNU multiple precision arithmetic library. 1991. . S.GueronV.Krasnov. Accelerating big integer arithmetic using intel IFMA extensions. , 32\U38. 2016. K.Hägele, J.E.Morais, L.M.PardoM.Sombra. On the intrinsic complexity of the arithmetic Nullstellensatz. , 146(2):103\U183, 2000. G.Hanrot, M.QuerciaP.Zimmermann. The middle product algorithmI. Speeding up the division and square root of power series. , 14:415\U438, 2004. W.Hart, F.JohanssonS.Pancratz. FLINT: Fast Library for Number Theory. 2013. Version 2.4.0, . D.Harvey. Faster arithmetic for number-theoretic transforms. , 60:113\U119, 2014. J.HeintzC.-P.Schnorr. Testing polynomials which are easy to compute. , 30, 237\U254. Geneva, 1982. Univ. Genève. Jvander Hoeven. The truncated Fourier transform and applications. J.Schicho, , ISSAC '04, 290\U296. New York, NY, USA, 2004. ACM. J.vander Hoeven. Relax, but don't be too lazy. Symbolic Comput.>, 34:479\U542, 2002. J.vander Hoeven. Notes on the Truncated Fourier Transform. 2005-5, Université Paris-Sud, Orsay, France, 2005. J.vander Hoeven. Ball arithmetic. , HAL, 2009. . J.vander Hoeven. Ball arithmetic. A.Beckmann, Ch.GaÿnerB.Löwe, , 6Preprint-Reihe Mathematik, 179\U208. Ernst-Moritz-Arndt-Universität Greifswald, 2010. J.vander Hoeven. Calcul analytique. , 1, 1\U85. CIRM, 2011. . J.vander Hoeven. Reliable homotopy continuation. , HAL, 2011. . J.vander Hoeven. Faster relaxed multiplication. , ISSAC '14, 405\U412. New York, NY, USA, 2014. ACM. J.vander Hoeven. Multiple precision floating-point arithmetic on SIMD processors. , 2\U9. July 2017. J.vander Hoeven. . Scypress, 2020. J.vander Hoeven, R.LarrieuG.Lecerf. Implementing fast carryless multiplication. , 10693, 121\U136. Springer International Publishing, 2017. J.vander HoevenG.Lecerf. Interfacing Mathemagix with C++. , ISSAC '13, 363\U370. New York, NY, USA, 2013. ACM. J.vander HoevenG.Lecerf. . HAL, 2013. . J.vander HoevenG.Lecerf. Faster FFTs in medium precision. , 75\U82. June 2015. J.vander HoevenG.Lecerf. Evaluating straight-line programs over balls. P.Montuschi, M.Schulte, J.Hormigo, S.ObermanN.Revol, , 142\U149. IEEE, 2016. J.vander HoevenG.Lecerf. On the complexity exponent of polynomial system solving. , 21:1\U57, 2021. J.vander HoevenG.Lecerf. Implementing number theoretic transforms. , HAL, 2024. . J.vander Hoeven, G.LecerfG.Quintin. Modular SIMD arithmetic in Mathemagix. , 43(1):5\U1, 2016. J.vander Hoeven etal. Mathemagix. 2002. . Intel corporation. Intel\ 64 and IA-32 architectures software developer's manual. . F.Johansson. Arb: a C library for ball arithmetic. , 47(3/4):166\U169, 2014. M.Jolde³, O.Marty, J.-M.MullerV.Popescu. Arithmetic algorithms for extended precision using floating-point expansions. , 65(4):1197\U1210, 2015. M.Joldes, J.-M.Muller, V.PopescuW.Tucker. CAMPARY: cuda multiple precision arithmetic library and applications. , 232\U240. Cham, 2016. Springer. Juliadiff: differentiation tools in Julia. . V.KabanetsR.Impagliazzo. Derandomizing polynomial identity tests means proving circuit lower bounds. , 13:1\U46, 2004. E.KaltofenB.M.Trager. Computing with polynomials given by black boxes for their evaluations: greatest common divisors, factorization, separation of numerators and denominators. Symb. Comput.>, 9(3):301\U320, 1990. A.KaratsubaJ.Ofman. Multiplication of multidigit numbers on automata. , 7:595\U596, 1963. R.B.Kearfott. An interval step control for continuation methods. , 31(3):892\U914, 1994. R.Larrieu. 
The truncated Fourier transform for mixed radices. M.Burr, , ISSAC '17, 261\U268. New York, NY, USA, 2017. ACM. G.Lecerf. Computing an equidimensional decomposition of an algebraic variety by means of geometric resolutions. , ISSAC '00, 209\U216. New York, NY, USA, 2000. ACM. G.Lecerf. |\>>es algébriques>. , École polytechnique, 2001. G.Matera. Probabilistic algorithms for geometric elimination. , 9(6):463\U520, 1999. P.Mihăilescu. Fast convolutions meet montgomery. , 77(262):1199\U1221, 2008. O.Møller. Quasi double-precision in floating point addition. , 5:37\U50, 1965. P.L.Montgomery. . , U. of California at Los Angeles, 1992. R.E.Moore. . Prentice Hall, Englewood Cliffs, N.J., 1966. R.E.Moore, R.B.KearfottM.J.Cloud. . Society for Industrial and Applied Mathematics, 2009. C.D.MurrayR.R.Williams. On the (non) NP-Hardness of computing circuit complexity. , 13(4):1\U22, 2017. H.J.NussbaumerP.Quandalle. Computation of convolutions and discrete Fourier transforms by polynomial transforms. , 22(2):134\U144, 1978. LLVM ORC design and implementation. . M.PolettoV.Sarkar. Linear scan register allocation. , 21(5):895\U913, 1999. J.M.Pollard. The fast Fourier transform in a finite field. , 25(114):365\U374, 1971. W.H.Press, S.A.Teukolsky, W.T.VetterlingB.P.Flannery. . Cambridge University Press, 3rd, 2007. D.M.Priest. . , University of California at Berkeley, 1992. M.Püschel, J.M.F.Moura, J.Johnson, D.Padua, M.Veloso, B.Singer, J.Xiong, F.Franchetti, A.Gacic, Y.Voronenko, K.Chen, R.W.JohnsonN.Rizzolo. SPIRAL: code generation for DSP transforms. , 93(2):232\U275, 2005. D.S.Roche. What can (and can't) we do with sparse polynomials? C.Arreche, , ISSAC '18, 25\U30. New York, NY, USA, 2018. ACM Press. S.M.Rump. Verification methods: rigorous results using floating-point arithmetic. , 19:287\U449, 2010. A.ShpilkaA.Yehudayoff. Arithmetic circuits: a survey of recent results and open questions. , 5(3\U4):207\U388, 2010. A.J.Sommese, J.VerscheldeC.W.Wampler. , 301\U337. Springer, 2005. D.Takahashi. An implementation of parallel number-theoretic transform using Intel AVX-512 instructions. F.Boulier, M.England, T.M.SadykovE.V.Vorozhtsov, , 13366, 318\U332. Springer, Cham, 2022. A.L.Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. , 4(2):714\U716, 1963.