
Abstract
Interval arithmetic achieves numerical reliability for a wide range of applications, at the price of a performance penalty. For applications to homotopy continuation, one key ingredient is the efficient and reliable evaluation of complex polynomials represented by straightline programs. This is best achieved using ball arithmetic, a variant of interval arithmetic. In this article, we describe strategies for reducing the performance penalty of basic operations on balls. We also show how to bound the effect of rounding errors at the global level of evaluating a straightline program. This allows us to introduce a new and faster “transient” variant of ball arithmetic.
Keywords: ball arithmetic, polynomial evaluation, software implementation
Interval arithmetic is a classical tool for making numerical computations reliable, by systematically computing interval enclosures for the desired results instead of numerical approximations [1, 10, 11, 13, 14, 17, 24]. Interval arithmetic has been applied with success in many areas, such as the resolution of systems of equations, homotopy continuations, reliable integration of dynamical systems, etc. There exist several variants of interval arithmetic such as ball arithmetic, interval slope arithmetic, Taylor models, etc. Depending on the context, these variants may be more efficient than standard interval arithmetic and/or provide tighter enclosures.
In this paper, we will mainly focus on ball arithmetic (also known as midpointradius interval arithmetic), which is particularly useful for reliable computations with complex numbers and multiple precision numbers [5]. One of our main motivations is the implementation of reliable numerical homotopy methods for polynomial system solving [2, 3, 6, 27, 28, 29]. One basic prerequisite for this project concerns the efficient and reliable evaluation of multivariate polynomials represented by so called straightline programs (SLPs).
Two classical disadvantages of interval arithmetic and its variants are the additional computational overhead and possible overestimation of errors. There is a tradeoff between these two evils: it is always possible to reduce the overestimation at the expense of a more costly variant of interval arithmetic (such as high order Taylor models). For a fixed variant, the computational overhead is usually finite, but it remains an important issue for high performance applications to reduce the involved constant factors as much as possible. In this paper, we will focus on this “overhead problem” in the case of basic ball arithmetic (and without knowledge about the derivatives of the functions being evaluated).
About ten years ago, we started the implementation of a basic C++
template library for ball arithmetic in the
In this article we investigate various strategies to reduce the overhead involved when evaluating straightline programs over balls. Our point of view is pragmatic and directed towards the development of more efficient implementations. We will not turn around the fact that the development of efficient ball arithmetic admits a quite technical side: the optimal answer greatly depends on available hardware features. We will consider on the following situations, encountered for modern processors:
Without specific IEEE 754 compliant hardware (which is frequently the case for GPUs), we may only assume faithful rounding, specifying a bound on relative errors for each operation, and that errors can be thrown on numerical exceptions.
For recent
Recent AVX512 processors propose floating point instructions which directly incorporate a rounding mode, thereby eliminating all need to switch rounding modes via the status register and all resulting delays caused by broken pipelines.
As a general rule, modern processors have also become highly parallel. For this reason, it is important to focus on algorithms that can easily be vectorized. The first contribution of this paper concerns various strategies for ball arithmetic as a function of the available hardware. From the conceptual point of view, this will be done in Section 2, whereas additional implementation details will be given in Section 4.
Our second main contribution is the introduction of transient ball arithmetic in Section 2.5 and its application to the evaluation of SLPs in Section 3. The idea is to not bother about rounding errors occurring when computing centers and radii of individual balls, which simplifies the implementation of the basic ring operations. Provided that the radii of the input balls are not too small and that neither overflow nor underflow occur, we will show that the cumulated effect of the ignored rounding errors can be bounded for the evaluation of a complete SLP. More precisely, we will show that the relative errors due to rounding are essentially dominated by the depth of the computation. Although it is most convenient to apply our result to SLPs, much of it can be generalized to arbitrary programs, by regarding the trace of the execution as an SLP. For such generalizations, the main requirement is to have an a priori bound for the (parallel) depth of the computation.
Most of our strategies have been implemented inside the
In the context of real midpointradius interval arithmetic, and for specific calculations, such as matrix products, several authors proposed dedicated solutions that mostly perform operations on centers and then rely on fast bounds on radii [18, 19, 20, 23, 25]. In this case, the real gain comes from the ability to exploit HPC (high performance computing) solutions to linear algebra, and similar tricks sometimes apply to classical interval methods [16, 21].
The use of SIMD instructions for intervals started in [30], and initially led to a rather modest speedup, according
to the author. Changing the rounding mode for almost each arithmetic
operation seriously slows down computations, and might involve a serious
stall of a hundred of cycles by breaking FPU pipelines of
some hardware such as
Throughout this paper, we denote by the set of machine floating point numbers. We let be the machine precision (which corresponds to the number of fractional bits of the mantissa plus one), and let and be the minimal and maximal exponents (included). For IEEE 754 double precision numbers, this means that , and . We enlarge with symbols , , and , with their standard meaning.
For our basic arithmetic, we allow for various rounding modes written . The first three rounding modes correspond to IEEE arithmetic with correct rounding (downwards, to the nearest and upwards). The rounding mode corresponds to faithful rounding without any well specified direction. For , we write for the approximation of in with the specified rounding mode . The quantity stands for the corresponding rounding error, that may be . Given a single operation , it will be convenient to write for . For compound expressions , we will also write for the full evaluation of using the rounding mode . For instance, .
Modern processors usually support fusedmultiplyadd (fma) and fusedmultiplysubtract (fms) instructions, both for scalar and SIMD vector operands: and represent the roundings of and according to the mode . For generalities about these instructions, we refer the reader to the book [15] for instance.
We will denote by any upper bound function for that is easy to compute. In absence of underflows/overflows, we claim that we may take , with for and . Indeed, given and , let be the exponent of , so that and . Then for all rounding modes, and .
If we want to allow for underflows during computations, we may safely take , where is the smallest positive subnormal number in . Notice that if , then we may still take since no underflow occurs in that special case. Underflows and overflows will be further discussed in Section 3.3.
In order to describe our algorithms in a flexible context, denotes a Banach algebra over , endowed with a norm written . The most common examples are and with for all .
For actual machine computations, we will denote by
the counterpart of when
is replaced by . For instance, if , then we may take . In other words,
if is the
For any rounding mode , the notations from the previous section naturally extend to . For instance, . Similarly, if , then . For complex arithmetic we consider the following implementation:
As for we may clearly take in absence of underflows/overflows, and in general.
Remark
Given and , we write for the closed ball with center and radius . The set of such balls is denoted by . One may lift the ring operations in to balls in , by setting:
These formulas are simplest so as to satisfy the so called inclusion principle: given , and , we have . This arithmetic for computing with balls is called exact ball arithmetic. It extends to other operations that might be defined on , as long as the ball lifts of operations satisfy the inclusion principle. For instance, we may implement fusedmultiplyadd and fusedmultiplysubtract using
We will denote by the set of balls with centers in and radii in . In order to implement machine ball arithmetic for , we need to adjust formulas from the previous section so as to take rounding errors into account. Indeed, the main constraint is the inclusion principle, which should still hold for each operation in such an implementation. The best way to achieve this depends heavily on rounding modes used for operations on centers and radii.
On processors that feature IEEE 754 style rounding, it is most natural to perform operations on centers using any rounding mode (and preferably rounding to the nearest), and operations on radii using upward rounding. This leads to the following formulas:
For instance, when using of the form in the last case of fma/fms, this means that three additional instructions are needed with respect to the exact arithmetic from the previous subsection:
In practical implementations, unless , the radius computations involve many roundings. The correctness of the above formulas is justified by the fact that we systematically use upward rounding for all bound computations; the rounding error for the single operation on the centers is captured by .
Remark
Another approach is to use the same rounding mode for computations on centers and radii. If we take as the sole rounding mode, then we may directly apply the above formulas, but the quality of computations with centers degrades. If we take , then we have to further adjust the radii so as to take into account the additional rounding errors that might occur during radius computations. For instance, if , in the cases of addition and subtraction, and in absence of underflows/overflows, we may use
since we have . This arithmetic, sometimes called rough ball arithmetic, fits one of the main recommendations of [22]: “Get free from the rounding mode by bounding, roughly but robustly, errors with formulas independent of the rounding mode if needed.”
The adjustments which where needed above in order to counter the problem of rounding errors induce a non trivial computational overhead, despite the fact that these adjustments are usually very small. It is interesting to consider an alternative transient ball arithmetic for which we simply ignore all rounding errors. In the next Section 3, we will see that it is often possible to treat these rounding errors at a more global level for a complete computation instead of a single operation.
For any rounding mode , we will denote by the corresponding “rounding mode” for transient ball arithmetic; the basic operations are defined as follows:
Of course, these formulas do not satisfy the inclusion principle. On processors that allow for efficient switching of rounding modes, it is also possible to systematically use upward rounding for radius computations, with resulting simplifications in the error analysis below.
Let us fix a rounding mode . We assume that we are given a suitable floating point number in such that and
hold for all and , in absence of underflows and overflows. If , then we may take . If , and assuming that complex arithmetic is implemented using (1–3), then it can be checked that we may take (see Appendix A). The following lemma provides an error estimate for the radius computation of a transient ball product.
Lemma
involves no underflow or overflow, we have .
Proof. We have
A straightline program over a ring is a sequence of instructions of the form
where are variables in a finite ordered set , constants in , and . Variables that appear for the first time in the sequence in the righthand side of an instruction are called input variables. A distinguished subset of the set of all variables occurring at the lefthand side of instructions is called the set of output variables. Variables which are not used for input or output are called temporary variables and determine the amount of auxiliary memory needed to evaluate the program. The length of the sequence is called the length of .
Let be the input variables of and the output variables, listed in increasing order. Then we associate an evaluation function to as follows: given , we assign to for , then evaluate the instructions of in sequence, and finally read off the values of , which determine .
To each instruction , one may associate the remaining path lengths as follows. Let , and assume that have been defined for some . Then we take , where are those indices such that is of the form with and . If no such indices exist, then we set . Similarly, for each input variable we define , where are those indices such that is of the form with and . We also define , where are the input variables of and all indices such that is of the form .
Example
Consider the “semiexact” ball arithmetic in which all computations on centers are done using a given rounding mode and all computations on radii are exact. More precisely, we take
for any and . We wish to investigate how far the transient arithmetic from Section 2.5 can deviate from this idealized arithmetic (which satisfies the inclusion principle). We will write for the th harmonic number and for all .
Theorem
where
If no underflow or overflow occurs during the second evaluation, then for all in the output of the first evaluation with corresponding entry for the second evaluation, we have .
Proof. It will be convenient to systematically use the star superscript for the semiexact radius evaluation and no superscript for the transient evaluation.
Let be the ball value of the variable just after evaluation of using transient ball arithmetic. Let us show by induction over that
So assume that the hypothesis holds for all strictly smaller values of . If is of the form , then we are done by assumption since . If is of the form , then we claim that contains a ball with
Indeed, this holds by assumption if is an input variable. Otherwise, let be largest with . Then by the construction of , whence our claim follows by the induction hypothesis. In a similar way, contains a ball with
Having shown our claim, let us first consider the case when . Then we get
Using the inequalities , , and the fact that , we obtain:
as desired. In the same way, if , then
whence, using Lemma 3, and the fact that ,
which completes our claim by induction.
For all , we introduce
so that . Using a second induction over , let us next prove that
Assume that this inequality holds up until order . If is of the form , then we are done by the fact that .
If is of the form , then let be largest with and . Then , and
In particular, . With and as above, it follows that
If is of the form , then thanks to Lemma 3, a similar computation yields
For fixed and , we observe that . The value of the parameter may be optimized for given SLPs and inputs. Without entering details, should be taken large when inputs are accurate, and small when inputs are rough, as encountered for instance within subdivision algorithms.
In Theorem 5, we have assumed the absence of underflows and overflows. There are several strategies for managing such exceptions, whose efficiencies heavily depend on the specific kind of hardware being used.
One strategy to manage exceptions is to use the status register of an IEEE 754 FPU. The idea is to simply clear the underflow and overflow flags of the status register, then perform the evaluation, and finally check both flags at the end. Whenever an overflow occurred during the evaluation, then we may set the radii of all results to plus infinity, thereby preserving the inclusion principle. If an underflow occurred (which is quite unlikely), then we simply reevaluate the entire SLP using a more expensive, fully certified ball arithmetic.
When performing all computations using the IEEE 754 roundingtonearest mode , we also notice that consulting the status register can be avoided for managing overflows. Indeed, denoting by the largest positive finite number in , we have for all and for all . Consequently, whenever a computation on centers overflows, the corresponding radius will be set to infinity.
From now on, we assume the absence of overflows and we focus on underflows. In addition to the constant , we suppose given an other positive constant in such that
for all and . For instance, if , then we may take . If , and assuming the arithmetic from Section 2.2, then it is safe to take (see Appendix A). In addition to the method based on the status register, the following strategies can be used for managing underflows.
This arithmetic makes Theorem 5 hold without the assumption on the absence of underflows, and provided that
(13) 
Indeed, inequalities (5), (8), and (9) immediately hold, even without extra factors . It remains to prove that (6) also holds. From we have . Therefore, if , then holds, whence inequality (6). Otherwise , and the extra assumption (13) directly implies because always holds by construction.
No corrections are necessary for additions and subtractions which never provoke underflows. As to multiplication, Lemma 3 admits Lemma 6 below as its analogue in the case when . Using this, it may again be shown that Theorem 5 holds without the assumption on the absence of underflows, and provided that
(14) 
Indeed, inequalities (5), (8), and (9) immediately hold. It remains to prove that (6) also holds. The case is handled as for the previous strategy. Otherwise, we have , and the extra assumption (14) directly implies because holds by construction.
Lemma
Proof. We have
Remark
Our new algorithms have been implemented in
Our implementation is divided into
JIT compilation, is a classical technique, traditionally
used in scientific computing: when an expression such as a SLP needs to
be evaluated at many points, it pays off to compile the expression and
then perform the evaluations using fast executable code. When allowed by
the operating system, it suffices to compile SLPs into executable memory
regions, without temporary files. Since SLPs are very basic programs,
there is no special need to appeal to general purpose compilers. In
To develop a confortable JIT library dedicated to SLPs, we turned to the
In order to estimate the concrete overhead of ball arithmetic, we first
focus on timings for double and complex<double>.
In the rest of this article, timings are measured on a platform equipped
with an
In Table 1, column “double” displays timings for evaluating a multivariate polynomial over double, with variables, made of 100 terms, built from random monomials of partial degrees at most 10. We build the SLP using a dedicated algorithm implemented in multimix/dag_polynomial.hpp. Support for this specific coefficient type may be found in multimix/slp_polynomial_double.hpp. This example, with the corresponding functions to reproduce these timings are available in multimix/bench/slp_polynomial_bench.cpp. The evaluation of this SLP takes 1169 products and 100 sums.
The first row corresponds to using the naive interpreted evaluation
available by default from
The third row concerns JIT compilation from
The fourth row is for our JIT implementation in the
The second column of Table 1 shows similar timings for the
complex<double> type from the
We are now interested in measuring the speedup of our evaluation
strategies over balls. The first column of Table 2 shows
timings for balls over double (i.e. ). Early versions of the
Notice that we carefully tuned the assembly code generated by our SLP compiler. For instance, if ymm0 contains and if ymm1 contains a center , then is obtained as vxorpd ymm0 ymm1 ymm2, and as vandnpd ymm0 ymm1 ymm2. The latency and throughput of both these instructions are a single cycle and no branchings are required to compute . In this way, each transient addition/subtraction takes 2 cycles, and each product 6 cycles. For rough arithmetic this increases to 5 and 9 cycles respectively. The gain of the transient arithmetic is thus well reflected in practice, since our example essentially performs products. Comparing to Table 1, we observe that transient arithmetic is just about 4 times slower than numeric arithmetic. This turns out to be competitive with interval arithmetic, where each interval product usually requires 8 machine multiplications and 6 min/max operations.
The second column of timings in Table 2 is similar to the first one, but with balls over complex<double>. The computation of norms is expensive in this case because the scalar square root instruction takes 14 CPU cycles. In order to reduce this cost, we rewrite SPLs so that norms of products are computed as products of norms. Taking care of using or simulating upwards rounding, this involves a slight loss in precision but does not invalidate the results. At the end, comparing to Table 1, we are glad to observe that our new transient arithmetic strategy is only about twice as expensive as standard numeric evaluations.


Table 2. Polynomial evaluation with 100 terms of degree 10 in 10 variables, in μs. 
In order to profit from SIMD technologies, we also
implemented vectorization in our
In this paper, we have shown how to implement highly efficient ball
arithmetic dedicated to polynomial evaluation. In the near future, we
expect significant speedups in our numerical system solvers implemented
within
It is interesting to notice in Table 2 that the code generated by our rather straightforward JIT compiler for SLPs is more efficient than the code produced by GCC. This suggests that it might be worthwhile to put more efforts into the development of specific JIT compilers for SLPs dedicated to high performance computing.
We also plan to adapt the present results to standard intervals, that are useful for real solving. In that case, we replace input and constant intervals by intervals of the form , where for suitable . We next proceed as usual, but without any assumption on the rounding mode.
We finally notice that interval arithmetic benefits from hardware accelerations on many current processors, as soon as IEEE 754 style rounding is integrated in an efficient manner. An interesting question is whether ball arithmetic might benefit from similar hardware accelerations. In particular, is it possible to integrate rounding errors more efficiently into error bounds (the radii of balls)? This might for instance be achieved using an accumulateroundingerror instruction for computing a guaranteed upper bound for as a function of .
[1] G. Alefeld and J. Herzberger. Introduction to interval analysis. Academic Press, New York, 1983.
[2] D. Bates, J. Hauenstein, A. Sommese, and C. Wampler. Bertini: software for numerical algebraic geometry. http://www.nd.edu/~sommese/bertini/, 2006.
[3] C. Beltrán and A. Leykin. Certified numerical homotopy tracking. Experiment. Math., 21(1):69–83, 2012.
[4] F. Goualard. Fast and correct SIMD algorithms for interval arithmetic. Technical Report, Université de Nantes, 2008. https://hal.archivesouvertes.fr/hal00288456.
[5] J. van der Hoeven. Ball arithmetic. In A. Beckmann, Ch. Gaßner, and B. Löwe, editors, Logical approaches to Barriers in Computing and Complexity, number 6 in PreprintReihe Mathematik, pages 179–208. ErnstMoritzArndtUniversität Greifswald, 2010. International Workshop.
[6] J. van der Hoeven. Reliable homotopy continuation. Technical Report, CNRS & École polytechnique, 2011. http://hal.archivesouvertes.fr/hal00589948.
[7] J. van der Hoeven and G. Lecerf. Interfacing Mathemagix with C++. In M. Monagan, G. Cooperman, and M. Giesbrecht, editors, Proc. ISSAC '13, pages 363–370. ACM, 2013.
[8] J. van der Hoeven and G. Lecerf. Mathemagix User Guide. CNRS & École polytechnique, France, 2013. http://hal.archivesouvertes.fr/hal00785549.
[9] J. van der Hoeven, G. Lecerf, B. Mourrain et al. Mathemagix. 2002. http://www.mathemagix.org.
[10] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter. Applied interval analysis. Springer, London, 2001.
[11] U. W. Kulisch. Computer Arithmetic and Validity. Theory, Implementation, and Applications. Number 33 in Studies in Mathematics. De Gruyter, 2008.
[12] B. Lambov. Interval arithmetic using SSE2. In P. Hertling, C. M. Hoffmann, W. Luther, and N. Revol, editors, Reliable Implementation of Real Number Algorithms: Theory and Practice, volume 5045 of Lect. Notes Comput. Sci., pages 102–113. Springer Berlin Heidelberg, 2008.
[13] R. E. Moore. Interval Analysis. Prentice Hall, Englewood Cliffs, N.J., 1966.
[14] R. E. Moore, R. B. Kearfott, and M. J. Cloud. Introduction to Interval Analysis. SIAM Press, 2009.
[15] J.M. Muller, N. Brisebarre, F. de Dinechin, C.P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres. Handbook of FloatingPoint Arithmetic. Birkhäuser Boston, 2010.
[16] Hong Diep Nguyen. Efficient algorithms for verified scientific computing: numerical linear algebra using interval arithmetic. PhD thesis, École Normale Supérieure de Lyon, Université de Lyon, 2011.
[17] A. Neumaier. Interval methods for systems of equations. Cambridge university press, Cambridge, 1990.
[18] T. Ogita and S. Oishi. Fast inclusion of interval matrix multiplication. Reliab. Comput., 11(3):191–205, 2005.
[19] K. Ozaki, T. Ogita, and S. Oishi. Tight and efficient enclosure of matrix multiplication by using optimized BLAS. Numer. Linear Algebra Appl., 18:237–248, 2011.
[20] K. Ozaki, T. Ogita, S. M. Rump, and S. Oishi. Fast algorithms for floatingpoint interval matrix multiplication. J. Comput. Appl. Math., 236(7):1795–1814, 2012.
[21] N. Revol and Ph. Théveny. Parallel implementation of interval matrix multiplication. Reliab. Comput., 19(1):91–106, 2013.
[22] N. Revol and Ph. Théveny. Numerical reproducibility and parallel computations: issues for interval algorithms. IEEE Trans. Comput., 63(8):1915–1924, 2014.
[23] S. M. Rump. Fast and parallel interval arithmetic. BIT, 39(3):534–554, 1999.
[24] S. M. Rump. Verification methods: rigorous results using floatingpoint arithmetic. Acta Numer., 19:287–449, 2010.
[25] S. M. Rump. Fast interval matrix multiplication. Numer. Algor., 61(1):1–34, 2012.
[26] N. Stolte. Arbitrary 3D resolution discrete ray tracing of implicit surfaces. In E. Andres, G. Damiand, and P. Lienhardt, editors, Discrete Geometry for Computer Imagery, volume 3429 of Lect. Notes Comput. Sci., pages 414–426. Springer Berlin Heidelberg, 2005.
[27] A. J. Sommese and C. W. Wampler. The Numerical Solution of Systems of Polynomials Arising in Engineering and Science. World Scientific, 2005.
[28] J. Verschelde. Homotopy continuation methods for solving polynomial systems. PhD thesis, Katholieke Universiteit Leuven, 1996.
[29] J. Verschelde. PHCpack: a generalpurpose solver for polynomial systems by homotopy continuation. ACM Trans. Math. Software, 25(2):251–276, 1999. Algorithm 795.
[30] J. W. Von Gudenberg. Interval arithmetic on multimedia architectures. Reliab. Comput., 8(4):307–312, 2002.
[31] GCC, the GNU Compiler Collection. Software available at http://gcc.gnu.org, from 1987.
Let and be two complex numbers in . We will show that
(15) 
(16) 
(17) 
In addition, in absence of underflows, the terms in may be discarded.
For short we set , and we shall freely use the assumption . The first bound (15) is immediate. As for the second one, it will be convenient to use the norm , that satisfies . We begin with combining
and
into
We deduce that
whence
which simplifies into
It follows that
which simplifies into (16).
As to (17), we begin with and , that give
It follows that
from which we extract
We thus obtain that
We distinguish the two following cases:
If , then . Using
since the square root does not provoke underflows, we obtain
Otherwise , which means that both products and involve underflows. We restart the analysis from
which implies . If , then we are done, since . Otherwise , and , and we have