On high precision integration of
stiff differential equations

Joris van der Hoeven

CNRS, Laboratoire d'informatique

Campus de l'École polytechnique

1, rue Honoré d'Estienne d'Orves

Bâtiment Alan Turing, CS35003

91120 Palaiseau

France

Email: vdhoeven@lix.polytechnique.fr

Draft version, November 4, 2023

In this paper, we present various new algorithms for the integration of stiff differential equations that allow the step size to increase proportionally with time. We mainly focus on high precision integrators, which leads us to use high order Taylor schemes. We will also present various algorithms for certifying the numerical results, again with the property that the step size increases with time.

Keywords: stiff differential equation, numerical method, reliable computation, rigorous integration, Taylor models, multiple precision computations

A.M.S. subject classification: 65L04, 65L20, 65G20, 37-04

1. Introduction

Consider the differential equation

(1)

where is a diagonal matrix with positive real entries , and where is a polynomial forcing term. We are interested in analytic solutions , where is either an interval of the form with , or a larger subset of that contains an interval of this form. The components of this mapping will be written , whereas subscripts will be reserved for the coefficients of local power series solutions. A similar notation will be used for the components of other vectors and matrices.

If the largest eigenvalue of gets large with respect to the forcing term , then the equation (1) is said to be stiff. Generic numerical integration schemes experience difficulties with this kind of equation, whether we use Euler's method, a Runge–Kutta scheme, or high order algorithms based on Taylor expansions. Roughly speaking, the problem is that all generic methods become inaccurate when the step size exceeds . One of the most difficult cases is when is large, but each of the quotients remains reasonably close to one. This happens for instance when , a case that is naturally encountered when discretizing certain partial differential equations.

Implicit integration schemes allow for larger step sizes than generic methods, but involve the computation and inversion of Jacobian matrices, which may be expensive in high dimension . But even such more expensive schemes only seem to deal with the step size issue in a somewhat heuristic manner: one typically selects a scheme that is “stable” when applied to any equation for all , and then hopes that the scheme continues to behave well for the actual target equation (1). We refer to [25, Section 17.5] and [4, Section 6] for more information on classical integration schemes for stiff differential equations.

We are mainly interested in high precision computations, in which case it is natural to use high order integration schemes that are based on Taylor expansions. In this context, we are not aware of any numerical method that allows the step size to increase proportionally with time. The main aim of this paper is to present such a method, together with several variants, as well as a way to compute rigorous bounds for the error. Our fastest method is an explicit scheme, but its convergence deteriorates when the eigenvalues are close. The slower, rigorous counterparts rely on the computation of Jacobian matrices.

In sections 3 and 4, we recall various approaches that can be used to compute truncated series solutions to initial value problems and how to derive high order integration schemes from this. More precisely, given a numerical approximation for at time , we compute the first terms of a power series solution to the initial value problem

(2)

and return as an approximation of for some suitable step size . The inaccuracy of such schemes for larger step sizes is due to the fact that the -th coefficient of tends to change proportionally to for slight perturbations of . If is large, then even small multiples of quickly dominate , which severely limits the step sizes that we can use.

This numerical instability is an unpleasant artifact of traditional methods for the integration of differential equations. It is well known that solutions of stiff equations tend to become very smooth after a certain period of time (in which case one says that the system has reached its steady state), but this does not lead to larger step sizes, as one would hope.

In order to make this idea of “smoothness after a certain period of time” more precise, it is instructive to study the analytic continuation of in the complex plane. In section 2, for a fixed initial condition and a fixed forcing term , we show the existence of a compact half disk , , on which the solution of (1) is analytic, for any choice of . In particular, for , the radius of convergence of the exact Taylor expansion of at time is at least . Setting , Cauchy's formula thus yields the bound for all .

If , then this means that efficient integration schemes should allow for step sizes of the order at time , for some fixed constant . In section 5, we will present various such schemes. They are all based on the computation of good approximations of the exact Taylor expansion of at time and order . By what precedes, such an approximation cannot be obtained by solving (2). Instead, for a suitable so-called critical index that depends on , we will solve the following so-called steady-state problem:

(3)

Intuitively speaking, the system has reached a steady state for the eigenvalues with . Such eigenvalues are “large”, so it is more accurate to determine coefficients as a function of later coefficients rather than earlier ones. Furthermore, the coefficients are so “small” that they can be approximated by zeros; this explains the steady-state conditions . The remaining eigenvalues with are small enough for the step size under consideration that the dependence of on the initial condition is sufficiently moderate for (3) to admit an accurate solution. We say that the system (1) is in a transient state for such eigenvalues at time .

A final topic is the computation of rigorous error bounds for the numerical integration process. For differential equations that are not stiff, this is a classical theme in interval analysis [21, 15, 5, 16, 18, 19, 11]. It has long been an open problem to develop efficient reliable integrators for stiff differential equations. We report on progress in this direction in section 6.

2. Analytic continuation

Consider the equation (1) for a fixed initial condition and a fixed forcing term , but where we will allow ourselves to vary . Our main aim is to show that there exists a compact “half disk”

with on which the solution to (1) exists and is analytic, independently of the choice of . We will prove both a weaker, simpler version of this result and a stronger version that allows us to take larger radii and that is also more convenient when we actually want to compute a radius that works.

2.1. Notations

We will need a few more notations. We denote the interior of by

Given a non-empty open subset , we write for the Banach space of analytic functions with finite sup-norm

We will often use vector notation: given , , and , we write

In addition to such “componentwise” norms, we will also use more traditional sup-norms: given , , and an matrix , we define

2.2. Reminders from complex analysis

We will also need two well known results from complex analysis, namely an explicit version of the Cauchy-Kovalevskaya theorem and a variant of Montel's theorem.

Theorem 1. Let , , and consider the initial value problem

(4)

Writing , let

Then (4) admits a unique analytic solution on the open disk with center and radius

Proof. Our assumptions imply that

constitutes a “majorant equation” for (4) in the sense of Cauchy-Kovalevskaya. By symmetry, each of the components of this equation satisfies the simpler equation

which admits the explicit solution

of radius . Since the formal power series solution of (4) satisfies for all and , it follows that has radius of convergence at least .

Theorem 2. Given a non-empty subset of and a bounded sequence , we can extract a subsequence that converges uniformly to a limit on every compact subset of .

Proof. If , then this is Montel's theorem. The general case is obtained using an easy induction on : we first extract a subsequence such that converges to a limit in and then apply the induction hypothesis to this subsequence for the remaining components.

2.3. Analyticity on open half disks

Let denote the Jacobian matrix of .

Theorem 3. Let , , and be such that for all , we have

(5)
(6)

Then, for any choice of , the initial value problem

(7)

admits a unique analytic solution on . In addition, we have for all .

Proof. Consider the operator with

The fixed points of this operator are precisely the solutions of (7). Let us prove the existence of such a fixed point. The uniqueness follows by analytic continuation from the well known uniqueness of the solution of the initial value problem (7) on a neighbourhood of the origin.

Let us first notice that maps the ball

with center and radius into itself. Indeed, given and , our hypothesis (5) implies

For all , it follows that

whence , as desired.

Let us next show that is actually contracting on . Given , consider the homotopy with . From (6), we get

It follows that

and

This shows that is indeed contracting on . Since is complete as a closed bounded subspace of , we conclude that converges to a fixed point of .

2.4. A refinement for larger compact half disks

It turns out that the condition (6) on the Jacobian is not really needed. Moreover, the solution can further be extended to the closure of .

Theorem 4. Let and be such that for all , we have

(8)

Then, for any choice of , the initial value problem

(9)

admits a unique analytic solution on . In addition, we have for all .

Proof. Let be as in the proof of Theorem 3 and notice that still maps into itself. Now consider the sequence with . Applying Montel's theorem, we obtain a subsequence that converges uniformly to a limit on every compact subset of .

We claim that is a fixed point of . Indeed, for a sufficiently small with , we have for all with . As in the proof of Theorem 3, this means that is contracting on . Consequently, the restrictions of the functions to with tend to a fixed point , and so does the subsequence . Now this fixed point coincides with on and also solves the initial value problem (7). By analytic continuation, therefore solves the same initial value problem on , which completes the proof of our claim.

It remains to be shown that can be extended to an analytic function on that satisfies for all . Since on , we first notice that is bounded on , whence continuously extends to a function that is defined on the whole of , and we clearly have for all .

Now let us consider the unique solution to the initial value problem with . Theorem 1 provides us with a lower bound for the radius of convergence of , which depends continuously on . By the compactness of , it follows that there exists some with for all . Now consider on the boundary of and let be such that . Then and on some neighbourhood of . We conclude that is an analytic extension of to the open disk that contains .

3. Computing formal power series solutions

Before we present algorithms for the numeric integration of (1), let us first consider the question of how to compute formal power series solutions. Since we will later be considering power series solutions at various times , we will use the letter instead of for such solutions. Notice that a vector of formal power series can also be regarded as a formal power series with coefficients in .

Let be the initial condition. Setting , we are thus interested in the computation of truncated power series solutions to the equation

(10)

Alternatively, this equation can be rewritten in integral form

(11)

where the integration operator sends to .

In this section, we recall several approaches to the computation of power series solutions to such equations. The efficiency of each method can be measured in terms of the number of field operations in that are required to compute the first terms of the solution. For the time being, we assume that all computations are done at a fixed bit precision , so that operations in are done at unit cost. One may also take into account the number of scalar operations that are necessary for the evaluation of . A further refinement is to separately count the number of multiplications that are needed. For instance, if and , then we have and . We always assume that .

Iterative method
For the first method, we systematically compute with truncated power series of order in , which are also called jets of order . One addition in reduces to additions in and similarly for scalar multiplications with an element in . A naive multiplication in requires operations in , although this cost can be reduced to using FFT techniques. The integration operator sends to and can be computed in linear time.

Now assume that is an approximate solution to (11) whose first terms are correct. In other words, if is the actual solution and is a preimage of with , then . Given such an approximation, one iteration

(12)

of (11) yields a new approximation whose first terms are correct. Starting with , we thus obtain a solution modulo of (11) after at most iterations. The total cost of this computation is bounded by , or by when using FFT techniques.
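
To make this concrete, here is a minimal Python sketch of the iteration on truncated power series. The normalized right-hand side Φ(u) − Λu and the toy forcing term used at the end are assumptions for illustration only; truncated series ("jets") are stored as lists of coefficient vectors, and each pass through the integral equation fixes at least one further coefficient, as described above.

```python
import numpy as np

# A truncated power series ("jet") of order N in dimension d is stored as a
# list of N coefficient vectors; jet[k] is the k-th Taylor coefficient.

def jet_mul(a, b):
    """Naive Cauchy product of two jets, truncated at the common order."""
    N = min(len(a), len(b))
    return [sum(a[j] * b[k - j] for j in range(k + 1)) for k in range(N)]

def jet_integrate(a, c0):
    """Antiderivative with constant term c0: (c0, a0/1, a1/2, a2/3, ...)."""
    return [c0] + [a[k] / (k + 1) for k in range(len(a) - 1)]

def solve_series(phi_jet, lam, c, N, iters=None):
    """Iterate u <- c + integral(Phi(u) - Lambda*u) on jets of order N.

    This is the spirit of the iteration (12): each pass fixes at least one
    more coefficient, so N passes always suffice.  phi_jet maps a jet to the
    jet of the forcing term Phi evaluated at it; lam holds the diagonal of
    Lambda.  The precise shape of equation (1) is an assumption here.
    """
    c = np.asarray(c, dtype=float)
    u = [c] + [np.zeros(len(c)) for _ in range(N - 1)]
    for _ in range(N if iters is None else iters):
        phi_u = phi_jet(u)
        rhs = [phi_u[k] - lam * u[k] for k in range(N)]
        u = jet_integrate(rhs, c)[:N]
    return u

# Toy example (an assumption, not from the paper): Phi(u) = u*u componentwise.
lam = np.array([1.0, 100.0])
u = solve_series(lambda v: jet_mul(v, v), lam, c=[0.5, 0.5], N=8)
print(u[:3])
```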

Recurrence relations
Since is a polynomial, it is built from the components of using additions, multiplications, and scalar multiplications with constants in . For sums, scalar products, general products, and integrals of power series, we may extract their coefficients in using the following rules

(13)
(14)
(15)
(16)

Applying this recursively to the polynomial expression , the iteration (12) yields a recursion relation

(17)

that allows us to compute from , by induction on . Proceeding in this way, the computation of the solution at order requires operations.
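
As a toy illustration of these extraction rules (not an example taken from the paper, and assuming that (15) is the product rule and (16) the integration rule in the list above), consider the scalar equation u' = u² − λu with u(0) = c; the product and integration rules combine into a direct recurrence for the coefficients.

```python
def taylor_coeffs(c, lam, N):
    """Coefficients of the scalar toy problem u' = u^2 - lam*u, u(0) = c.

    The product rule gives [u^2]_k = sum_j u_j * u_{k-j}, and integrating the
    equation gives (k+1) * u_{k+1} = [u^2]_k - lam * u_k.
    """
    u = [c]
    for k in range(N - 1):
        square_k = sum(u[j] * u[k - j] for j in range(k + 1))
        u.append((square_k - lam * u[k]) / (k + 1))
    return u

print(taylor_coeffs(c=0.5, lam=10.0, N=6))
```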

The lazy power series approach
The above approach of computing the coefficients successively using the recursion relation (17) can be reformulated elegantly in the framework of lazy power series computations. The idea is to regard a power series as given by its stream of coefficients. Basic arithmetic operations on power series are implemented in a lazy manner: when computing the -th coefficient of a sum , a product , or an integral , we compute “just the coefficients that are needed” from and . The natural way to do this is precisely to use the relations (13–16).

The lazy power series approach has the important property that the -th coefficient of a product (say) becomes available as soon as and are given. This makes it possible to let and depend on the result , as long as the computation of and only involves previous coefficients of . As a consequence, the fixed point equation (11) can be solved simply by evaluating the right hand side using lazy power series arithmetic. Indeed, the -th coefficient only depends on previous coefficients of .
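
A minimal Python sketch of this style of computation (again for the toy scalar equation u' = u² − λu, an assumption for illustration): a series is a memoized coefficient rule, and the solution is obtained by literally writing down the fixed-point equation (11), since coefficient k of the right-hand side only consults coefficients of the solution below k.

```python
class Lazy:
    """A lazy power series: coefficient k is computed on demand and memoized."""
    def __init__(self, rule):
        self._rule, self._memo = rule, {}
    def __getitem__(self, k):
        if k not in self._memo:
            self._memo[k] = self._rule(k)
        return self._memo[k]

def lazy_mul(a, b):
    return Lazy(lambda k: sum(a[j] * b[k - j] for j in range(k + 1)))

def lazy_integrate(a, c0):
    return Lazy(lambda k: c0 if k == 0 else a[k - 1] / k)

def lazy_solution(c, lam):
    u = Lazy(None)                          # placeholder; its rule is tied below
    sq = lazy_mul(u, u)
    rhs = lazy_integrate(Lazy(lambda k: sq[k] - lam * u[k]), c)
    u._rule = rhs.__getitem__               # u now *is* the right-hand side of (11)
    return u

u = lazy_solution(c=0.5, lam=10.0)
print([u[k] for k in range(6)])             # matches taylor_coeffs above
```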

The relaxed power series approach
The main drawback of the lazy power series approach is that the computation of a product at order using (15) requires operations. Here we recall that FFT techniques allow us to compute a product in using only operations.

One essential observation is that, in order to solve (11) using the lazy power series approach, we only relied on the fact that each coefficient becomes available as soon as and are given. In fact, it is possible to design faster multiplication methods that still have this property; such methods are said to be relaxed or on-line. A relaxed multiplication method that computes a product at order in time was presented in [8] and can be traced back to [6]. An even faster algorithm of time complexity was given in [9].

Denoting by the cost of relaxed multiplication at order , the resolution of (11) at order now requires only operations.

Newton's method
Yet another idea to speed up computations is to replace the iteration (12) with a Newton-style iteration with faster convergence. This idea was first used by Brent and Kung [2] to show that (11) can be solved at order in time . However, this complexity analysis does not take into account the dependence on and . In particular, the dependence on of Brent and Kung's method is exponential [8]. Faster algorithms were proposed in [26, 1], based on the simultaneous computation of the solution and its first variation. This allows for the computation of a solution of (10) at order in time . Whenever , the computation time can be further reduced to [10]. For small , this leads to the asymptotically most efficient method for solving (10). For large , the computation of the first variation induces a overhead, and the relaxed method usually becomes more efficient.

4. Numerical integration using Taylor expansions

Let as before. The aim of this section is to present a naive algorithm for the numerical integration of the differential equation

(18)

based on the computation of truncated power series solutions at successive times . We will use a fixed expansion order , a fixed bit precision , but an adaptive step size . At , we are given an initial condition and our aim is to find a good numeric approximation for at time . The algorithm is not designed to be efficient when the equation gets stiff and we will present a heuristic discussion on what goes wrong when this happens.

4.1. Naive integration using Taylor series

Let us write for the Taylor series expansion of at time and for its truncation at order . In other words, setting

we have and

If the time at which we expand is clear from the context, then we will simply drop the postfix and write instead of . Conversely, for any other quantities that implicitly depend on , we will use the postfix to make this dependence explicit.

So let be the power series expansion of at a fixed time . In view of (18), this power series satisfies the equation

(19)

In the previous section, we have recalled various methods for computing truncated power series solutions as a function of the initial condition at time . In what follows, we will use any of these algorithms as a black box, and show how to devise a numerical integration scheme from that.

Obviously, given and an appropriate step size at time , the idea is to compute and simply evaluate

We next continue with in the role of and with a suitably adapted step size . The main question is therefore to find a suitable step size . Now the expected order of magnitude of is given by

(20)

Since we are computing with bits of precision, we wish to keep the relative error of our numeric integration scheme below , approximately. Therefore, we need to ensure that the truncation error

(21)

remains bounded by . Although we have not computed any of the coefficients for , a reasonable approximate upper bound for is given by

(22)

where is a small positive integer, called the number of guard terms. In order to protect ourselves against the occasional vanishing of , it is wise to take . Nevertheless, a small value such as or should usually provide acceptable upper estimates for . We now simply take the step size to be maximal with . This leads to the following algorithm for the numerical integration of (18):

Algorithm 1

Input: an initial condition and

Output: a numerical approximation for , where satisfies (18) with

,

while do

Compute the truncated solution to (19) with

Let be maximal such that , with and as in (20) and (22)

,

return
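
The following Python sketch assembles these ingredients into Algorithm 1, reusing solve_series from the sketch in section 3. The magnitude and tail estimates only mirror the spirit of (20)–(22); their exact form, and the doubling search for the largest admissible step size, are assumptions for illustration.

```python
import numpy as np

def step_size(u, p, g=2):
    """Roughly the largest h with estimated truncation error <= 2^-p times the
    estimated magnitude of the solution at t + h; the tail is estimated from
    the last g computed coefficients ("guard terms")."""
    N = len(u)
    def ok(h):
        mag = max(np.max(np.abs(u[k])) * h**k for k in range(N))
        tail = max(np.max(np.abs(u[N - 1 - i])) * h**(N - 1 - i) for i in range(g))
        return tail <= 2.0**(-p) * mag
    h = 2.0 ** -60
    while h < 2.0 ** 60 and ok(2 * h):      # double h while the tail stays small
        h *= 2
    return h

def integrate(phi_jet, lam, u0, T, N, p):
    """Naive Taylor integrator (Algorithm 1): expand, choose h, evaluate, repeat."""
    t, c = 0.0, np.asarray(u0, dtype=float)
    while t < T:
        u = solve_series(phi_jet, lam, c, N)        # truncated series at time t
        h = min(step_size(u, p), T - t)
        c = sum(u[k] * h**k for k in range(N))      # evaluate the truncation at t + h
        t += h
    return c
```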

4.2. Step size and error analysis

Let denote the exact solution of (18) with and let denote the approximation computed by Algorithm 1. In order to analyze various aspects of the algorithm, it is instructive to look at the radius of convergence of at time . Roughly speaking, with the notations from the previous subsection, the coefficients of the power series expansion of at time grow as

Ideally speaking, if we were able to compute with sufficient accuracy, then the coefficients should grow in a similar way. Setting for the step size in this ideal situation, we would then expect that , whence

(23)

This suggests taking the expansion order to be proportional to , after which the step size should be proportional to the radius of convergence .

Let us pursue this line of wishful thinking a little further. Theorem 4 implies that is analytic on a compact half disk that is independent of . In particular, we get that for . It can also be shown that the radius of convergence of at the origin is of the order or more. For stiff equations, we typically have . In order to integrate the equation until a time , we thus hope for a step size that increases geometrically from to . The entire integration would then require approximately steps. Using the most efficient algorithms from section 3, each step requires floating point operations (here the “flat Oh” notation stands for ). Since additions and multiplications of bit floating point numbers can be performed in time , the overall bit cost of the entire integration would thus be bounded by .

But are we indeed able to compute with enough accuracy in order to ensure that the coefficients grow according to ? Let us carefully analyze each of the sources of error for one step of our integration scheme. By construction, we ensured that the truncation error is of the order . One of the most intrinsic sources of error comes from the initial condition : since we are computing with a precision of bits, the mere representation of induces a relative error of the order of . Even when computing from with infinite precision, this intrinsic source of error cannot be removed.

Let us now study how the error in the initial condition affects the errors in the other coefficients . For this, we need to investigate the first variation of , which describes the sensitivity of the flow to the initial condition. More precisely, let be the power series solution of (19) at time as a function of the initial condition . Then the first variation is the matrix with entries . Dropping suffixes when they are clear from the context, the first variation satisfies the linear differential equation

Here we recall that stands for the Jacobian matrix of . If our equation is very stiff, then is small with respect to , which leads to the extremely crude approximation for the first variation.

Now the relative error in the initial condition is at best , as explained above, which means that for . In combination with the relation , this leads to the following errors for the other coefficients:

Now if , then the error dominates the actual value , which yields instead of . When choosing our step size such that , as in Algorithm 1, this yields and

(24)

instead of the desired step size . The actual bit cost of the complete integration is therefore bounded by instead of .

An interesting aspect of this analysis is the fact that the step size still turns out to be times larger than , whence larger orders allow for larger step sizes. However, this comes at the expense of a larger relative error than . Indeed, we have

This error is maximal for , in which case we have

This means that the relative error for one step of our integration method is instead of . In other words, we have “sacrificed” bits of precision, so that the method admits an “effective precision” of only bits.

The last source of errors for Algorithm 1 comes from rounding errors during the computation of from . The nature of these errors depends on the particular method that we used for computing the power series . Nevertheless, for most methods, the rounding errors only contribute marginally to the total error. This is due to the fact that for , so the rounding errors are absorbed by the errors induced by the error in the initial condition .

5. Fighting stiffness

Let us continue with the notations from section 4 and its subsection 4.2. In particular, we assume that the exact solution to (1) with initial condition is analytic on the compact half disk . We also denote by an upper bound for on , so that and

for all and , by Cauchy's theorem. From now on, we assume that , so that Algorithm 1 only allows us to use a step size of the order (24) instead of (23). The aim of this section is to present a new way to compute truncated power series solutions of the equation

(25)

at time , with the property that is a good approximation of the true truncated solution . In a similar way as in section 4, we will then use this to derive an algorithm for the numerical integration of (1). Contrary to before, the property that for allows us to take step sizes of the desired order (23) for .

5.1. Integrating stiff equations using Taylor series

We stress once more that the reduced step size for Algorithm 1 is a consequence of our choice to compute the truncated solution of (25) in terms of the initial condition (that can only be known approximately) and not of the choice of the particular algorithm that is used for this computation.

Indeed, as explained in section 4.2, even a small change in of the order of can have a dramatic effect on the solution , since the coefficient can change by as much as . Since for large , it follows that a minor change in leads to a completely incorrect computation of the coefficients with close to .

For small , the error generally does remain bounded by the actual value . In the theory of stiff differential equations, it is customary to say that the system is in a transient state for such . As soon as the error exceeds , we say that the system has reached its steady state for the largest eigenvalue of . When this happens, the system may still be in a transient state for some of the other eigenvalues . This motivates the definition of the critical index as the largest index such that we are in a transient state for ; we take if we have reached the steady state for all eigenvalues .

The concepts of transient state, steady state, and critical index are deliberately somewhat vague. As a tentative definition, we might say that we reached the steady state for if . However, for computational purposes, it is convenient to interpret this inequality as an approximate one. The crucial property of reaching the steady state for is that the smallness of essentially determines the -th component of the initial condition up to the last -th bit.

The main idea behind our algorithm is to use this property as a “boundary condition” for the computation of . More precisely, we boldly impose the boundary conditions as a replacement for the initial conditions , while keeping the remaining initial conditions for the transient states. In other words, we wish to find the truncated solution of the system

(26)

We will call the steady-state condition for and (26) a steady-state problem.

In order to solve (26), it is natural to adapt the iterative method from section 3 and introduce the operator with

(27)

The actual arithmetic is performed over , but it will be more convenient to view as an operator from into itself. The computation of as a fixed point of leaves us with two questions: how to choose a suitable ansatz for the iteration and how to determine the critical index ?

For the ansatz, we go back to the solution from the previous step at time and simply use as a first approximation of the solution at time . For , we fall back to the traditional method from section 4. For the initial and steady-state conditions to “propagate to the other end”, at least iterations are required in order to find a fixed point of , whereas iterations usually suffice. One may thus take .

As to the critical index , we determine it as a function of the step size through

(28)

This choice is meant to ensure the fastest convergence of the iteration and we used as an approximation of . The step size itself is chosen in the same way as in section 4. Since is not yet available for the computation of , we may use the previous step size as an approximation for in (28). Optionally, one may also wish to implement a few additional sanity checks in order to ensure convergence of the -iteration and decrease the step size in case of failure. One useful such check is to verify that .

Altogether, this leads to the following algorithm:

Algorithm 2

Input: an initial condition and

Output: a numerical approximation for , where satisfies (1) with

,

Compute the truncated solution to (19) with

while do

Let be maximal such that , with and as in (20) and (22)

,

Replace by the approximate fixed point of as in (27)

return
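
For concreteness, here is a Python sketch of one pass of the steady-state iteration and of a stand-in for the critical index. It assumes the normalized form u' + Λu = Φ(u), so that the transient components are propagated forward from their initial values while the steady-state components are solved backward from the imposed condition that the coefficient beyond the truncation order vanishes; the exact operator (27) and the formula (28) are not reproduced in this draft, so both sweeps and the heuristic below are assumptions in their details.

```python
import numpy as np

def steady_state_pass(u, phi_jet, lam, c, kappa):
    """One pass of the steady-state iteration on a jet u of order N.

    Assuming u' + Lambda*u = Phi(u), the coefficient relation reads
        (k+1) * u_{k+1} = [Phi(u)]_k - lam * u_k.
    Transient components (index < kappa) use it forward from the initial value;
    steady-state components (index >= kappa) solve it backward for u_k, with
    the condition u_N ~ 0 playing the role of the steady-state conditions.
    """
    N = len(u)
    c = np.asarray(c, dtype=float)
    phi_u = phi_jet(u)                     # forcing term evaluated on the current jet
    v = [coeff.copy() for coeff in u]
    v[0][:kappa] = c[:kappa]               # keep the transient initial conditions
    for k in range(N - 1):                 # forward sweep (transient part)
        v[k + 1][:kappa] = (phi_u[k][:kappa] - lam[:kappa] * v[k][:kappa]) / (k + 1)
    nxt = np.zeros_like(v[0])              # imposed condition: coefficient N ~ 0
    for k in range(N - 1, -1, -1):         # backward sweep (steady-state part)
        v[k][kappa:] = (phi_u[k][kappa:] - (k + 1) * nxt[kappa:]) / lam[kappa:]
        nxt = v[k]
    return v

def critical_index(lam, h, N):
    """Heuristic stand-in for (28), assuming lam is sorted increasingly: treat
    an eigenvalue as 'steady' once lam * h exceeds the expansion order N."""
    return int(np.searchsorted(lam, N / h))
```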

5.2. Convergence of the -iteration

Let us now study the convergence of as a mapping on the -dimensional vector space over . We define a norm on this space by

For a linear map on this space, represented as a matrix , we have the corresponding matrix norm

Recall that and stand for the Jacobian matrices of and .

Theorem 5. Assume that is a fixed point of and that and are constants such that for all and . Then

Proof. Consider an infinitesimal perturbation of with . Given and , we have

Similarly, given and , we have

Putting both relations together, we obtain .

Assume that

(29)

Then the theorem implies the existence of a small neighbourhood of the fixed point of on which the -iteration converges to . Whenever this condition is met, the -iteration actually tends to converge on a rather large neighbourhood that includes our ansatz; see the next subsection for a more detailed discussion. Intuitively speaking, the condition requires the eigenvalues to be sufficiently separated with respect to the norm of the forcing term . Even when the condition does not hold, the -iteration usually still displays an initial convergence for our ansatz, but the quality of the approximate solution to (26) ceases to improve after a while.

If the condition (29) is satisfied for all critical indices that we encounter when integrating from until time , then Algorithm 2 should produce an accurate result. The idealized analysis from section 4.2 then also applies, so the algorithm takes steps. Since each step now requires floating point operations at bit precision , we finally obtain the bound for the total running time.

5.3. Quality of the computed solution

Let us now investigate in more detail why fixed points of the -iteration indeed approximate the true solution quite well. For this, we will determine a more precise small neighbourhood of the fixed point on which the -iteration converges and show that this neighbourhood in particular contains . We start with two lemmas.

Lemma 6. Let be a matrix of analytic functions defined on the closed ball with center zero and radius . Let . Then for all .

Proof. Given with , we have

It follows that .

Lemma 7. Let , , , , and

Then for all with and , we have

Proof. Setting , we first notice that

It follows that

and we conclude using the previous lemma.

Theorem 8. Let and be such that

Let and be such that

Then the sequence tends to a unique fixed point on the set

Proof. A straightforward adaptation of the proof of Theorem 5 shows that on the ball , which means that on this ball. By induction on , it follows that and . We conclude that converges to a fixed point of in .

Returning to the Taylor series expansion of the exact solution of (1) at time , we notice that

It follows that

Now assuming that the intended step size was more or less achieved at the previous step, it follows that is of the desired order. If the condition (29) is indeed satisfied for , we thus should be able to apply Theorem 8 and conclude that the computed fixed point is within distance for the norm. Using a similar reasoning, we also see that the ansatz at the next step will be sufficiently close to the true solution for the -iteration to converge.

5.4. Solving steady-state problems through the first variation

A more robust but costly approach to solve the steady-state problem (26) is to compute the initial values with sufficient precision using Newton's method. Given a tentative approximation for the initial condition, we both compute and the first variation , after which we update by solving the linear system

This method admits quadratic convergence, but it requires us to compute with a precision of bits at least in order to be accurate. Indeed, this comes from the fact that grows roughly as . On the upside, we may compute and using any of the algorithms from section 3. The total running time is therefore bounded by . Notice also that it actually suffices to compute the last rows of , due to the requirement that .
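
A schematic Newton update in Python, following the verbal description only: series_and_variation is an assumed routine returning the truncated flow and its first variation for a given initial condition, and the choice of driving the trailing coefficients of the steady-state components to zero while adjusting exactly those components of the initial condition is an illustrative reading of the text, not the paper's precise linear system.

```python
import numpy as np

def newton_steady_state(series_and_variation, c, kappa, N, tol=1e-16, max_iter=20):
    """Newton iteration on the steady-state part of the initial condition.

    series_and_variation(c, N) is assumed to return (u, V), where u[k] is the
    k-th Taylor coefficient for initial condition c and V[k] is the k-th
    coefficient of the first variation (an n x n matrix).  The components
    c[kappa:] are adjusted so that the trailing coefficients u[N-1][kappa:]
    vanish, which stands in for the steady-state conditions.
    """
    c = np.array(c, dtype=float)
    for _ in range(max_iter):
        u, V = series_and_variation(c, N)
        residual = u[N - 1][kappa:]                # coefficients to be driven to zero
        if np.max(np.abs(residual)) < tol:
            break
        J = V[N - 1][kappa:, kappa:]               # only the last rows are needed
        c[kappa:] -= np.linalg.solve(J, residual)  # Newton update
    return c
```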

The main disadvantage of the above method is that the computation of the first variation along with induces an overhead of , which may be a problem for systems of high dimension. Let us now sketch how one might reduce this overhead by combining the variational approach with the -iteration. We intend to return to a more detailed analysis in future work.

Instead of using a single critical index , the first idea is to use a range of indices starting at , and such that the condition (29) holds for

The second idea is to only vary the components of the initial condition and use the steady-state conditions for the indices .

More specifically, for a tentative initial condition , we first solve the steady-state problem

using the -iteration technique. In a similar way, we next solve the following variational steady-state problem for the matrix :

As our ansatz, we may use for all . Having computed and at precision , we finally update by solving the linear system

and setting . We repeat this whole process until is sufficiently close to zero for .

Remark 9. The algorithms in this section are reminiscent of implicit numerical schemes for the integration of (1). One interesting difference is that our second optimized method only needs to compute a small part of the full Jacobian matrix.

Remark 10. Instead of imposing the exact steady-state conditions , yet another approach would be to minimize the norm under the condition that . This approach admits the advantages that one does not need to know and that it might be applied to more general complex matrices . However, it requires the computation of the full Jacobian matrix.

6. Certification

In the previous two sections, we have described numerical algorithms for integrating (1). An interesting question is how to compute a tight error bound such that the distance between the true and the computed solutions lies within this error bound. Ball arithmetic provides a suitable framework for such computations. This variant of interval arithmetic is well suited for high precision computations with complex numbers and we will briefly recall its basic principles in section 6.1. As a first application, we show how to make Theorem 4 more effective.

The certified integration of differential equations that are not stiff (i.e. the robust counterpart of section 4) is a classical topic in interval arithmetic [20, 21, 15, 5, 24, 16, 7, 22, 18, 14, 17, 23, 19]. For recent algorithms of good complexity, much in the spirit of the present paper, we refer to [11]. Most of the existing algorithms rely on Taylor series expansions as we do, while providing rigorous tail bounds for the truncation error.

The effective counterpart of Theorem 4 provides suitable tail bounds in the case of stiff differential equations. In sections 6.3 and 6.4, we show how to use this for the certified resolution of steady-state problems using the -iteration from section 5.1. Unfortunately, our first algorithms are rather naive and typically lead to heavily overestimated error bounds. The classical way to reduce this overestimation is to also compute the first variation and apply the mean value theorem: see section 6.5 for how to do this in our context.

6.1. Ball arithmetic

Given and , we write for the closed ball with center and radius . The set of such balls is denoted by or . One may lift the ring operations in to balls in , by setting:

These formulas are the simplest ones that satisfy the so-called inclusion principle: given , and , we have . This arithmetic for computing with balls is called exact ball arithmetic. It extends to other operations that might be defined on , as long as the ball lifts of operations satisfy the inclusion principle. Any ordinary complex number can be reinterpreted as a ball . Given a ball , we also notice that and provide us with reliable upper and lower bounds for in .

Another interesting operation on balls that we will need below is intersection. Assuming that the set intersection is non-empty, we define the ball intersection to be the ball of smallest radius that contains . In order to determine this ball intersection, we may assume without loss of generality that . If , then . Otherwise, let be the two (possibly identical) intersections of the circles and . If , then we still have . Otherwise, .
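
A minimal Python sketch of complex ball arithmetic with the ring operations above, using exact arithmetic (no outward rounding of the radii, which a floating point implementation would add, cf. Remark 11). The intersection shown is deliberately crude: it returns the smaller of the two balls, which always encloses the set intersection, whereas the text describes how to compute the smallest enclosing ball.

```python
from dataclasses import dataclass

@dataclass
class Ball:
    """Closed complex ball with center c and radius r."""
    c: complex
    r: float

    def __add__(self, other):
        # |x + y - (c1 + c2)| <= r1 + r2 for all x in self, y in other
        return Ball(self.c + other.c, self.r + other.r)

    def __mul__(self, other):
        # |x*y - c1*c2| <= |c1| r2 + |c2| r1 + r1 r2
        return Ball(self.c * other.c,
                    abs(self.c) * other.r + abs(other.c) * self.r + self.r * other.r)

    def contains(self, z):
        return abs(z - self.c) <= self.r

def crude_intersection(b1, b2):
    """A valid but non-minimal enclosure of the set intersection: the
    intersection is contained in each ball, so the smaller one encloses it."""
    return b1 if b1.r <= b2.r else b2

x, y = Ball(1.0 + 0.5j, 1e-10), Ball(2.0 - 1.0j, 1e-12)
print(x + y, x * y)
```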

It will also be convenient to extend vector notations to balls. First of all, we identify vectors of balls with “ball vectors” . Given and , we also write if and only if for . Similar remarks apply to ball matrices, ball series, etc.

Remark 11. For effective computations, one can only work with approximations of real and complex numbers of finite precision. The IEEE 754 standard specifies real floating point arithmetic based on the concept of “correct rounding”. It naturally generalizes to floating point numbers with a mantissa of bits and an exponent of bits [3]. Let and be the sets of real and complex floating point numbers of this type (strictly speaking, one also has for the reliable rounding of overflowing operations). One may adapt the above arithmetic on exact balls to floating point balls in while preserving the inclusion principle: it suffices to slightly increase the radii of output balls so as to take care of rounding errors. For precise formulas and interesting variations, we refer to [13]. For simplicity, the sequel will be presented for exact ball arithmetic only, but it is not hard to adapt the results to take into account rounding errors.

6.2. Computing bounds on compact half disks

The polynomials in (1) can either be represented as linear combinations of monomials, or using a dag (directed acyclic graph). In both cases, the ball arithmetic from the previous subsection allows us to reliably evaluate at balls in .

For the effective counterpart of Theorem 4, there is a trade-off between the qualities of the radius (larger radii being better) and the bound (smaller values of being better). For a given , it is simple to compute the “best” corresponding that satisfies the condition (8), by taking . Conversely, for a fixed , let us recall a classical technique from interval analysis that can be used to compute an “almost best” corresponding bound that satisfies (8).

We simply construct a sequence in by taking and for all . If there exists a finite for which (8) holds, then the sequence usually converges to a minimal such quite quickly. After say ten iterations, we should therefore have a reasonable approximation. One then slightly inflates the last value by taking . If , then we have succeeded in finding a suitable . If not, then we return “failed”.

Using the above procedure, we may also compute a reasonably large compact half disk on which is analytic, together with a bound for : we simply perform a dichotomic search for the largest for which the computation of a corresponding does not fail. If we also wish the bound to remain reasonably small, one may simply divide by two at the end of the search and compute the corresponding .
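
A Python sketch of this procedure. The map bound_from, which returns the best bound corresponding to a tentative value as in condition (8), is left abstract because the formula is not reproduced in this draft; the iteration count, inflation factor, and search bounds are illustrative choices.

```python
def find_bound(bound_from, m0=1.0, iters=10, inflate=1.01):
    """Iterate m <- bound_from(m), inflate slightly, then verify.

    bound_from(m) is assumed to return the best bound that condition (8)
    yields for the tentative value m.  Returns a certified bound, or None."""
    m = m0
    for _ in range(iters):
        m = bound_from(m)
    m *= inflate                          # slight inflation of the last iterate
    return m if bound_from(m) <= m else None

def largest_radius(certify, hi=1.0, steps=30):
    """Dichotomic search for (roughly) the largest rho <= hi for which
    certify(rho) succeeds; certify returns None on failure, as find_bound does."""
    lo = 0.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if certify(mid) is not None else (lo, mid)
    return lo
```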

6.3. Certified integration of stiff differential equations

Let us now return to the integration of (1) and let be the exact solution for the initial condition . Let and be computed as above such that (8) holds. Assume that we were able to reliably integrate (1) until a given time and let , so that .

In order to adapt the -iteration to ball arithmetic, we first deduce from Theorem 4 that for . This provides us with the required steady-state conditions , in addition to the initial conditions . The ball counterpart of our steady-state problem thus becomes

(30)

We correspondingly define the ball version of for by

This map has the property that . In our ball context, we may actually iterate using the following improved version of :

Here we understand that the intersections are taken coefficientwise. Since iterations using can only improve our enclosures, we do not need to worry much about the ansatz. We may deduce a reasonably good ansatz from the power series expansion at the previous time , and the Cauchy bounds for all . But it is perfectly reasonable as well to use

Applying a sufficient number of times to this ansatz, we obtain the desired ball enclosure of the truncated power series expansion of at time . Assuming that , we may then deduce the enclosure

(31)

of at time . Notice that the cost of the computation of certified solutions in this way is similar to the cost of Algorithm 1, up to a constant factor.
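
Schematically, the certified iteration can be written as below, reusing the Ball class and crude_intersection from section 6.1; psi_ball, the ball lift of the steady-state operator, is assumed to be available and to satisfy the inclusion principle. Because each iterate is intersected coefficientwise with the previous enclosure, the enclosures can only improve, which is why the choice of ansatz matters little.

```python
def certified_steady_state(psi_ball, ansatz, iters=20):
    """Iterate the ball version of the steady-state operator, intersecting each
    new enclosure coefficientwise with the previous one.  Coefficients are Ball
    objects (one per scalar coefficient of the truncated series)."""
    u = ansatz
    for _ in range(iters):
        v = psi_ball(u)
        u = [crude_intersection(a, b) for a, b in zip(u, v)]
    return u
```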

6.4. Certification of approximate solutions

The above method relies on the property that the steady-state problem (30) for balls specializes into a numerical steady-state problem that admits as a solution, since and . Given a numerical approximation of , as computed in section 5, a related problem is whether we can certify the existence of an actual solution in a small neighbourhood of the approximate solution.

In order to make this more precise, let us introduce a few more notations. Given and , consider the numeric steady-state problem

(32)

We will denote by any exact solution to this problem if such a solution exists. It will be convenient to regard and as elements of , through padding with zeros. Now consider the operator defined by

Through the iterated application of to a suitable ansatz, we may obtain a numerical approximation of .

Now consider balls and , again padded with zeros. Then we have the following ball analogue of (32):

(33)

Let and denote the centers of and . Starting from a numerical approximation of , we wish to compute the truncation of a solution to (33), with the property that for all and .

In order to do this, the idea is again to use a suitable ball version of , given by

and to keep applying on the ansatz until we are sufficiently close to a fixed point. Using an inflation technique similar to the one in section 6.2, we finally obtain a truncated series with , or “failed”. Now for any and , we notice that this also yields . Thinking of as a compact set of series in , this means in particular that admits a fixed point , as desired.

6.5. Curbing the wrapping effect

If , then the algorithm from subsection 6.3 specializes to a classical way to compute the enclosure (31) of a solution to (1) at time as a function of the enclosure at time . However, it is well known that this method suffers from a large overestimation of the errors, due to the so-called “wrapping effect”. A well known technique to reduce this overestimation is to first compute a certified solution for the initial condition instead of and then to bound the error due to this replacement by investigating the first variation. Let us now show how to extend this technique to general critical indices .

With in the role of and in the role of , we start with the computation of a truncated certified solution to (33). Recall that this ball solution has the property that for all . Since the initial conditions are exact, this solution is generally fairly accurate. Now for any , we may write

(34)

We next observe that

where denotes the exact solution of the following equation for the first steady-state variation:

This equation again admits a ball analogue

where is a solution to (30). We only compute a crude solution to this equation, for a crude solution to (30), both truncated to the order . These solutions can be obtained using the techniques from section 6.3. From (34) we finally conclude that

is a ball enclosure for . This enclosure is usually of a much better quality than . However, its computational cost is roughly times higher, due to the fact that we need to determine the first variation.

6.6. Handling close eigenvalues

In the case when is too small for (29) to hold, we may generalize the above strategy and incorporate some of the ideas from section 5.4. With and from there, let us briefly outline how to do this. We proceed in four stages:

7. Conclusion

There are several directions for generalizations and further improvements of the results in this paper. For simplicity, we assumed to be a diagonal matrix with positive eigenvalues. It should not be hard to adapt the results to arbitrary matrices with positive real eigenvalues only (as long as the forcing term does not explode under the change of coordinates that puts in Jordan normal form).

A more interesting generalization would be to consider complex eigenvalues with strictly positive real parts. In that case, the half disks would need to be replaced by smaller compact sectors of the form with . Even more generally, one may investigate how to develop accurate Taylor series methods for integrating arbitrary differential equations with the property that the step size is proportional to the radius of convergence of the exact solution; see also Remark 10.

In this paper, we focussed on high precision computations using high order Taylor schemes. Another interesting question is whether it is possible to develop analogues of our methods in the same spirit as more traditional Runge–Kutta schemes. How would such analogues compare to implicit integration schemes?

Concerning certified integration, the theory of Taylor models [18, 23, 19] allows for higher order expansions of the flow as a function of the initial condition (that is, beyond the computation of the first variation as in this paper). We think that it should be relatively straightforward to adapt our techniques to such higher order expansions. In a similar vein, we expect that the techniques from [7, 11] to further curb the wrapping effect can be adapted to the stiff setting.

From a more practical point of view, it would finally be interesting to have more and better machine implementations. For the moment, we have only made a toy implementation of the main algorithm from section 5.1 in the Mathemagix system [12]. We found this implementation to work as predicted on various simple examples. Detailed machine experiments with larger systems and the algorithms from sections 5.4 and 6 should make it possible to further improve the new techniques.

Bibliography

[1]

A. Bostan, F. Chyzak, F. Ollivier, B. Salvy, É. Schost, and A. Sedoglavic. Fast computation of power series solutions of systems of differential equations. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms, pages 1012–1021. New Orleans, Louisiana, U.S.A., January 2007.

[2]

R. P. Brent and H. T. Kung. Fast algorithms for manipulating formal power series. Journal of the ACM, 25:581–595, 1978.

[3]

R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010.

[4]

J. C. Butcher. Numerical methods for ordinary differential equations in the 20th century. J. of Computational and Applied Mathematics, 125:1–29, 2000.

[5]

J.-P. Eckmann, H. Koch, and P. Wittwer. A computer-assisted proof of universality in area-preserving maps. Memoirs of the AMS, 47(289), 1984.

[6]

M. J. Fischer and L. J. Stockmeyer. Fast on-line integer multiplication. Proc. 5th ACM Symposium on Theory of Computing, 9:67–72, 1974.

[7]

T. N. Gambill and R. D. Skeel. Logarithmic reduction of the wrapping effect with application to ordinary differential equations. SIAM J. Numer. Anal., 25(1):153–162, 1988.

[8]

J. van der Hoeven. Relax, but don't be too lazy. J. Symbolic Comput., 34:479–542, 2002.

[9]

J. van der Hoeven. New algorithms for relaxed multiplication. J. Symbolic Comput., 42(8):792–802, 2007.

[10]

J. van der Hoeven. Newton's method and FFT trading. J. Symbolic Comput., 45(8):857–878, 2010.

[11]

J. van der Hoeven. Certifying trajectories of dynamical systems. In I. Kotsireas, S. Rump, and C. Yap, editors, Mathematical Aspects of Computer and Information Sciences: 6th International Conference, MACIS 2015, Berlin, Germany, November 11-13, 2015, Revised Selected Papers, pages 520–532. Cham, 2016. Springer International Publishing.

[12]

J. van der Hoeven, G. Lecerf, B. Mourrain, et al. Mathemagix. From 2002. http://www.mathemagix.org.

[13]

J. van der Hoeven and G. Lecerf. Evaluating straight-line programs over balls. In Paolo Montuschi, Michael Schulte, Javier Hormigo, Stuart Oberman, and Nathalie Revol, editors, 2016 IEEE 23rd Symposium on Computer Arithmetic, pages 142–149. IEEE, 2016.

[14]

W. Kühn. Rigourously computed orbits of dynamical systems without the wrapping effect. Computing, 61:47–67, 1998.

[15]

O. E. Lanford. A computer-assisted proof of the Feigenbaum conjectures. Bull. of the AMS, 6:427–434, 1982.

[16]

R. Lohner. Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen. PhD thesis, Universität Karlsruhe, 1988.

[17]

R. Lohner. On the ubiquity of the wrapping effect in the computation of error bounds. In U. Kulisch, R. Lohner, and A. Facius, editors, Perspectives on enclosure methods, pages 201–217. Wien, New York, 2001. Springer.

[18]

K. Makino and M. Berz. Remainder differential algebras and their applications. In M. Berz, C. Bischof, G. Corliss, and A. Griewank, editors, Computational differentiation: techniques, applications and tools, pages 63–74. SIAM, Philadelphia, 1996.

[19]

K. Makino and M. Berz. Suppression of the wrapping effect by Taylor model-based validated integrators. Technical Report MSU Report MSUHEP 40910, Michigan State University, 2004.

[20]

R. E. Moore. Automatic local coordinate transformations to reduce the growth of error bounds in interval computation of solutions to ordinary differential equations. In L. B. Rall, editor, Error in Digital Computation, volume 2, pages 103–140. John Wiley, 1965.

[21]

R. E. Moore. Interval Analysis. Prentice Hall, Englewood Cliffs, N.J., 1966.

[22]

A. Neumaier. The wrapping effect, ellipsoid arithmetic, stability and confidence regions. Computing Supplementum, 9:175–190, 1993.

[23]

A. Neumaier. Taylor forms - use and limits. Reliable Computing, 9:43–79, 2002.

[24]

K. Nickel. How to fight the wrapping effect. In Proc. of the Intern. Symp. on Interval Mathematics, pages 121–132. Springer-Verlag, 1985.

[25]

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 3rd edition, 2007.

[26]

A. Sedoglavic. Méthodes seminumériques en algèbre différentielle ; applications à l'étude des propriétés structurelles de systèmes différentiels algébriques en automatique. PhD thesis, École polytechnique, 2001.