Numerical Analysis I

M.R. O’Donohoe

References:

S.D. Conte & C. de Boor, Elementary Numerical Analysis: An Algorithmic Approach, Third edition,1981. McGraw-Hill.

L.F. Shampine, R.C. Allen, Jr & S. Pruess, Fundamentals of Numerical Computing, 1997. Wiley.

David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACMComputing Surveys, Vol. 23, No. 1, March 1991§.

The approach adopted in this course does not assume a very high level of mathematics; in particular the level

is not as high as that required to understand Conte & de Boor’s book.

1. Fundamental concepts

1.1 Introduction

This course is concerned with numerical methods for the solution of mathematical problems on a computer,usually using floating-point arithmetic. Floating-point arithmetic, particularly the ‘IEEE Standard’, iscovered in some detail. The mathematical problems solved by numerical methods include differentiationand integration, solution of equations of all types, finding a minimum value of a function, fitting curves orsurfaces to data, etc. This course will look at a large range of such problems from different viewpoints, butrarely in great depth.

‘Numerical analysis’ is a rigorous mathematical discipline in which such problems, and algorithms for theirsolution, are analysed in order to establish the condition of a problem or the stability of an algorithm andto gain insight into the design of better and more widely applicable algorithms. This course contains someelementary numerical analysis, and technical terms like condition and stability are discussed, although themathematical content is kept to a minimum. The course also covers some aspects of the topic not normallyfound in numerical analysis texts, such as numerical software considerations.

In summary, the purposes of this course are:

(1) To explain floating-point arithmetic, and to describe current implementations of it.

(2) To show that design of a numerical algorithm is not necessarily straightforward, even for some simpleproblems.

(3) To illustrate, by examples, the basic principles of good numerical techniques.

(4) To discuss numerical software from the points of view of a user and of a software designer.

1.2 Floating-point arithmetic

1.2.1 Overview

Floating-point is a method for representing real numbers on a computer.

Floating-point arithmetic is a very important subject and a rudimentary understanding of it is a pre-requisitefor any modern numerical analysis course. It is also of importance in other areas of computer science:

§ A PostScript version is available as the file $CLTEACH/mro2/Goldberg.ps on PWF Linux.

1

almost every programming language has floating-point data types, and these give rise to special floating-point exceptions such as overflow. So floating-point arithmetic is of interest to compiler writers and designersof operating systems, as well as to any computer user who has need of a numerical algorithm. Sections 1.2.1,1.2.2, 1.7, 1.13 and 3.3.2 are closely based on Goldberg’s paper, which may be consulted for further detailand examples.

Until 1987 there was a lot in common between different computer manufacturers’ floating-pointimplementations, but no Standard. Since then the IEEE Standard has been increasingly adopted bymanufacturers though is still not universal. This Standard has taken advantage of the increased speed ofmodern processors to implement a floating-point model that is superior to earlier implementations becauseit is required to cater for many of the special cases that arise in calculations. The IEEE Standard givesan algorithm for the basic operations (+ − ∗ /

√) such that different implementations must produce the

same results, i.e. in every bit. (In these notes the symbol ∗ is used to denote floating-point multiplicationwhenever there might be any confusion with the symbol ×, used in representing the value of a floating-pointnumber.) The IEEE Standard is by no means perfect, but it is a major step forward: it is theoreticallypossible to prove the correctness of at least some floating-point algorithms when implemented on differentmachines under IEEE arithmetic. IEEE arithmetic is described in Section 1.13.

We first discuss floating-point arithmetic in general, without reference to the IEEE Standard.

1.2.2 General description of floating-point arithmetic

Floating-point arithmetic is widely implemented in hardware, but software emulation can be done althoughvery inefficiently. (In contrast, fixed-point arithmetic is implemented only in specialized hardware, becauseit is more efficient for some purposes.)

In floating-point arithmetic fractional values can be represented, and numbers of greatly different magnitude,e.g. 1030 and 10−30. Other real number representations exist, but these have not found widespreadacceptance to date.

Each floating-point implementation has a base β which, on present day machines, is typically 2 (binary),8 (octal), 10 (decimal), or 16 (hexadecimal). Base 10 is often found on hand-held calculators but rarelyin fully programmable computers. We define the precision p as the number of digits (of base β) held in afloating-point number. If di represents a digit then the general representation of a floating-point number is

±d0.d1d2d3 . . . dp−1 × βe

which has the value

±(d0 + d1β−1 + d2β

−2 + . . . + dp−1β−(p−1))βe

where 0 ≤ di < β.

If β = 2 and p = 24 then 0.1 cannot be represented exactly but is stored as the approximation1.10011001100110011001101× 2−4. In this case the number 1.10011001100110011001101 is the significand

(or mantissa), and −4 is the exponent. The storage of a floating-point number varies between machinearchitectures. For example, a particular computer may store this number using 24 bits for the significand, 1bit for the sign (of the significand), and 7 bits for the exponent in order to store each floating-point number in4 bytes. Two different machines may use this format but store the significand and exponent in the oppositeorder; calculations might even produce the same answers but the internal bit-patterns in each word will bedifferent.

Note also that the exponent needs to take both positive and negative values, and there are various conventionsfor storing these. However, all implementations have a maximum and minimum value for the signed exponent,

2

usually denoted by emax and emin†.

1 8 bits 23 bits

1 7 bits 24 bits ( = 6 hexadecimal digits)

sign exponent significand

sign exponent significand

IEEE single precision (e.g. Sun, Pentium) β = 2

IBM (System/370) single precision β = 16

Both of the above examples‡ have the most significant digit of the significand on the left. Some floating-pointimplementations have the most significant digit on the right, while others have sign, exponent and significandin a different order.

Although not a very realistic implementation, it is convenient to use a model such as β = 10, p = 3 toillustrate examples that apply more generally. In this format, note that the number 0.1 can be representedas 1.00 × 10−1 or 0.10 × 100 or 0.01 × 101. If the leading digit d0 is non-zero then the number is said tobe normalized; if the leading digit is zero then it is unnormalized. The normalized representation of anynumber, e.g. 1.00× 10−1, is unique.

In a particular floating-point implementation it is possible to use normalized numbers only, except that zerocannot be represented at all. This can be overcome by reserving the smallest exponent for this purpose, callit emin − 1. Then floating-point zero is represented as 1.0 × βemin−1.

1.2.3 The numerical analyst’s view of floating-point arithmetic

Ideally, in order to use classical mathematical techniques, we would like to ignore the use of floating-pointarithmetic. However, the use of such ‘approximate’ arithmetic is relevant, although the numerical analystdoes not want to be concerned with details of a particular implementation of floating-point. In this coursewe take a simplified view of machine arithmetic§. In general, arithmetic base and method of rounding are ofno concern, except to note that some implementations are better than others. The floating-point precision isimportant. We will define the parameter machine epsilon, associated with the precision of each floating-pointimplementation, and then apply the same analysis for all implementations.

Before making this powerful simplification, we consider some implications of the use of floating-pointarithmetic which will be swept under the carpet in so doing. These points often cause problems for thesoftware designer rather than the numerical analyst:

(1) Floating-point is a finite arithmetic. This is not too serious, but is worth reflecting on. On most computers,for historical reasons, an integer and a floating-point number each occupy one word of storage, i.e. thesame number of bits. Therefore there are the same number of representable numbers. For example, inIBM System/370 a word is 32 bits so there are 232 representable integers (between −231 and +231) and 232

representable floating-point numbers (between −2252 and +2252).

† The above description closely resembles the Brown model which we return to in 3.3.1.‡ IBM System/370 is an architecture that was formerly widely used for numerical computation, though less

so today. It serves in these notes as a good example to contrast with IEEE arithmetic.§ Except in Sections 1.7, 1.13 and 3.3.2 where floating-point arithmetic is discussed in detail.

3

(2) The representable floating-point numbers, under the arithmetic operations available on the computer, donot constitute a field†. It is trivial to show this by multiplying two large numbers, the product of which isnot representable (i.e. causes overflow). Less obvious deficiencies are more troublesome, e.g. in IEEE singleprecision there are representable numbers X such that 1/X is not representable (even approximately).

(3) The representable floating-point numbers are more dense near zero. We shall see that this is reasonable, butleads to complications before we get very far.

1.2.4 Overflow and underflow

For convenience we will often use the terms overflow and underflow to denote numbers that are outsidethe range that is representable. It is worth pointing out that both overflow and underflow are hazards innumerical computation but in rather different ways.

We can regard overflow as being caused by any calculation whose result is too large in absolute value tobe represented, e.g. as a result of exponentiation or multiplication or division or, just possibly, addition orsubtraction. This is potentially a serious problem: if we cannot perform arithmetic with ∞ then we musttreat overflow as an error. In a language with no exception handling there is little that can be done otherthan terminate the program with an error condition. In some cases overflow can be anticipated and avoided.In Section 1.13 we will see how IEEE arithmetic deals with overflow, and makes recovery possible in manycases.

Conversely, underflow is caused by any calculation whose result is too small to be distinguished from zero.Again this can be caused by several operations, although addition and subtraction are less likely to beresponsible. However, we can perform ordinary arithmetic using zero (unlike ∞) so underflow is less of aproblem but more insidious: often, but not always, it is safe to treat an underflowing value as zero. Thereare several exceptions. For example, suppose a calculation involving time uses a variable time step δt whichis used to update the time t (probably in a loop terminated at some fixed time) by assignments of the form

delta t := delta t ∗ some positive value

t := t + delta t

If the variable delta t ever underflows, then the calculation may go into an infinite loop. An ‘ideal’ algorithmwould anticipate and avoid such problems.

It is also worth mentioning that overflow and underflow problems can often be overcome by re-scaling acalculation to fit more comfortably into the representable number range.

1.3 Errors and machine epsilon

Suppose that x, y are real numbers not dangerously close to overflow or underflow. Let x∗ denote thefloating-point representation of x. We define the absolute error ε by

x∗ = x + ε

and the relative error δ byx∗ = x(1 + δ) = x + xδ

We can writeε = xδ

or, if x 6= 0,

δ =ε

x.

† A field is a mathematical structure in which elements of a set A obey the formal rules of ordinary arithmeticwith respect to a pair of operators representing addition and multiplication. The concepts of subtractionand division are implicit in these rules.

4

When discussing floating-point arithmetic, relative error seems appropriate because each number isrepresented to a similar relative accuracy. However, there is a problem when x = 0 or x is very closeto 0 so we will need to consider absolute error as well.

When we write x∗ = x(1 + δ) it is clear that δ depends on x. For any floating-point implementation, theremust be a number u such that |δ| ≤ u, for all x (excluding x values very close to overflow or underflow). Thenumber u is usually called the unit round off. On most computers 1∗ = 1. The smallest positive εm suchthat

(1 + εm)∗ > 1

is called machine epsilon or macheps. It is often assumed that u ' macheps .

Let ω represent any of the arithmetic operations + − ∗/ on real numbers. Let ω ∗ represent the equivalentfloating-point operation. We naturally assume that

xω∗y ' xωy.

More specifically, we assume that

xω∗y = (xωy)(1 + δ) (1.1)

for some δ (|δ| ≤ u). Equation (1.1) is very powerful and underlies backward error analysis which allows us touse ordinary arithmetic while considering the data to be ‘perturbed’. That is to say, approximate arithmeticapplied to correct data can be thought of as correct arithmetic applied to approximate data. The latter ideais easier to deal with algebraically and leads to less pessimistic analyses.

1.4 Error analysis

It is possible to illustrate the ‘style’ of error analyses by means of a very simple example. Consider theevaluation of a function f(x) = x2. We would like to know how the error grows when a floating-pointnumber is squared.

Forward error analysis tells us about worst cases. In line with this pessimistic view we will assume that allfloating-point numbers are inaccurately represented with approximately the same relative error, but againassume that numbers are not close to overflow or underflow.

Forward error analysis

We express the relative error in x by

x∗ = x(1 + δ).

Squaring both sides gives

(x∗)2 = x2(1 + δ)2

= x2(1 + 2δ + δ2)

' x2(1 + 2δ)

since δ2 is small, i.e. the relative error is approximately doubled.

Backward error analysis

To avoid confusion we now use ρ to denote the relative error in the result, i.e. we can write

[f(x)]∗ = x2(1 + ρ) (1.2)

5

such that |ρ| ≤ u‡. As ρ is small, 1+ρ > 0 so there must be another number ρ such that (1+ρ)2 = 1+ρwhere |ρ| < |ρ| ≤ u. We can now write

[f(x)]∗ = x2(1 + ρ)2

= f{x(1 + ρ)}.

We can interpret this result by saying that the error in squaring a floating-point number is no worsethan accurately squaring a close approximation to the number.

These results are dramatically different ways of looking at the same process. The forward error analysis tellsus that squaring a number causes loss of accuracy. The backward error analysis suggests that this is notsomething we need to worry about, given that we have decided to use floating-point arithmetic in the firstplace. There is no contradiction between these results: they accurately describe the same computationalprocess from different points of view.

Historically, forward error analysis was developed first and led to some very pessimistic predictions abouthow numerical algorithms would perform for large problems. When backward error analysis was developedin the 1950s by J.H. Wilkinson, the most notable success was in the solution of simultaneous linear equationsby Gaussian elimination (see Section 2.3).

We will now investigate how errors can build up using the five basic floating-point operations: + − ∗/ ↑,where ↑ denotes exponentiation. Again it would be convenient to use ordinary arithmetic but consider thefollowing problem: suppose numbers x and y are exactly represented in floating-point but that the resultof computing x ∗ y is not exactly represented. We cannot explain where the error comes from withoutconsidering the properties of the particular implementation of floating-point arithmetic.

As an example, consider double precision in IBM System/370 arithmetic. The value of macheps isapproximately 0.22 × 10−15. For the sake of simplicity we will assume that all numbers are representedwith the same relative error 10−15.

We can deal most easily with multiplication. We write

x∗

1 = x1(1 + δ1)

x∗

2 = x2(1 + δ2)

Thenx∗

1 × x∗

2 = x1x2(1 + δ1)(1 + δ2)

= x1x2(1 + δ1 + δ2 + δ1δ2)

Ignoring δ1δ2 because it is small, the worst case is when δ1 and δ2 have the same sign, i.e. the relative errorin x∗

1 × x∗

2 is no worse than |δ1| + |δ2|. Taking the IBM example, if we perform one million floating-pointmultiplications then at worst the relative error will have built up to 10 6.10−15 = 10−9.

We can easily dispose of division by using the binomial expansion to write

1/x∗

2 = (1/x2)(1 + δ2)−1 = (1/x2)(1 − δ2 + ...).

Then, by a similar argument, the relative error in x∗

1/x∗

2 is again no worse than |δ1| + |δ2|.

We can compute x∗

1 ↑ n, for any integer n by repeated multiplication or division. Consequently we can arguethat the relative error in x∗

1 ↑ n is no worse than n|δ1|.

This leaves addition and subtraction. Consider

x∗

1 + x∗

2 = x1(1 + δ1) + x2(1 + δ2)

= x1 + x2 + (x1δ1 + x2δ2)

= x1 + x2 + (ε1 + ε2)

‡ Equation (1.2) comes directly from (1.1) by setting x = y, δ = ρ and treating ω as ‘multiply’.

6

where ε1, ε2 are the absolute errors in representing x1, x2 respectively. The worst case is when ε1 and ε2

have the same sign, i.e. the absolute error in x∗

1 + x∗

2 is no worse than |ε1| + |ε2|.

Using the fact that (−x2)∗ = −x2 − ε2 we get that the absolute error in x∗

1 − x∗

2 is also no worse that|ε1| + |ε2|.

The fact that error build-up in addition and subtraction depends on absolute accuracy, rather thanrelative accuracy, leads to a particular problem. Suppose we calculate

√10 − π using a computer with

macheps ' 10−6, e.g. IBM System/370 single precision.

√10 = 3.16228

π = 3.14159√

10− π = 0.02069

Because these numbers happen to be of similar size the absolute error in representing each is around 3×10−6.We expect that the absolute error in the result is about 6 × 10−6 at worst. But as

x∗

1 − x∗

2 = x1 − x2 + (ε1 − ε2)

the relative error in x∗

1 − x∗

2 isε1 − ε2

x1 − x2

which, in this case, turns out to be about 3× 10−4. This means that the relative error in the subtraction isabout 300 times as big as the relative error in x1 or x2.

This problem is known as loss of significance. It can occur whenever two similar numbers of equal signare subtracted (or two similar numbers of opposite sign are added), and is a major cause of inaccuracy infloating-point algorithms. For example, if a calculation is being performed using, say, macheps = 10 −15 butone critical addition or subtraction causes a loss of significance with relative error, say, 10 −5 then the relativeerror achieved could be only 10−5.

However, loss of significance can be quite a subtle problem as the following two examples illustrate. Considerthe function sinx which has the series expansion

sinx = x − x3

3!+

x5

5!− ...

which converges, for any x, to a value in the range −1 ≤ sinx ≤ 1.

If we attempt to sum this series of terms with alternating signs using floating-point arithmetic we cananticipate that loss of significance will be a problem. Using a BBC micro (macheps ' 2 × 10 −10) thefollowing results may be obtained.

Example 1

Taking x = 2.449 the series begins

sinx = 2.449− 2.448020808 + 0.7341126024− ...

and ‘converges’ after 11 terms to 0.6385346144, with a relative error barely larger than macheps. Theobvious loss of significance after the first subtraction does not appear to matter.

Example 2

Taking x = 20 the series begins

sinx = 20− 1333.333333 + 26666.66667− ....

7

The 10th term, −43099804.09 is the largest and the 27th term is the first whose absolute value is lessthan 1. The series ‘converges’ after 37 terms to 0.9091075368 whereas sin 20 = 0.9129452507, givinga relative error of about 4 × 10−3 despite the fact that no individual operation causes such a greatloss of significance.

It is important to remember, in explaining these effects, that a sequence of addition/subtraction operationsis being performed.

In Example 1 the evaluation of 2.449−2.448020808 causes loss of significance, i.e. the relative error increasesby at least 1000 times although the absolute error is at worst doubled. Adding in the next term, 0.7341126024,again makes little difference to the absolute error but, because the new term is much larger, the relativeerror is reduced again. Overall, the relative error is small.

In Example 2 the loss of significance is caused by the fact that the final result is small compared withnumbers involved in the middle of the calculation. The absolute error in the result is the sum of the absoluteerror in each addition/subtraction. Calculations involving the largest term will contribute an absolute errorof nearly 10−2 so we could estimate the final relative error to be approximately 10−2/0.91, which is slightlylarger than the observed error.

Exercise 1a

In any language with floating-point arithmetic, write a program to evaluate

n∑

j=1

xj

(a) by summing over increasing values of j, and (b) by summing over decreasing values of j. Setn = 100000 and

x1 = 1,

xj = 1000ε/j, j = 2, 3, . . . n

where ε ' macheps. (You will need to discover, from documentation or otherwise, an approximatevalue of macheps for the precision that you are using.) Explain why the results of (a) and (b) aredifferent.

1.5 Solving quadratics

Solving the quadratic equation ax2 + bx + c = 0 appears, on the face of it, to be a very elementary problem.It is also a very important one, principally because it is often encountered as part of a larger calculation.Particularly important is the use of quadratic equations in matrix computations in which, typically, severalthousand quadratics might need to be solved in a straightforward calculation.

Such applications require an algorithm for solution of a quadratic equation that is robust in the sense that itwill not fail or give inaccurate answers for any reasonable representable coefficients a, b and c. We will nowinvestigate how easy this is to achieve.

It is well known that the solution of ax2 + bx + c = 0 can be expressed by the formula

x =−b ±

√b2 − 4ac

2a.

A problem arises if b2 >> |4ac| in which case one root is small. Suppose b > 0 so that the small root is givenby

x =−b +

√b2 − 4ac

2a(1.3)

8

with loss of significance in the numerator. The problem can be averted by writing

x =−b +

√b2 − 4ac

2a.−b−

√b2 − 4ac

−b−√

b2 − 4ac

and simplifying to get

x =−2c

b +√

b2 − 4ac(1.4)

so that the similar quantities to be summed are now of the same sign. Taking a = 1, b = 100000, c = 1 asan example and using a BBC micro again (macheps ' 2 × 10−10) we get

x = −1.525878906× 10−5 using(1.3)

x = −1.000000000× 10−5 using(1.4)

and the latter is as accurate as the machine precision will allow.

A robust algorithm must use equation (1.3) or (1.4) as appropriate in each case.

The solution of quadratic equations illustrates that even the simplest problems can present numericaldifficulties although, if anticipated, the difficulties may be circumvented by adequate analysis.

Exercise 1b

The Bessel functions J0(x), J1(x), J2(x), . . . satisfy the recurrence formula

Jn+1(x) = (2n/x)Jn(x) − Jn−1(x)

On a certain computer, when x = 2/11, the first two Bessel functions have the approximate values

J0(x) = 0.991752

J1(x) = 0.0905339

where there is known to be an error of about 0.5 in the last digit. Assuming 2/x evaluates to 11 exactly,and using six significant decimal digits, calculate J 2(x) and J3(x) from the formula and estimate theapproximate relative error in each. How accurately can J4(x) be calculated? (N.B. This exercise isdesigned so that the arithmetic is very simple, and a computer or calculator is unnecessary.)

When x = 20, usingJ0(x) = 0.167024

J1(x) = 0.0668331

J4(x) can be evaluated to the same relative accuracy as J0(x) and J1(x). How do you account forthis?

1.6 Convergence and error testing

An iterative numerical process is one in which a calculation is repeated to produce a sequence of approximatesolutions. If the process is successful, the approximate solutions will converge.

Convergence of a sequence is usually defined as follows. Let x0, x1, x2, ... be a sequence (of approximations)and let x be a number. We define

εn = xn − x.

The sequence converges iflim

n→∞

εn = 0

9

for some number x, which is called the limit of the sequence. The important point about this definition isthat convergence of a sequence is defined in terms of absolute error. In a numerical algorithm we cannotreally test for convergence as it is an infinite process, but we can check that the error is getting smaller, andconsequently that convergence is likely. Writers of numerical software tend to talk loosely of ‘convergencetesting’, by which they usually mean error testing. Some subtlety is required in error testing, as we will nowsee.

For simplicity, we now drop the suffix n.

Suppose that a numerical algorithm is being designed for a problem with a solution x. Let x be anapproximation to x, and let ε be an estimate of the absolute error in x, i.e.

x ' x + ε.

Note that, for a typical problem, it is unlikely that we can find the exact absolute error, otherwise we wouldbe able to calculate the exact solution from it.

Assume that a target absolute accuracy εt is specified. Then we could use a simple error test in which thecalculation is terminated when

|ε| ≤ εt. (1.5)

Consider floating-point arithmetic with macheps ' 10−16. If x is large, say 1020, and εt = 10−6 then ε isnever likely to be much less than 104, so condition (1.5) is unlikely to be satisfied even when the processconverges.

Despite the definition of convergence it appears that relative error would be better, although it is unsafe toestimate relative error in case x = 0. Writing δ t for target relative error, we could replace (1.5) with theerror test

|ε| ≤ δt|x|. (1.6)

In this case, if |x| is very small then δt|x| may underflow§ and then test (1.6) may never be satisfied (unlessε is exactly zero).

As (1.5) is useful when (1.6) is not, and vice versa, so-called mixed error tests have been developed. Inthe simplest form of such a test, a target error ηt is prescribed and the calculation is terminated when thecondition

|ε| ≤ ηt(1 + |x|) (1.7)

is satisfied. If |x| is small ηt may be thought of as target absolute error, or if |x| is large ηt may be thoughtof as target relative error.

A test like (1.7) is used in much modern numerical software, but it solves only part of the problem. We alsoneed to consider how to estimate ε. The simplest formula is

εn = xn − xn−1 (1.8)

where we again use the subscript n to denote the nth approximation. Suppose an iterative method is used

§ Note that the treatment of underflowed values may depend on both the floating-point architecture and theprogramming language in use.

10

to find the positive root of the equation x4 = 16 with successive approximations

x0 = 4

x1 = 0.5

x2 = 2.25

x3 = 2.25

x4 = 1.995

x5 = 2.00001

x6 = 1.999999998

....

If (1.8) is used then the test (1.7) will cause premature termination of the algorithm with the incorrect answer2.25. In case this contrived example is thought to be unlikely to occur in practice, theoretical research hasshown that such cases must always arise for a wide class of numerical methods. A safer test is given by

εn = |xn − xn−1| + |xn−1 − xn−2| (1.9)

but again research has shown that xn−2, xn−1 and xn can all coincide for certain methods so (1.9) is notguaranteed to work. In many problems, however, confirmation of convergence can be obtained independently:e.g. in the example above it can be verified by computation that 2.25 does not satisfy the equation x 4 = 16.

Another important problem associated with ‘convergence detection’ is that the ‘granularity’ of a floating-point representation leads to some uncertainty† as to the exact location of, say, a zero of a function f(x).Suppose that the mathematical function f(x) is smooth and crosses the x axis at some point a. Suppose alsothat f(x) is evaluated on a particular computer for every representable number in a small interval aroundx = a. The graph of these values, when drawn at a suitably magnified scale might appear as follows.

If we state that f(a) = 0 then there is uncertainty as to the value of a. Alternatively, if we fix a pointa, then there is uncertainty in the value of f(a) unless a has an exact representation in floating-point.Note particularly, that a 1-bit change in the representation of x might produce a change of sign in therepresentation of f(x).

This problem is easily avoided by specifying a target error that is not too close to macheps.

1.7 Rounding error in floating-point arithmetic

The term infinite precision is used below to denote the result of a basic floating point operation (+ − ∗/)as if performed exactly and then rounded to the available precision. Note that a calculation can actually beperformed in this way for the operations + − and ∗, but not for / as the result of a division may require

† Physicists may note that there is an analogy here with quantum physics and uncertainty principles.

11

infinitely many digits to store. However, the end result of infinite precision division can be computed by afinite algorithm.

Suppose β = 10 and p = 3. If the result of a floating-point calculation is 3.12 × 10−2 when the result toinfinite precision is 3.14 × 10−2, then there is said to be an error of 2 units in the last place. The acronymulp (plural ulps) is often used for this. When comparing a real number with a machine representation, itis usual to refer to fractions of an ulp. For example, if the number π (whose value is the non-terminatingnon-repeating decimal 3.1415926535 . . .) is represented by the floating-point number 3.14 × 10 0 then theerror may be described as 0.15926535 . . . ulp. The maximum error in representing any accurately knownnumber (except near to underflow or overflow) is 1/2 ulp.

The numerical analyst does not want to be bothered with concepts like ‘ulp’ because the error in a calculation,when measured in ulps, varies in a rather haphazard way and differs between floating-point architectures.The numerical analyst is more interested in relative error. An interesting question is: how does the relativeerror in an operation vary between different floating-point implementations? This problem is ‘swept underthe carpet’ in most numerical analysis. Subtraction of similar quantities is a notorious source of error. If afloating-point format has parameters β and p, and subtraction is performed using p digits, then the worstrelative error of the result is β − 1, i.e. base 16 can be 15 times less accurate than base 2. This occurs, forexample, in the calculation

1.000 . . .0 × βe − 0.ρρρ . . . ρ × βe

where ρ = β−1 if the least significant ρ is lost during the subtraction. This property implies that the smallerthe base the better, and that β = 2 is therefore optimal in this respect. Furthermore base 16, favoured byIBM, is a rather disadvantageous choice.

To see how this works in practice, compare the two representations β = 16, p = 1 and β = 2, p = 4. Bothof these require 4 bits of significand. The worst case for hexadecimal is illustrated by the representation of15/8. In binary, 15 is represented by 1.111 × 23 so that 15/8 is 1.111 × 20. In hexadecimal the digits arenormally written as

0 1 2 3 4 5 6 7 8 9 A B C D E F

so the number 15 is represented simply by F × 160 whereas 15/8 can be represented no more accuratelythan 1 × 160‡. Although this example is based on 4-bit precision, it is generally true that 3 bits of accuracycan be lost in representing a number§ in base 16.

A relatively cheap way of improving the accuracy of floating-point arithmetic is to make use of extra digits,called guard digits, in the computer’s arithmetic unit only. If x and y are numbers represented exactly on acomputer and x − y is evaluated using just one guard digit, then the relative error in the result is less than2 × macheps. Although this only applies if x and y are represented exactly, this property can be used toimprove expression evaluation algorithms.

In this Section the rather loosely defined term mathematically equivalent is used to refer to differentexpressions or formulae that are equivalent in true real arithmetic.

For example, consider the evaluation of x2 − y2 when x ' y, given exactly known x and y. If x2 or y2

is inexact when evaluated then loss of significance will usually occur when the subtraction is performed.However, if the (mathematically equivalent) calculation (x+ y)(x− y) is performed instead then the relativeerror in evaluating x − y is less than 2 × macheps provided there is at least one guard digit, because x andy are exactly known. Therefore serious loss of significance will not occur in the latter case.

Careful consideration of the properties of floating-point arithmetic can lead to algorithms that make littlesense in terms of real arithmetic. Consider the evaluation of ln(1+x) for x small and positive. Suppose thereexists a function LN(y) that returns an answer accurate to less than 1/2 ulp if y is accurately represented. If

‡ If the division is performed by repeated subtraction using p digits only.§ There is a compensating advantage in using base 16: the exponent range is greater, but this is not really as

useful as increased accuracy.

12

x < macheps then 1+x evaluates to 1 and LN(1) will presumably return 0. A more accurate approximationin this case is ln(1 + x) ' x. However there is also a problem if x > macheps but x is still small, becauseseveral digits of x will be lost when it is added to 1. In this case, ln(1+x) ' LN(1+x) is less accurate thanthe (mathematically equivalent) formula

ln(1 + x) ' x ∗ LN(1 + x)/((1 + x) − 1).

Indeed it can be proved that the relative error is at most 5 × macheps for x < 0.75, provided at least oneguard digit is used†.

The cost of a guard digit is not high in terms of processor speed: typically 2% per guard digit for a doubleprecision adder. However the use of guard digits is not universal, e.g. Cray supercomputers did not havethem.

The idea of rounding is fairly straightforward. Consider base 10, as it is familiar, and assume p = 3. Thenumber 1.234 should be rounded to 1.23 and 1.238 should be rounded to 1.24. Most computers performrounding, although it is only a user-specifiable option on some architectures. Curiously, IBM System/370arithmetic has no rounding and merely chops off extra digits.

The only point of contention is what to do about rounding in half-way cases: should 1.235 be rounded to1.23 or to 1.24? One method is to specify that the digit 5 rounds up, on the grounds that half of the digitsround down and the other half round up. Unfortunately this introduces a slight bias§.

An alternative is to use ‘round to even’: the number 1.235 rounds to 1.24 because the digit 4 is even, whereas1.265 rounds to 1.26 for the same reason. The slight bias of the above method is removed. Reiser & Knuth(1975) highlighted this by considering the sequence

xn = (xn−1 − y) + y

where n = 0, 1, 2, . . . and the calculations are performed in floating point arithmetic. Using ‘round to even’they showed that xn = x1 for all n, for any starting values x0 and y. This is not true of the ‘5 rounds up’method. For example, let β = 10, p = 3 and choose x0 = 1.00 and y = −5.55 × 10−1. This leads to thesequence x1 = 1.01, x2 = 1.02, x3 = 1.03, . . . with 1 ulp being added at each stage. Using ‘round to even’,xn = 1.00 for all n.

It is also worth mentioning that probabilistic rounding has been suggested, i.e. rounding half-way cases upor down at random. This is also unbiased and has the advantage of reducing the overall standard deviationof errors, so could be considered more accurate, but it has the disadvantage that repeatability of calculationsis lost. It is not implemented on any modern hardware.

Finally, it is often argued that many of the disadvantages of floating-point arithmetic can be overcome byusing arbitrary (or variable) precision rather than a fixed precision. While there is some truth in this claim,efficiency is important and arbitrary precision algorithms are very often prohibitively expensive. This subjectis not discussed here.

Exercise 1c

Explain the term infinite precision. Consider a floating-point implementation with β = 10, p = 2.What are the infinite precision results of the calculations

(a) (1.2 × 101) ∗ (1.2 × 101) (b) (2.0 × 100)/(3.0 × 100)

How many decimal guard digits are required in the significand of the arithmetic unit to calculate (b)to infinite precision?

† See Goldberg’s paper for an explanation of how this works.§ The bias is more serious in binary where ‘1 rounds up’.

13

1.8 Norms

Up to now we have discussed only scalar problems, i.e. problems with a single number x as solution. Themore important practical numerical problems often have a vector x = {x 1, x2, ...xn} of solutions‡. Althoughthis is a major complication, it is surprising how often a numerical method for an n-dimensional problemcan be derived from a scalar method simply by making the appropriate generalization in the algebra. Whenprogramming a numerical algorithm for a vector problem it is still desirable to use an error test like (1.7)but it is necessary to generalize this by introducing a measurement of the size of a vector.

In order to compare the size of vectors we need a scalar number, called the norm, which has propertiesappropriate to the concept of ‘size’:

(1) The norm should be positive, unless the vector is {0, 0, ...0} in which case it should be 0.

(2) If the elements of the vector are made systematically smaller or larger then so should the norm.

(3) If two vectors are added or subtracted then the norm of the result should be suitably related to thenorms of the original vectors.

The norm of a vector x is usually written ‖x‖ and is any quantity satisfying the three axioms:

(1) ‖x‖ ≥ 0 for any x. ‖x‖ = 0 ⇐⇒ x = 0 = {0, 0, . . .0}.

(2) ‖cx‖ = |c|.‖x‖ where c is any scalar.

(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ where x and y are any two vectors.

The norms most commonly used are the so-called lp norms in which, for any real number p ≥ 1,

‖x‖p ={

n∑

i=1

|xi|p}

1p

.

In practice, only the cases p = 1 or 2 or limp→∞ ‖x‖p are used. The relevant formulae are:

‖x‖1 =

n∑

i=1

|xi| l1 norm

‖x‖2 =

√

√

√

√

n∑

i=1

x2i l2 norm

‖x‖∞ = maxi

|xi| l∞ norm

Alternative names for the l2 norm are: Euclidean norm, Euclidean length, length, mean.Alternative names for the l∞ norm are: max norm, uniform norm, Chebyshev norm.

Consider the following example from an iterative process with a vector of solutions:

x1 = {1.04, 2.13, 2.92,−1.10}x2 = {1.05, 2.03, 3.04,−1.04}x3 = {1.01, 2.00, 2.99,−1.00}.

This appears to be converging to something like {1, 2, 3,−1}, but we need to be able to test for this in aprogram. If we use, for example, a vector generalization of (1.8), i.e. en = xn − xn−1 then we get

e2 = {0.01,−0.10, 0.12, 0.06}e3 = {−0.04,−0.03,−0.05, 0.04}.

‡ Note that xn does not have the same meaning as in the last section.

14

The norms of these error vectors are as follows

l1 l2 l∞e2 0.29 0.17 0.12e3 0.16 0.08 0.05

Note that in each case ‖e3‖ < ‖e2‖ so it does not matter, for ‘convergence testing’ purposes, which norm isused.

A simple vector generalization of the error test (1.7) is

‖e‖ ≤ ηt(1 + ‖x‖)

where ηt is the prescribed (scalar) error and any suitable norm is used. (Of course, the same norm must beused for both ‖e‖ and ‖x‖ so that they can be compared.)

1.9 Condition of a problem

The condition of a numerical problem is a qualitative or quantitative statement about how easy it is to solve,irrespective of the algorithm used to solve it.

As a qualitative example, consider the solution of two simultaneous linear equations. The problem may bedescribed graphically by the pair of straight lines representing each equation: the solution is then the pointof intersection of the lines. (The method of solving a pair of equations by actually drawing the lines withruler and pencil on graph paper and measuring the coordinates of the solution point may be regarded as analgorithm that can be used.) Two typical cases are illustrated below:

well conditioned ill conditioned

The left-hand problem is easier to solve than the right hand one, irrespective of the graphical algorithm used.For example, a better (or worse) algorithm is to use a sharper (or blunter) pencil: but in any case it shouldbe possible to measure the coordinates of the solution more exactly in the left-hand case than the right.

Quantitatively, the condition K of a problem is a measure of the sensitivity of the problem to a smallperturbation. For example (and without much loss of generality) we can consider the problem of evaluatinga differentiable function f(x). Let x be a point close to x. In this case K is a function of x defined as therelative change in f(x) caused by a unit relative change in x. That is

K(x) = limx→x

|[f(x) − f(x)]/f(x)||(x − x)/x|

=

x

f(x)

limx→x

|f(x) − f(x)||x − x|

=

x.f ′(x)

f(x)

15

from the definition of f ′(x).

Example 3

Suppose f(x) =√

x. We get

K(x) =

x.f ′(x)

f(x)

=x.[ 12/

√x]√

x=

1

2.

So K is a constant which implies that taking square roots is equally well conditioned for all x non-negative, and that the relative error is reduced by half in the process.

Example 4

Suppose now that f(x) = 11−x

. In this case we get

K(x) =

x.f ′(x)

f(x)

=

x[1/(1 − x)2]

1/(1− x)

=

x

1 − x

.

This K(x) can get arbitrarily large for values of x close to 1 and can be used to estimate the relativeerror in f(x) for such values, e.g. if x = 1.000001 then the relative error will increase by a factor ofabout 106.

Exercise 1d

Examine the condition of the problem of evaluating cosx.

1.10 Stability of an algorithm

If we represent the evaluation of a function by the rather informal graph

x f(x)

then we can represent an algorithm as a sequence of functions

1 1f (x )

2 2f (x )

3 3f (x )

x43x

2x

1x

x

f(x)

where x = x0 and xn = f(x). An algorithm is a particular method for solving a given problem, in this casethe evaluation of a function.

Alternatively, an algorithm can be thought of as a sequence of problems, i.e. a sequence of functionevaluations. In this case we consider the algorithm for evaluating f(x) to consist of the evaluation ofthe sequence x1, x2, . . . xn. We are concerned with the condition of each of the functions f1(x1), f2(x2),. . . fn−1(xn−1) where f(x) = fi(xi) for all i.

16

An algorithm is unstable if any fi is ill-conditioned, i.e. if any f i(xi) has condition much worse than f(x).Consider the example

f(x) =√

x + 1 −√

x

so that there is potential loss of significance when x is large. Taking x = 12345 as an example, one possiblealgorithm is

x0: = x = 12345

x1: = x0 + 1

x2: =√

x1

x3: =√

x0

f(x): = x4: = x2 − x3

(1.10)

The loss of significance occurs with the final subtraction. We can rewrite the last step in the formf3(x3) = x2 − x3 to show how the final answer depends on x3. As f ′

3(x3) = −1 we have the condition

K(x3) =

x3f′

3(x3)

f3(x3)

=

x3

x2 − x3

from which we find K(x3) ' 2.2 × 104 when x = 12345. Note that this is the condition of a subproblem

arrived at during the algorithm.

To find an alternative algorithm we write

f(x) = (√

x + 1 −√

x)

√x + 1 +

√x√

x + 1 +√

x

=1√

x + 1 +√

x

This suggests the algorithmx0: = x = 12345

x1: = x0 + 1

x2: =√

x1

x3: =√

x0

x4: = x2 + x3

f(x): = x5: = 1/x4

(1.11)

In this case f3(x3) = 1/(x2 + x3) giving a condition for the subproblem of

K(x3) =

x3f′

3(x3)

f3(x3)

=

x3

x2 + x3

which is approximately 0.5 when x = 12345, and indeed in any case where x is much larger than 1.

Thus algorithm (1.10) is unstable and (1.11) is stable for large values of x. In general such analyses arenot usually so straightforward but, in principle, stability can be analysed by examining the condition of asequence of subproblems.

Exercise 1e

Suppose that a function ln is available to compute the natural logarithm of its argument. Considerthe calculation of ln(1 + x), for small x, by the following algorithm

x0 := x

x1 := x0 + 1

f(x) := x2 := ln(x1)

17

By considering the condition K(x1) of the subproblem of evaluating ln(x1), show that such a functionln is inadequate for calculating ln(1 + x) accurately.

1.11 Order of convergence

Let x0, x1, x2, . . . be a sequence of successive approximations in the solution of a numerical problem. Wedefine the absolute error in the nth approximation by

xn = x + εn.

If there exist constants p and C such that

limn→∞

εn+1

εpn

= C

then the process has order of convergence p, where p ≥ 1. This is often expressed as

|εn+1| = O(|εn|p)

using the O-notation. We can say that order p convergence implies that

|εn+1| ' C|εn|p

for sufficiently large n. Obviously, for p = 1 it is required that C < 1, but this is not necessary for p > 1.

Various rates of convergence are encountered in numerical algorithms. Although not exhaustive, the followingcategorization is useful.

p = 1: linear convergence. Each interation produces the same reduction in absolute error. This isgenerally regarded as being too slow for practical methods.

p = 2: quadratic convergence. Each iteration squares the absolute error, which is very satisfactoryprovided the error is small. This is sometimes regarded as the ‘ideal’ rate of convergence.

1 < p < 2: superlinear convergence. This is not as good as quadratic but, as the reduction inerror increases with each interation, it may be regarded as the minimum acceptable rate for a usefulalgorithm.

p > 2. Such methods are unusual, mainly because they are restricted to very smooth functions andoften require considerably more computation than quadratic methods.

Exponential rate of convergence. This is an interesting special case.

1.12 Computational complexity

The ideal algorithm should not only be stable, and have a fast rate of convergence, but should also have areasonable computational complexity. It is possible to devise a stable algorithm that has, say, a quadraticrate of convergence but is still too slow to be practical for large problems. Suppose that a problem involvescomputations on a matrix of size N ×N . If the computation time increases rapidly as N increases then thealgorithm will be unsuitable for calculations with large matrices.

Suppose that some operation, call it �, is the most expensive in a particular algorithm. If the time spentby the algorithm may be expressed as O[f(N)] operations of type �, then we say that the computational

complexity is f(N).

In matrix calculations, the most expensive operations are multiplication and array references. For thispurpose the operation � may be taken to be the combination of a multiplication and one or more array

18

references. Useful matrix algorithms typically have computational complexity N 2, N3 or N4. Theseeffectively put limits on the sizes of matrices that can be dealt with in a reasonable time. As an example,consider multiplication of matrices A = (a ij) and B = (bij) to form a product C = (cij). This requires N 2

sums of the form

cij =N

∑

k=1

aik.bkj

each of which requires N multiplications (plus array references). Therefore the computational complexity isN2.N = N3.

Note that operations of lower complexity do not change the overall computational complexity of an algorithm.For example, if an N 2 process is performed each time an N 3 process is performed then, because

O(N2) + O(N3) = O(N3)

the overall computational complexity is still N 3.§

1.13 The IEEE Floating-point Standards

There are in fact two IEEE Standards, but in this section it is only occasionally necessary to distinguishbetween them as they are based on the same principles.

The Standard known as IEEE 754 requires that β = 2 and specifies the precise layout of bits in both single

precision (p = 24) and double precision (p = 53).

The other Standard, IEEE 854, is more general and allows either β = 2 or β = 10. The values of p forsingle and double precision are not exactly specified, but constraints are placed on allowable values for them.Base 10 was included mainly for calculators, for which there is some advantage in using the same numberbase for the calculations as for the display. However, bases 8 and 16 are not supported by the Standard.

IEEE arithmetic has been implemented by several manufacturers, e.g. by Sun, but not by all. Someimplementations are half-hearted, i.e. they contravene the Standard in important ways. The IEEE 754Standard made several features optional, which has meant they have been widely ignored; these features arelargely ignored here also.

If binary is used, one extra bit of significance can be gained because the first bit of the significand of anormalized number must always be 1, so need not be stored. Such implementations are said to have a hidden

bit. Note that this device does not work for any other base.

IEEE 754 single precision uses 1 bit for the sign (of the significand), 8 bits for the exponent, and 23 bitsfor the significand. However, p = 24 because a hidden bit is used. Zero is represented by an exponent ofemin−1 and a significand of all zeros. There are other special cases also, as discussed below. Single precisionnumbers are designed to fit into 32 bits and double precision numbers, which have both a higher precision†and an extended exponent range, are designed to fit into 64 bits. IEEE 754 also defines ‘double extended’precision, but this is only specified in terms of minimum requirements: it is intended for cases where a littleextra precision is essential in a calculation. The table below summarizes these precisions.

§ If, however, the N2 process was performed N 2 times each time the N 3 process was performed then thecomputational complexity would be N 2.N2 = N4.

† Note that double precision has more than double the precision of single precision.

19

Single Double Double Extended

p 24 53 ≥ 64

emax +127 +1023 ≥ +16383

emin −126 −1022 ≤ −16382

exponent width (bits) 8 11 ≥ 15

total width (bits) 32 64 ≥ 79

Note that ‘double extended’ is sometimes referred to as ‘80-bit format’, even though it only requires 79 bits.This is for the extremely practical reason that ‘double extended’ is often implemented by software emulation,rather than hardware, so it is safer to assume that the ‘hidden bit’ might not be hidden after all!

The IEEE Standard uses 1 bit for the sign of the significand (0 positive, 1 negative) and the remaining bitsfor its magnitude‡.

Exponents are stored using a biased representation. If e is the exponent of a floating-point number, then thebit-pattern of the stored exponent has the value e + emax when it is interpreted as an unsigned integer.

Under IEEE, the operations + − ∗ / must all be performed accurately, i.e. as in infinite precision, roundedto the nearest correct result, using ‘round to even’. This can be implemented efficiently, but requires aminimum of two guard digits plus one extra bit, i.e. three guard digits in the binary case.

The IEEE Standard specifies that the square root and remainder operations must be correctly rounded, andalso conversions between integer and floating-point types. The conversion between internal floating-pointformat and decimal for display purposes must be done accurately, apart from numbers close to overflow orunderflow.

The IEEE Standard does not specify that transcendental functions be exactly rounded for two reasons:(1) correct implementation of ‘round to even’ can in principle require an arbitrarily large amount ofcomputation, and (2) no algorithmic definitions were found to be suitable across all machine ranges.

It has been pointed out that the IEEE Standard has failed to specify how inner products of the form

n∑

i=1

xi.yi

are calculated, as incorrect answers can result if extra precision is not used. It would have been feasible tostandardize the calculation of inner products, and this is a regrettable omission.

Some older floating-point representations, e.g. that used on IBM System/370 mainframes, treat every bitpattern as a valid floating-point number. In some programming languages, e.g. Standard Fortran 77, therewas no exception handling, so there was little prospect of recovering from certain kinds of numerical error. Forexample, using strictly Standard Fortran 77 on the above computer, and real arithmetic, calling the squareroot function with the argument −4 was catastrophic. An error message was printed and the programstopped; an option did allow the program to continue after such an error, and the square root function thenhad to return some valid floating-point result — but none of them was correct!

The IEEE Standard attempts to tackle this problem without assuming that exception handling is available.Special values (all with exponents emax + 1 or emin − 1) are reserved for the special quantities ±0, ±∞,

‡ The ‘sign/magnitude’ method was preferred to the alternative ‘2s-complement’ method in which thesignificand is represented by the smallest non-negative number that is congruent to it modulo 2 p.

20

denormal numbers, and the concept of NaN (Not a Number). There are two (signed) values of zero,although IEEE is at pains to specify that they are indistinguishable in normal arithmetic. Operations thatoverflow yield ±∞, and operations that underflow yield ±0. Note that other calculations can yield ±∞ (e.g.x/0) or ±0 (e.g. x/∞).

There are many NaNs, though the significance of the different values is not standardized. To all intentsand purposes, the user can regard all NaNs as identical, although system-dependent information may becontained in a particular NaN value†. The table below shows how the various categories of number arestored.

Exponent Fraction Represents

e = emin − 1 f = 0 ±0 zero

e = emin − 1 f 6= 0 ±0.f × 2emin denormal numbers

emin ≤ e ≤ emax any f ±1.f × 2e normalized numbers

e = emax + 1 f = 0 ±∞ infinities

e = emax + 1 f 6= 0 NaN Not a Number

Note that as a special exponent is used to denote denormal numbers, the ‘hidden bit’ device can be used forthese as well, except that the hidden bit is always 0 in this case. The following is a list of operations thatproduce a NaN :

any operation involving a NaN ∞/∞

∞ + (−∞) x REM 0

0 ∗∞ ∞ REM x

0/0√

x for x < 0

A manufacturer could choose to use the significand of a NaN to store system specific information. If sothen the result of any operation between a NaN and a non-NaN must be the value of the NaN ; the resultof an operation between two NaNs must be the value of one of them.

An arithmetic operation involving ±∞ typically produces a NaN or ±∞, where the sign of ∞ is determinedby the usual rules of arithmetic. An exception is that ±x/±∞ is defined to be ±0 for any ordinary numberx. This has advantages and disadvantages. For example, consider the evaluation of f(x) = x/(x 2 + 1). Ifx is so large that x2 evaluates to ∞ then f(x) evaluates to 0 instead of the (representable) approximation1/x. This can be turned to advantage by evaluating f(x) = 1/(x + 1/x) instead; not only does the aboveproblem go away, but f(0) is correctly evaluated without having to treat it as a special case‡. The onlyreal disadvantage of ‘infinity arithmetic’ is that poorly constructed algorithms can produce wrong results,whereas they would produce error messages or raise exceptions on non-IEEE machines.

The fact that IEEE arithmetic has two representations of zero is controversial. One justification for it isthat 1/(1/x) evaluates to x for x = ±∞. Another is that two signed zeros are sometimes useful to refer tofunction values on either side of a discontinuity; this is particularly useful in complex arithmetic but is not

† It is believed that no software exists to exploit this potentially useful facility.‡ In general, reducing the number of tests for special cases improves efficiency on pipelined machines or when

optimizing compilers are used.

21

discussed here. A simple real arithmetic example is that log(+0) can be defined as −∞ and log(−0) as aNaN to distinguish between underflowed cases.

The IEEE Standard compromises by requiring that +0 and −0 are equal in all tests. In particular the test‘if (x == 0) then . . .’ does not take the sign of x into account. However the sign of 0 is preserved inoperations to the extent that, say, (−3) ∗ (−0) evaluates to +0 and (+0)/(−3) evaluates to −0. The sign of0 raises some interesting questions discussed in 3.3.2.

The IEEE Standard includes denormal numbers mainly to ensure that the property

x == y if and only if x − y == 0

holds§ for all finite x. For this to be true, denormal numbers are required to represent x − y when its valuewould otherwise underflow to ±0. Also program code such as

if (x 6= y) then z = 1/(x − y)

could otherwise fail due to division by zero. Although limited to a few simple properties, this self-consistencyis an elegant feature of IEEE arithmetic†.

The IEEE Standard includes the concept of status flags which are set when various exceptions occur.Implementations are required to provide a means to read and write these from programs. Once set, astatus flag should remain so until explicitly cleared. One use of these is to distinguish between an ordinaryoverflow and a calculation that genuinely yields ∞, e.g. 1/0.

The IEEE Standard allows for trap handlers that handle exceptions, and read and write status flags. Howevertrap handlers, which are ‘procedures’, are merely recommended and not required by the Standard.

There are five status flags corresponding to the exceptions: overflow, underflow, division by zero, invalidoperation, and inexact. The first two are straightforward. A division by zero actually applies to any operationthat produces an ‘exact infinity’. An invalid operation is any operation that gives rise to a NaN when noneof its operands is a NaN . An inexact exception arises simply when an operation cannot be evaluated exactlyin floating-point, which is clearly a very common occurrence — it is widely ignored or poorly implemented.

It is recommended that the status flags be implemented by software for efficiency. For example, the inexact

exception will occur very frequently so it is more efficient for software to disable hardware testing for thisonce it has occurred. If the appropriate status flag is reset then testing for the inexact exception can bere-enabled.

There are various useful applications of trap handlers. For example, for computing

n∏

i=1

xi

where the product may overflow or underflow at an intermediate stage, even if the result is within therepresentable range. The evaluation of this product can be implemented by means of overflow and underflowtrap handlers. The overflow trap handler works by maintaining a count of overflows and wrapping theexponent of a product so that it remains within the representable range‡. The underflow handler does the

§ Goldberg notes that properties such as this, that are useful to make programs provable, tend to lead tobetter programming practice, even though proving large programs correct may be impractical.

† Unfortunately, some equally desirable properties do not hold. For example, it is not necessarily true that(x ∗ z)/(y ∗ z) evaluates to a good approximation to x/y, because the representation of (x ∗ z) or (y ∗ z) maybe denormal.

‡ Wrapping, in this sense, means using modular arithmetic to ensure that any exponent is representable. Ifcounts are kept of overflows and underflows then the true exponent can be recovered.

22

converse. At the end of the calculation the result is within the representable range if the number of overflowsis equal to the number of underflows; in this case the wrapping algorithm ensures that the final exponent iscorrect. The cost is almost nothing if no intermediate product overflows or underflows.

In order that this sort of algorithm can be easily implemented, IEEE 754 specifies that the overflow andunderflow handlers must be passed the offending value, as an argument, in ‘wrapped-around’ form.

By default, IEEE arithmetic rounds to the nearest number using ‘round to even’. Three other alternativesare provided: round towards 0, round towards −∞, and round towards +∞. A combination of the lattertwo enables ‘interval arithmetic’ to be performed.

The rationale for ‘double extended’ precision is that a floating-point algorithm often requires some extraprecision for certain operations; however modern computer instruction sets tend not to provide this facility.The multiplication of two numbers to produce a result of greater precision is an operation that is particularlyuseful. Specifically, for calculating (1) b2 − 4ac in the quadratic formula, (2) inner products, and (3) thecorrection to an approximation in certain iterative algorithms.

Exercise 1f

What are the results of the following operations?

(−1)/(−∞) (−0) ∗∞ ∞ ∗ NaN

(Give signed results where appropriate.)

Exercise 1g

Suppose the IEEE Standard were changed to permit a binary implementation with only 5 bits torepresent each number: a sign bit, 2 bits for the exponent, and 2 bits for the precision. Assumingall other features of the Standard are unchanged, including the use of a hidden bit, enumerate all 32possible bit patterns and what they would represent.

2. Elementary numerical methods

2.1 Numerical differentiation

Numerical differentiation is an ill conditioned problem. Nevertheless, it is important because many numericaltechniques require approximations to derivatives. It is a good point to start because it introduces in a simpleway the ideas of discretization error§ (error due to approximating a continuous function by a discreteapproximation) and rounding error (error due to floating-point representation).

2.1.1 Finite differences

The usual definition of the derivative f ′(x) is

f ′(x) = limh→0

f(x + h) − f(x)

h

so an obvious approximation to f ′(x) is achieved by choosing a small quantity h and evaluating

f ′(x) =f(x + h) − f(x)

h. (2.1)

We say that f ′(x) is a finite difference approximation to f ′(x). The only problem here is: how small shouldh be? This turns out to be critical.

§ Discretization error is often called truncation error, but this term is somewhat confusing.

23

Supposing that f(x) can be differentiated at least three times, Taylor’s theorem tells us that

f(x + h) = f(x) + h.f ′(x) +h2

2!.f ′′(x) + O(h3).

This can be rearranged to give

f ′(x) =f(x + h) − f(x)

h− h

2.f ′′(x) + O(h2) (2.2)

so that, comparing (2.1) and (2.2), the discretization error is approximately h2 |f ′′(x)| in absolute value.

Now assume that f(x) can be evaluated with a relative error of approximately macheps. Writing [f(x)] ∗ forthe floating-point representation of f(x), the calculation involves the evaluation of

[f(x + h)]∗ = f(x + h) + εx+h

[f(x)]∗ = f(x) + εx

from which, using (2.1), we get

[f ′(x)]∗ = f ′(x) +εx+h − εx

h

= f ′(x) +A

h

where A is very approximately the absolute error in the representation of f(x), i.e. macheps.|f(x)|. Therounding error is therefore approximately macheps.|f(x)|/h.

Note that, as h decreases, the discretization error decreases but the rounding error increases. To find thevalue of h that minimizes the total absolute error, we differentiate the right-hand side of the equation

total absolute error =h

2|f ′′(x)| +

macheps.|f(x)|h

with respect to h, and solve1

2|f ′′(x)| − macheps.|f(x)|

h2= 0. (2.3)

It turns out that this is achieved precisely at the value of h for which|discretization error| = |rounding error|. However, equation (2.3) involves quantities that are not known.For example, we cannot know f ′′(x) since we are trying to calculate f ′(x) in the first place. We are forcedto make an approximation, so we assume that f(x) and f ′′(x) are numbers of order 1 (we sometime writeO(1)) so we can ignore them in the equation. Then, we might as well also ignore constants, so that equation(2.3) becomes

h2 ' macheps

orh '

√

macheps.

This is justifiable because it is the order of magnitude of h that is critical, not its exact value. It followsthat the total absolute error in the computed derivative is O(

√macheps). However, as we assumed that

f(x) = O(1), it would be more realistic to use h =√

macheps.|f(x)| in practice in order that the relative

error is O(√

macheps), under appropriate assumptions.

Exercise 2a

List the assumptions made in the analysis in Section 2.1.1. Construct an example for which at leastone of these assumptions is unreasonable, and demonstrate why. Why are these assumptions made?

24

2.1.2 Second derivatives

It is, of course, possible to approximate f ′′(x) by taking a finite difference of values of f ′(x) but, again, careneeds to be taken in choice of the value of h.

Suppose, for simplicity, that f(x) = O(1) and we use h =√

macheps to compute values of f ′(x). Therelative error in f ′(x) is approximately h. As an example, take macheps = 10−16. We would use h = 10−8

and expect a relative error of 10−8 in f ′(x).

In calculating f ′′(x) we must now regard 10−8 as if it were the rounding error, so we should use h = 10−4

as the optimum value of h, giving an approximate relative error of 10−4 in f ′′(x).

Fortunately, few numerical methods require more than two derivatives. Note that, if f ′(x) is known tofull floating-point precision, then f ′′(x) can be calculated with a relative error of about 10−8; this is ofimportance in designing software interfaces for certain numerical problems.

2.1.3 Differentiation of inexact data

Because numerical differentiation is ill-conditioned, the problem is made much worse if the function f(x) isknown only at a finite number of points or if the function values are known only approximately. A littleconsideration of the above analysis shows that there is no point in attempting to calculate derivatives fromthe data as it stands. The only sensible course of action is to fit a smooth curve to the data by some means,and to calculate derivatives of the fitted curve, as indicated in the diagram.

It is not possible to make precise statements about the accuracy of the resulting derivatives, but the techniqueis useful in some circumstances.

A method often used in practice is to perform a least squares fit with a cubic spline. This is one of manyapplications of cubic splines, which are described in the next section. Least squares fitting is discussed inSection 2.3.5.

2.2 Splines

We begin with the problem of interpolation, that is, to find a curve of a specified form that passes througha given set of data points.

The simplest case is to find a straight line through the pair of points (x 1, y1), (x2, y2). The equation of the

25

straight line is

y = a1x + a0

and the problem consists of solving 2 equations (one for each data point) in the 2 unknowns a 0, a1. We couldsay that the problem has 2 degrees of freedom.

A related problem is to find a straight line that passes through one data point (x 1, y1), with specified slopem, say. This problem still has 2 degrees of freedom because there are two items, the data point and theslope, that can be varied independently and there are still 2 equations in 2 unknowns.

In fitting a quadratic curve there are 3 degrees of freedom. For a cubic curve

y = a3x3 + a2x

2 + a1x + a0 (2.4)

there are 4 degrees of freedom, etc. With 20 points we could, in principle, fit a polynomial of degree 19 butthe results might not be as we would wish, e.g.

Low order polynomials do not have this disadvantage. Suppose we decide to interpolate data by fitting twocubic polynomials, p1(x) and p2(x), to different parts of the data separated at the point x = t. We have 8degrees of freedom and so can fit 8 data points, e.g.

t

2p (x)1p (x)

26

This is worse than using a high order polynomial. What is required is some degree of continuity, e.g. in thecurve

12

t

p (x)p (x)

if we require that p1(t) = p2(t) then the curve is at least continuous. But this requirement uses up one ofour equations, so we now have only 7 degrees of freedom. We can place further constraints as follows:

p′1(t) = p′2(t)

p′′1(t) = p′′2 (t)

with an extra degree of freedom taken up by each. This gives a curve with smooth continuity but only 5degrees of freedom. (If we go a step further and require that p′′′

1 (t) = p′′′2 (t) then the two cubics becomeidentical and it is equivalent to fitting (2.4).) The point t is called a knot point. It is possible to specify msuch knots and to fit a curve consisting of m + 1 separate cubics with second derivative continuity. This hasm + 4 degrees of freedom.

A curve made up of cubic polynomials with continuity of the function and first and second derivatives iscalled a cubic spline. Numerical methods often require at least this degree of continuity and a cubic splineis, in some sense, the simplest function that satisfies this requirement.

For the purposes of numerical differentiation of inexact data, fitting a cubic spline and differentiating theresult is probably the best that can be achieved, bearing in mind that there is an infinite choice of knotpositions.

2.3 Simultaneous linear equations

In this section we consider the solution of simultaneous linear equations of the form

Ax = b (2.5)

where A is a given matrix of coefficients, b is a given vector, and the vector x is to be determined. We shallconsider here only the case where A is square and at least one element of b is non-zero. In this case theequations have a unique solution if and only if A is a non-singular matrix†. Mathematically, the solution isgiven by

x = A−1b.

The first point to note is that there is no need to calculate the inverse A−1 explicitly because the vectorA−1b can be calculated directly. Also, calculating A−1 and then multiplying by b leads to unnecessary lossof accuracy. The calculation of a matrix inverse, which is discussed briefly below, is usually avoided unlessthe elements of the inverse itself are required for some purpose, e.g. in some statistical analyses.

† Note that if A is singular this means that it has no inverse. This is equivalent to saying that A has zerodeterminant or that A has a zero eigenvalue.

27

The solution of the equations (2.5) is trivial if the matrix A has either lower triangular form

a11

a21 a22

a31 a32 a33

. . . . . . . . . . . .an1 an2 an3 . . . ann

or upper triangular form

a11 a12 a13 . . . a1n

a22 a23 . . . a2n

a33 . . . a3n

. . . . . .ann

where missing coefficients are zero. Note also that if (and only if) any diagonal coefficient a ii is zero thenthe matrix A is singular and there is no unique solution.

To see that the solution is trivial to obtain, consider the upper triangular form. In this case the last equationcontains only one unknown, xn, that can therefore be determined by

xn =bn

ann

.

Then, as xn is now known, the second last equation contains only the one unknown xn−1 which may thereforebe determined, and so on.

This so-called back substitution algorithm may be summarized as

xi :=bi −

∑n

j=i+1 aijxj

aii

for i = n, n − 1, . . . 1. (Note that, by convention, the summation is over zero terms when the limits do notmake sense.)

For a general set of equations Ax = b the solution x is unchanged if any of the following operations isperformed:

(1) Multiplication of an equation by a non-zero constant.

(2) Addition of one equation to another.

(3) Interchange of two equations.

The technique used for solving a general set of equations Ax = b is called Gaussian elimination and consistsof using a combination of the operations (1), (2) and (3) to convert the equations to a trivial case, byconvention the upper triangular form. There are a large number of different ways of achieving this. Thestrategy adopted is usually called the pivotal strategy, and we shall see that this is crucially important.

We consider two examples.

Example 5

This example is a well-conditioned problem that illustrates the need for a pivotal strategy. Considerthe 3 simultaneous equations

0 1 11 0 11 1 0

x1

x2

x3

=

124

28

which are to be converted to upper triangular form. Note that the first equation cannot form the firstrow of the upper triangle because its first coefficient is zero. Therefore the first step is to interchangethe first two equations, thus

1 0 10 1 11 1 0

x1

x2

x3

=

214

.

The next step is to subtract the (new) first equation from the third to get

1 0 10 1 10 1 −1

x1

x2

x3

=

212

.

The final step is to subtract the second equation from the third and the solution x1 = 52 , x2 = 3

2 ,x3 = − 1

2 results.

This is a fairly trivial exercise on a sheet of paper, but it indicates that a computer subroutine to deal withthe general case is less straightforward because several coefficients may be zero at any step. A successfulpivotal strategy must specify which of the operations (1), (2) and (3) to perform on which equations at eachstage of the calculation.

Example 6

This example shows that it is not only zero coefficients that cause trouble. Consider the pair ofequations

(

0.0003 1.5660.3454 −2.436

) (

x1

x2

)

=

(

1.5691.018

)

which have the exact solution x1 = 10, x2 = 1. However, if we perform the calculation on a computerwith macheps ' 0.5× 10−4, and multiply the first equation by 0.3454/0.0003 (' 1151) and subtractfrom the second, we get the solution x1 = 3.333, x2 = 1.001 because of loss of significance. The useof a small non-zero number as a pivotal value has led to an incorrect answer.

In fact, closer examination of the loss of significance shows that it is due to the fact that the coefficient0.0003 is small compared with 1.566.

Therefore a successful pivotal strategy requires some concept of the size of a coefficient relative to the othercoefficients in the same equation. This relative size is conveniently calculated by dividing each coefficient bythe l∞ norm of the corresponding row of the matrix. In other words, we scale each equation by dividing itby its largest coefficient (in absolute value). In Example 6, we would divide the first equation by 1.566 andthe second by 2.436. This provides a simple automatic way of checking whether a coefficient is (relatively)small.

2.3.1 Pivotal strategies

Suppose an n × n set of linear equations is being solved by Gaussian elimination. At the 1st step there aren possible equations and one is chosen as pivot. This eliminates one variable and leaves n − 1 equations atthe 2nd step. At the start of the kth step there are therefore n − k + 1 remaining equations from which toselect a pivot.

In partial pivoting these n − k + 1 equations are scaled, by dividing by the largest coefficient of each, thenthe pivotal equation is chosen as the one with the largest (scaled) coefficient of x k.

In total pivoting both the pivotal equation and pivotal variable are selected by choosing the largest (unscaled)coefficient of any of the remaining variables. Total pivoting may be better than partial pivoting, although itis more expensive and so is not often used.

29

Work in the 1940s, using forward error analysis, had suggested that the error in Gaussian elimination grew ata rate that would make the process unstable for large systems of equations. In the 1950s J.H. Wilkinson usedbackward error analysis to show that this predicted rate of error growth was merely the worst case, and notone that appears to occur in practical problems. Gaussian elimination is generally stable for well-conditioned

problems.

Exercise 2b

List the basic operations of Gaussian elimination, i.e. the three operations on the equations Ax = b

that leave the solution x unchanged. Illustrate how these operations are combined, using exactarithmetic on the matrix

0 1 32 −2 14 1 0

Show how the method is modified for floating-point arithmetic, by using partial pivoting on the matrix

4 1 20 10−4 31 10.25 4

(In each case, omit the back substitution phase of the calculation.)

2.3.2 Symmetric positive definite equations

Many practical numerical problems reduce ultimately to the solution of sets of simultaneous linear equations.For difficult problems, e.g. the solution of a partial differential equation in a complicated 2- or 3-dimensionalregion, the resulting linear equations may be huge, say 10000 simultaneous equations.

In general, such large systems may be arbitrarily ill-conditioned and the resulting matrix may appear to be

singular when floating point arithmetic is used. Worse still, the relative errors may become much greaterthan 1 so that incorrect solutions will be obtained. Indeed even a 2×2 system can lead to inaccurate answersif the matrix is ‘nearly’ singular.

However, an important class of problems is always well-conditioned, and Gaussian elimination gives reliableanswers for any size of matrix. These are problems in which the matrix is symmetric positive definite, andpivoting is not needed.

A symmetric matrix A is positive definite if it satisfies the inequality

xT Ax > 0

for any non-zero vector x. For example, if

A =

(

4 11 4

)

then

xT Ax = ( x1 x2 )

(

4 11 4

) (

x1

x2

)

= 4x21 + 2x1x2 + 4x2

2

= (x1 + x2)2 + 3(x2

1 + x22)

> 0

in all cases. It is not always easy to verify this property, but an important special case is where B is anyreal non-singular matrix and A = BT B, then

xT Ax = xT BT Bx = (Bx)T Bx > 0.

30

Fortunately, good algorithms for symmetric positive definite equations can detect when a matrix is not, infact, positive definite so it is not essential to check this independently before attempting a calculation.

Many important practical problems give rise to symmetric positive definite equations.

2.3.3 Choleski factorization

An instructive way of looking at Gaussian elimination is to consider it in terms of matrix algebra. This leadsto a useful algorithm for the solution of symmetric positive definite equations.

When solving Ax = b by Gaussian elimination the matrix A is transformed to an upper triangular matrixU by means of operations which include the permutation of rows and columns of the matrix.

A permutation matrix, call it P, is simply a matrix that permutes another matrix when it is multiplied byP; its elements are all either 0 or 1 in such a way that there is exactly one 1 in each row and column. Theunit matrix I is thus the (trivial) permutation matrix that does not change anything.

It may be shown that Gaussian elimination (with partial pivoting) is equivalent to factorizing the matrix A

such thatA = PLU

where L is a lower triangular matrix. The equations Ax = b may then be written PLUx = b which meansthat the resulting upper triangular system may be expressed as

Ux = L−1P−1b

where the matrices P and L are straightforward to invert.

Choleski factorization may be applied to symmetric positive definite matrices and differs from the PLU

factorization in three ways:

(1) Because symmetric positive definite equations are well conditioned no permutations are necessary soP = I, i.e. P can be omitted.

(2) By symmetry it is possible to arrange that U = LT so the factorization can be expressed in the formA = LLT . However, the diagonal elements of L involve taking square roots; this can be avoided.

(3) In order to avoid taking square roots the diagonal elements of L are defined to be 1 (and so do notneed to be stored) and the factorization used is

A = LDLT

where D is a diagonal matrix. This is the Choleski factorization.

An example of a Choleski factorization is:(

4 11 4

)

=

(

114 1

) (

4154

) (

1 141

)

.

The algorithm for solving symmetric positive definite equations is as follows: form the Choleski factorization,then solve

Ly = b

LT x = D−1y

2.3.4 Matrix inversion

Suppose that the two sets of equations Ax = b and Ay = c are to be solved, i.e. two sets of equations withthe same coefficient matrix. This problem could be written

A (x y ) = (b c )

31

which can be solved in one go by Gaussian elimination. Note that both the solution and the right-hand sideare now matrices rather than vectors. In the same way, one application of Gaussian elimination is sufficientto solve the matrix equation AX = C where X is an unknown matrix and C is a constant matrix of thesame shape as X.

In particular, if we take C to be the unit matrix I then the solution X will be A−1.

As stated above, it is not usually necessary or desirable to invert a matrix, but sometimes the elements ofthe inverse itself are required and Gaussian elimination can be used as described.

2.3.5 Linear least squares

Problems are often encountered where it is desirable to find an n-dimensional vector, call it x n, that mostclosely satisfies (in some sense) the set of m equations

Amnxn = bm (2.6)

where Amn is an m × n matrix and m > n. This is an overdetermined system and is usually solved in theleast squares sense. Define the residual vector

rm = Amnxn − bm.

Then the least squares solution of (2.6) is the vector xn that minimizes the sum of squares of the residuals,i.e.

rT r = (Ax − b)T (Ax − b).

The least squares solution may be shown to be the solution of the equations

AT Ax = AT b

where AT A is square and symmetric positive definite. These so-called normal equations may be solved byCholeski factorization. However, since the mid 1970s, this approach has been replaced by solving directlythe overdetermined system Ax = b by a method known as singular value decomposition‡.

2.4 Non-linear equations

Having dealt with linear equations, we will now look briefly at the solution of simultaneous non-linearequations. An example of a system of 2 equations with 2 unknowns is

x1 + sinx2 −8

7= 0

x1x2 −15π

28= 0

(2.7)

which has a solution x1 = 9/14, x2 = 5π/6, which is easily verified. However there are two other solutionswith x2 values close to 5π/2. This is an extra complication: non-linear equations do not, in general, have aunique solution. In fact, a set of non-linear equations can have any number of solutions; there may be nosolutions at all, or there may be infinitely many.

‡ Singular value decomposition is a stable method that can be used to solve almost any system of linearequations, yielding a unique solution. Approximate solutions can be obtained for singular matrices andnon-square matrices. This method is beyond the scope of this course – see Numerical Analysis II.

32

Exercise 2c

Show, graphically or otherwise, that there are exactly three pairs of solutions to the equations (2.7).

It is generally true in numerical analysis that problems without a unique solution are more difficult. Suchproblems must usually be solved iteratively, by using a starting approximation close (in some sense) to therequired solution. Two particular difficulties that arise are (a) convergence to the ‘wrong’ solution, and (b)failure to converge, sometimes because approximations oscillate between two or more solutions.

A set of n non-linear equations in n unknowns x1, x2, . . . xn may be written

f1(x1, x2, . . . xn) = 0

f2(x1, x2, . . . xn) = 0

. . .

fn(x1, x2, . . . xn) = 0

which can be abbreviated to f(x) = 0 where x is the vector of unknowns and f is the vector of functions f j .

2.4.1 Newton-Raphson iteration

We consider first the equation f(x) = 0 where f(x) is a smooth non-linear function of a scalar x. TheNewton-Raphson method is an iterative technique that is graphically equivalent to improving the estimatedsolution by moving along the gradient of the function f(x), evaluated at the point x.

x~

x+h

x

f(x)x

The formula may be derived from the Taylor series for f(x) from which we get

f(x + h) = f(x) + hf ′(x) + O(h2). (2.8)

If x is an approximate solution and x + h is the exact solution then f(x + h) = 0 and we can estimate thevalue of h from (2.8) to get

h = − f(x)

f ′(x)+ O(h2).

Writing x to represent an improved approximation we have

x = x − f(x)

f ′(x). (2.9)

33

The error in this approximation is O(h2) so it has a quadratic rate of convergence. In fact this rate ofconvergence is only obtained when the approximation is sufficiently close to the solution. Note that (2.9)could be written in either of the alternative forms

x = x − [f ′(x)]−1f(x)

f ′(x)(x − x) = −f(x).

As a simple example of the use of (2.9), consider the calculation of√

2. This is equivalent to solving theequation f(x) = 0 where f(x) = x2 − 2. In this case f ′(x) = 2x and (2.9) becomes

x = x − x2 − 2

2x

=x2 + 2

2x

=x

2+

1

x.

Using x = 1 as a starting approximation, successive values of x (obtained on a BBC micro) are 1.5,1.41666667, 1.41421569, 1.41421356. (Squaring the last approximation gave 2 to the accuracy of the floatingpoint arithmetic.)

Exercise 2d

The Newton-Raphson formula (2.9) was compared with

x = x − hf(x)

f(x + h) − f(x)(2.10)

for solution of the equation x2 − 5 = 0. Using a particular value of h, the following numerical resultswere obtained for the first three iterations of each method using the starting value x = 2 in each case:

iteration method(2.9) method(2.10)

0 2.000000000 2.0000000001 2.250000000 2.2359882012 2.236111111 2.2360639563 2.236067978 2.236067775

For the first two iterations method (2.10) was more accurate, but method (2.9) was more accurate onthe third iteration. Which method would you expect to converge faster, and why?

Suppose method (2.10) is used on a computer for which macheps is about 10−24. Discuss a suitablechoice of the parameter h.

2.4.2 Generalization of Newton-Raphson

Newton-Raphson iteration can be generalized for solving n non-linear equations in n unknowns, i.e. forsolving the vector equation f(x) = 0. The derivative f ′(x) generalizes to a matrix of partial derivatives, the(i, j)th element of which is

∂fi(x)

∂xj

for i = 1, 2, . . . n and j = 1, 2, . . . n. This matrix of derivatives is called the Jacobian J. As division isnot defined for matrices, we must use one of the alternative forms of (2.9) to express the Newton-Raphsonformula, i.e.

x = x − J−1f

34

or, in order to express this as the solution of a set of linear equations,

J(x − x) = −f . (2.11)

Note that the solution x − x of (2.11) is the vector of increments to be added to x to form the newapproximation.

The rate of convergence of the generalized Newton iteration can also be shown to be quadratic if theapproximation is sufficiently accurate but, unfortunately, multi-dimensional problems are more difficult andit is much harder to ensure convergence. Algorithms for non-linear equations usually involve Newton-Raphsoniteration, often as a final step, but more strategy and care is needed than in the scalar problem.

Nevertheless, to illustrate that the formula (2.11) is of some use, we solve the pair of equations (2.7). Herethe Jacobian is

(

1 cosx2

x2 x1

)

.

The starting approximations x1 = 1, x2 = 3 yield successive approximations

x1 = 0.672016558 x2 = 2.66694639

x1 = 0.643149516 x2 = 2.61895767

x1 = 0.642857175 x2 = 2.61799418

x1 = 0.642857143 x2 = 2.61799388

where the last pair of values is equal to 9/14, 5π/6 to the accuracy of the arithmetic. The starting valuesx1 = 1, x2 = 7 lead to convergence to x1 = 0.226117142, x2 = 7.4430273, and starting values x1 = 1,x2 = 8 lead to convergence to x1 = 0.205031727, x2 = 8.20846651 corresponding to the other two solutions.However, it is all too easy to find starting values which do not lead to convergence. For example, takinga starting value in between the latter two solutions yields particularly misleading results. If x 1 = 0.2,x2 = 5π/2 then successive convergents ‘settle down’ in the region of x1 = 0.008, x2 = 20 although onlythe crudest of error tests would be deceived into falsely detecting convergence. There is no solution of theequations near these values, although a little graphical consideration will show why the algorithm tries tofind one.

Exercise 2e

Investigate this problem by writing a short program, using any language with a floating-point datatype, to find starting values that lead to convergence and otherwise.

It should be noted that it is often cheaper (in terms of programmer time or computation time) to usenumerical derivatives to estimate the Jacobian matrix J in (2.11) but the resulting loss of accuracy mayitself cause problems.

2.4.3 Ill-conditioned problems

We now return to the scalar non-linear equation f(x) = 0 to investigate cases in which the problem isill-conditioned; these are often cases in which Newton-Raphson iteration becomes unstable. One such caseis where f ′(x) = 0 at the solution of f(x) = 0 such that a Newton-Raphson step performed at the solution

35

would result in a 0/0 calculation. A typical example is shown below:

x

f(x)

Note that if the x-axis were shifted downwards there would be no solution, and if shifted upwards therewould be two solutions. An ingenious way of overcoming the difficulty is to solve a sequence of (different)problems which successively get closer to the original problem. This sequence of problems is chosen suchthat each is better conditioned than the original problem. Techniques of the type are known as continuation

methods.

In order to solve a difficult case of the problem f(x) = 0, with a starting approximation x 0, we first solve

f(x) = λf(x0) (2.12)

for some parameter 0 < λ < 1. Note that the original problem corresponds to λ = 0 and that the problemλ = 1 is guaranteed to have a solution (x = x0). By taking a succession of decreasing values of λ and using asuitable robust algorithm (not Newton-Raphson) the solution is obtained to a sequence of problems that getnearer to the original problem. Usually, as λ approaches zero, these solutions tend to the required solution.

Note that an algorithm may be described as robust if it can be guaranteed to be successful for any validdata. This is often achieved at the expense of time.

2.5 Quadrature

2.5.1 Quadrature rules

Quadrature is the name given to the numerical evaluation of an integral of the form

I =

∫ b

a

f(x)dx

where either (or both) of a or b may possibly be infinite. An approximation of the form

∫ b

a

f(x)dx 'n

∑

i=0

wif(xi)

where the points xi are chosen such that a ≤ x0 < x1 < . . . xn ≤ b, is called a quadrature rule. The pointsxi are called the abscissae and the numbers wi are called weights.

Quadrature rules are usually derived by the integration of an interpolating polynomial. Normally, to avoidloss of significance, only rules with positive weights are used.

We define the error in a quadrature rule by

en = I −n

∑

i=0

wif(xi)

36

and it is possible to obtain an expression for this from the corresponding interpolating polynomial §.

We will now examine a few simple quadrature rules to see how their error terms compare. We consider thecase where a and b are both finite. The simplest reasonable quadrature rule is obtained by taking the valueof f(x) at the mid-point of the interval (a, b) and treating the function as if it were a constant:

���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

a ba+b 2

Not surprisingly, this is called the mid-point rule. The formula, including the error term, is

I = (b − a)f(a + b

2) +

f ′′(ξ)(b − a)3

24(2.13)

for some ξ where a < ξ < b. Alternatively we can fit a straight line between the function values at theend-points of the interval:

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

��

���

a b

This is the trapezium rule with the formula

I =b − a

2[f(a) + f(b)] − f ′′(η)(b − a)3

12(2.14)

for some η where a < η < b. A spectacular improvement is obtained by a fitting a quadratic curve through

§ See Conte & de Boor.

37

the mid-point and the end-points:

����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

2a+b ba

������

���������

���������

������������

�

!!!!!!!!!

"""""""""

This leads to Simpson’s rule which is

I =b − a

6[f(a) + 4f(

a + b

2) + f(b)] − f iv(ζ)( b−a

2 )5

90(2.15)

for some ζ where a < ζ < b.

This idea can be extended by, say, interpolating four equally-spaced points with a cubic polynomial.Quadrature rules derived from equally-spaced points are called Newton-Cotes formulae. High order Newton-Cotes rules are rarely used for two reasons: (i) in high order rules some weights are negative, which leads toinstability, and (ii) methods based on high order polynomials usually have the same disadvantages as highorder polynomial interpolation.

A more promising approach than increasing the order of the interpolating polynomial is to use a composite

rule. A composite rule is formed by splitting an integral into a set of panels and applying (usually) the samequadrature rule in each panel and summing the results. For example the composite mid-point and compositetrapezium rules:

#�#$%�%& '�'(�()�)*�*+,-. /012 3456 7�78�89�9:�: ;�;;�;<�<<�<=�==�=>�>>�>?@

ba

ABCD EF

b

G�GG�GH�HH�HI�II�IJ�JJ�JKLMNOOPPQQRR

STUVWXYZ [�[\�\]�]^�^_�__�_`�``�`a�aa�ab�bb�b ccd

d

a

eeefffgh

iijj

Let there be n panels. Writing h for the width of each panel, these lead to the similar formulae

I ' h

n∑

i=1

f(a + {i− 1

2}h)

38

for the composite mid-point rule, and

I ' h

n∑′′

i=0

f(a + ih)

for the composite trapezium rule, where the double prime on the summation indicates that the first and lastterms are each halved. The fact that these formulae are similar helps to explain why the error terms forthese rules are also similar.

Replacing b − a by h in (2.13) and (2.14) we see that these rules each have order of convergence h3 in anyone panel. But there are n panels, and n = O(h−1), so the order of convergence of the composite rule ish−1.h3 = h2.

The composite Simpson rule has the formula

I ' h

6[f(a) + f(b) + 2

n−1∑

i=1

f(a + ih) + 4n

∑

i=1

f(a + {i − 1

2}h)]

which, from (2.15), is of order h−1.h5 = h4.

In general, a single quadrature rule derived from an interpolation formula based on n abscissae is guaranteedto have order at least hn+1, although formulae based on specially advantageous choices of the abscissae havehigher orders of convergence. Note that both the mid-point rule and Simpson’s rule have one higher orderof convergence than expected; the trapezium rule has the minimum order. The maximum possible order ofconvergence for a polynomial rule based on n abscissae is h2n+1; such a rule is called a Gauss rule† and usesoptimally chosen abscissae. These abscissae turn out to be non-equally spaced points which, incidentally,never include the end-points of the interval; for odd values of n the mid-point is always one of the Gaussabscissae.

Because an arbitrary function f(x) is potentially expensive to calculate, the efficiency of a quadrature schemeis usually measured by the number of function evaluations required to estimate the integral to a specifiedaccuracy. In a composite rule it is therefore an advantage for the end-points to be abscissae, because afunction evaluation at an end-point of one panel is used again in the next panel. This also explains whythere is so little difference between the composite mid-point and composite trapezium rules in practice.

An interesting compromise between efficiency and high order is achieved by a Lobatto rule. This is ann-point quadrature rule in which the end-points are included and the remaining n − 2 abscissae are chosento achieve the maximum possible order of convergence (which then turns out to be h 2n−1). The importanceof Simpson’s rule is explained by the fact that it is the simplest Lobatto rule.

2.5.2 Summation of series

This section gives a simple example of the use of numerical analysis to devise and tune an algorithm forsolving a problem that is easy to pose, but difficult to solve by straightforward means. The method usesthe mid-point formula (2.13) in order to sum a series by evaluating an integral: this may be thought of asquadrature in reverse‡. In order to simplify (2.13) a little, first consider the integral of some function f(x)over an interval of unit length. Writing

In =

∫ n+ 12

n− 12

f(x)dx

we see that (2.13) simplifies to

In = f(n) +f ′′(θn)

24

† The derivation of Gauss rules is discussed in Conte & de Boor.‡ The method described is also related to the so-called Euler-Maclaurin summation formulae.

39

where θn is some (unknown) value in the interval (n − 12 , n + 1

2 ). Let en be the error term, such that

en = In − f(n) =f ′′(θn)

24.

In the case where f ′′(x) happens to be a positive decreasing function in the interval (n − 12 , n + 1

2 ), we cancertainly say that

|en| ≤f ′′(n − 1

2 )

24.

Using the notation

Sp,q =

q∑

n=p

f(n), (2.16)

we now consider the problem of evaluating the infinite series S1,∞ where, for sufficiently large values of x,f(x) is a positive decreasing function of x and a formula can be found for

∫

f(x)dx. Of particular interestare functions for which S1,∞ is not easy to evaluate by simple summation alone.

We start by devising an algorithm applicable to all suitable functions f(x), then use a particular exampleto see how the algorithm may be tuned for efficiency. Note that a sum of the form (2.16) is equivalent toapplication of the composite mid-point rule to an integral. Suppose that we start off the calculation of S 1,∞

by summing N terms then use the composite mid-point rule to approximate the remainder of the sum. Wehave

S1,∞ = S1,N + SN+1,∞

= S1,N +

∞∑

n=N+1

(In − en)

= S1,N +

∫

∞

N+ 12

f(x)dx − 1

24

∞∑

n=N+1

f ′′(θn)

S1,∞ = S1,N +

∫

∞

N+ 12

f(x)dx + EN (2.17)

where, if f ′′(x) is a positive decreasing function for x > N + 12 , the error estimate is such that

|EN | ≤ 1

24

∞∑

n=N+1

f ′′(n − 1

2)

' 1

24

∫

∞

N

f ′′(x)dx

|EN | ' −f ′(N)

24. (2.18)

Note that, given a target absolute error ε, if the integral in (2.17) can be evaluated then it is possible todetermine a value of N that should achieve this prescribed error given sufficient floating-point precision.

As a specific example of a well-known function that is hard to evaluate directly, consider the Riemann zeta

function:

ζ(k) =

∞∑

n=1

n−k

for any k > 1. Writing f(n) = n−k, the integral in (2.17) is

∫

∞

N+ 12

x−kdx =[ x1−k

1 − k

]

∞

N+ 12

=(N + 1

2 )−(k−1)

k − 1

40

and note that f ′′(x) is positive and decreases with x. From (2.18) we get that |EN | ' k24N−(k+1) and we

estimate the value of N necessary to achieve the target absolute error ε as

N =[ k

24ε

]1

k+1

. (2.19)

The formula to be used is then

ζ(k) =

N∑

n=1

n−k +(N + 1

2 )−(k−1)

k − 1+ EN (2.20)

where |EN | ' ε.

As a numerical example, suppose k = 2 and the target absolute error ε = 10−16. With straightforwardsummation, we can use the integral

∫

∞

N+ 12

f(x)dx to estimate how many terms are needed: this works out at

about 1016. Using (2.19) and (2.20) we can estimate the value

N =[ 2

24.10−16

]13 ' 105

for our algorithm. When k = 2, if a computer takes, say, about 0.1 second for this calculation then thestraightforward sum would take over 300 years.

Finally, it is worth noting that the integral in (2.17) can itself be estimated by quadrature, which allows themethod to be used even when a formula for

∫

f(x)dx is not available. However, the quadrature rule will addto the computational cost of the algorithm and will complicate the error formula.

Exercise 2f

Implement the algorithm described in Section 2.5.2 in any language with a floating-point data type.Use the following known values to test its accuracy:

ζ(2) = π2

6

ζ(4) = π4

90

(If your floating-point implementation has poor numerical properties then you may need to computethe sum carefully to achieve the expected accuracy – recall Exercise 1a.)

If a program timer is available compute ζ(2), say 10 times, and hence estimate how long it would taketo compute the value once, to similar accuracy, using a straightforward sum.

3. Numerical software

3.1 State of the art

For mainly historical reasons, most software for numerical methods is found in subroutine libraries designedto be used from conventional programming languages. This is in contrast to, say, statistical software whichis often provided in the form of a free-standing package. There are a number of special-purpose libraries forsingle application areas, such as linear equations or optimization. Some research organizations have producedgeneral-purpose libraries but these are often restricted to a particular range of computers, or have gaps inareas that are not of interest at the originating site. There are only two portable general purpose librariescurrently in wide use on a world scale. These are the libraries of NAG (Numerical Algorithms Group), basedin Oxford, and Visual Numerics (formerly IMSL), based in the United States.

41

3.2 Languages

It is not possible to proceed very far with this subject before discussing the vexed question of the languages inwhich numerical software has been implemented. The significant languages used up to now have been: Algol(both Algol 60 and Algol 68), Fortran (all dialects), Pascal, Ada, C and C++. By far the most commonly usedlanguage for numerical software has been Fortran, for largely historical reasons. Fortran 90 was standardizedby ISO in 1991 and NAG announced the first compiler to coincide with this standardization†. Whereas theearlier Fortran 77 had few redeeming features as a programming language, it did have standardization andportability in its favour. Also, in view of the substantial body of well-tested software that already exists inFortran, it still has a pre-eminent position. In the 1990s NAG released libraries written in Fortran 90 andC.

In the 1980s, it was predicted that Algol 68, Pascal or Ada would become the future language for numericalsoftware, but none of these forecasts has come true. Algol 68 has virtually died out, and never was much usedin North America. The difficulty with standard Pascal is that 2-dimensional arrays, crucial to the handlingof matrices in numerical subroutines, are not implemented in a suitable way. There was heavy investmentin Ada by the N.A.T.O. countries for military purposes, but the end of the ‘Cold War’ resulted in a rapiddecline in interest in this language for largely non-technical reasons.

It should be noted that some programming languages have features that make them very difficult to use forsensitive floating-point computations. For example, some versions of Basic perform rounding on values closeto integers for display purposes.

It is to be hoped that good numerical libraries will become widely available from better programminglanguages in the future, although this is unfortunately not the case at present. The current trend is toprovide good facilities for mixed language programming so that C or Fortran can be called safely fromhigher level languages. A possible future trend might be towards packaged numerical software.

3.3 Floating-point arithmetic: software considerations

3.3.1 The Brown model

We have already seen that one implementation-specific constant, macheps, is important in numericalalgorithms. In order to circumvent overflow or underflow in an algorithm it is also necessary to knowthe limits of the representable number range. Other constants may be significant in particular algorithms.

IEEE implementations are still far from universal. The task of writing implementation-independentsubroutines is considerably eased by the use of a simplified model of floating-point arithmetic. The Brownmodel‡, or some modification of it, is one that is commonly used. The model assumes that a floating-pointrepresentation can be described by only four parameters:

B - the base of the arithmetic

N - the number of digits (of base B) in the significand

EMIN - the minimum exponent

EMAX - the maximum exponent

then any floating-point number can be represented in the form

± 0.d1d2 . . . dN × BE

where d1d2 . . . dN are digits (of base B) and E is the exponent in the range EMIN ≤ E ≤ EMAX . A libraryneed only contain four implementation-dependent functions that supply the values of these parameters for

† The latest Standard is actually Fortran 95, standardized in 1996, but this is a relatively modest update.‡ Note that the Brown model (due to W.S. Brown, 1981) differs only slightly from the description in 1.2.2.

This model need not necessarily correspond to the way that the exponent and significand are actually stored.

42

the use of portable algorithms. In principle, other implementation-dependent constants can be derived fromthe parameters of the Brown model.

3.3.2 Implementation issues for IEEE arithmetic

IEEE arithmetic is more closely specified than Brown’s model, which was used to define floating-pointarithmetic in the language Ada. For example under IEEE it is possible to prove that (3.0/10.0) ∗ 10.0evaluates to 3.0; this is not possible under the axioms of Brown’s model.

Under IEEE 754, because it specifies the number of bits to be used, the use of exactly rounded operationsmeans that the results of calculations should be identical when a program is moved from one processor toanother, provided they both support the Standard. Furthermore it is possible, at least in principle, to provethat some floating-point algorithms produce answers correct to within prescribed tolerances§.

As one would expect, in practice there are implementations that give inaccurate results, but too muchaccuracy can also be a problem. For example, several processors have a so-called ‘fused MAC’ (MultiplyACcumulate) instruction that computes ±a ∗ b ± c exactly rounded; this cannot be achieved under IEEEarithmetic so violates the Standard! More simply, over-use of ‘double extended’ precision for single precisioncalculations can produce results that are too accurate to be Standard-conforming.

Another difficulty that is encountered in programming languages, e.g. some of the early implementations ofC (i.e. Kernighan & Ritchie C), is that the programmer’s use of parentheses is not honoured on the groundsof efficiency! In floating-point, (x + y) + z is not the same as x + (y + z) and the programmer’s parenthesesmay be inserted specifically to avoid wrong answers. This was corrected when C was first Standardized byANSI†: nobody cares how efficiently the wrong answer can be calculated. Many floating-point algorithmsfail if optimizing compilers assume ‘true real arithmetic’. If optimizing compilers get cleverer then worseproblems may occur — see Section 3.2.3 of Goldberg’s paper for an example.

Care must be taken when mixing precisions because some of the ‘invariance rules’ described above do notapply if precisions are mixed. For example, the number 3.0/7.0 is a recurring fraction in base 2 or 10, soit will have different representations in single and double precision. If a compiler evaluates everything indouble precision, and x is in single precision, then x = 3.0/7.0 may assign the correct value to x, but thetest ‘if (x == 3.0/7.0) then . . .’ may not work as expected.

Under default rounding, x − x is defined to evaluate to +0 for all finite x. Thus (+0) − (+0) = +0. Also−(+0) is defined to be −0, so −x should not be implemented as 0 − x as this would be wrong for x = +0.

There may be conflicts between IEEE and language Standards. For example, one ANSI language Standardstates that ‘any arithmetic operation whose result is not mathematically defined is prohibited’. This raisesproblems because ±∞ can be used as an ordinary value and may violate this language Standard, especiallyif the operation x/∞ gets rid of the infinite value.

What should the sign of −0.0 be in a programming language? It is easy to argue that it should be −1 orNaN . Standard Basic defines it as 0, whereas Standard Fortran defines it as +1.

§ N.B. Proving the correctness of a floating-point algorithm does not imply that it solves the correspondingmathematical problem correctly. Numerical analysis is still necessary to understand the properties andlimitations of the method on which the algorithm is based. For example, it would be very useful to provethe correctness of an adaptive integration procedure (see Section 3.4) but, for any such procedure, theremust exist functions for which it gives arbitrarily inaccurate answers. There is no contradiction here: thenumerical method is a finite approximation to an infinite process, so cannot in principle be made to workin all cases. Proving the algorithm correct amounts to showing that the (intrinsically flawed) numericalmethod has been implemented correctly. Such a proof would be at least reassuring. Nevertheless adaptiveintegration methods are useful in practice except only in pathological cases.

† The ISO C Standard is now far less clear on this point but, in practice, most modern C compilers respectparentheses for floating-point operations.

43

The only reliable way to test the sign of zero under IEEE arithmetic is to compute, say, 1.0/z where z = ±0.0and to test the sign of the result. Unfortunately this sets the division by zero status flag‡.

Some processors can run much slower in IEEE-conforming mode, where this is provided as an option, typicallybecause denormal numbers are expensive to handle.

The use of NaNs violates many of the desirable ‘invariance rules’. The test ‘if (x == x) then . . .’ yieldstrue unless x is a NaN in which case it is always false by definition. Also, the test ‘if (x ≤ y) then . . .’ isnot always the same as ‘if not(x > y) then . . .’ because NaNs are unordered by definition. This is anotherproblem for optimizing compilers.

Exception handling leads to problems with synchronicity on pipelined machines and with optimizingcompilers. Hardware and software solutions to these problems exist but, unfortunately, these are not generallyportable to other systems.

3.4 User interfaces: automatic quadrature

Suppose that

integrate(f, a, b, eta t, result, error)

is a subroutine that returns an estimate of I =∫ b

af(x)dx in result, given a target accuracy eta t to be used

in a mixed error test. The variable error is set non-zero if any kind of error condition is detected (such asdetecting that the integral I does not exist).

There is a significant difference between a subroutine that evaluates, say, a certain composite quadraturerule and a subroutine such as integrate that performs all necessary calculations and returns an estimateof the integral to a prescribed accuracy. The black box approach of the subroutine integrate is known asautomatic quadrature. The algorithm for such a routine not only needs a suitable quadrature rule or rules,but also a strategy for achieving the requested accuracy and a system of bookkeeping to keep track of theprogress of the calculation. In particular, if function evaluations are to be minimized, it is necessary toensure that the function f(x) is not evaluated at the same point more than once, and also that functionvalues are stored if there is a possibility that they may be required more than once. Needless to say, thework done in bookkeeping should be negligible compared with the function evaluations if the algorithm isto be efficient.

Simpson’s rule, or the composite form of it, can be used as the basis of more or less sophisticated automatic

quadrature schemes. Two such schemes will be described for the integral∫ b

af(x)dx.

3.4.1 Simple automatic Simpson quadrature

A simple strategy is to use the composite Simpson rule with different values of h in such a way that functionevaluations can be re-used as much as possible. This can be done by evaluating the rule with an initial valueof h then re-evaluating using h/2, h/4, h/8, . . . until the process converges to the required accuracy.

‡ The IEEE Standard recommends an optional function called CopySign to test the sign of zero, but this israrely implemented.

44

Consider dividing the integral into first 1 and then 2 panels: this is the simplest case but illustrates theprinciple involved.

/2a+ h 3/4 ba a+ h1

klmn opqr stuvwxwyxyzxz{x{|}~�

1/4a+ h

Writing h = b − a, the first application of Simpson’s rule uses f(a), f(a + 12h) and f(b). The second

application re-uses these 3 values and also needs f(a + 14h) and f(a + 3

4h). The relevant formulae are

S1 =h

6[f(a) + 4f(a +

1

2h) + f(b)]

S2 =h

12[f(a) + 4f(a +

1

4h) + 2f(a +

1

2h) + 4f(a +

3

4h) + f(b)]

with errors given by

I − S1 = −f iv(ζ1)(h2 )5

90

I − S2 = −2f iv(ζ2)(h4 )5

90.

If we make the assumption that f iv(ζ1) ' f iv(ζ2) then, by subtracting these error estimates and re-arranging,we get

f iv(ζ2)(h2 )5

90=

24

1 − 24(S2 − S1)

from which we get the estimate

I − S2 =S2 − S1

15. (3.1)

In other words, the difference between the two approximations is about 15 times as big as the absolute errorin S2. The extra work needed to provide this useful error estimate is negligible.

It is clear that a successful and reasonably efficient automatic quadrature routine can be designed usingsuccessive halving of the interval h. However, consider how the above scheme would perform when integrating

45

the following function:

ba

In order to achieve greater efficiency, particularly for difficult integrals, it is better to allow the strategy todepend on the function f(x) itself. Such an automatic scheme is said to be adaptive.

3.4.2 Adaptive Simpson quadrature

In this scheme the interval (a, b) is first split up into two equal subintervals. Only the two rules S 1 and S2

are used but, if the estimated contribution to the error in each subinterval is too large then that subintervalis further sub-divided and the process repeated. The net effect is that the final sub-intervals are not all ofequal size, and these sizes depend on the behaviour of the integrand f(x) in that region.

The bookkeeping has to take account of all the nested subintervals and to ensure that the whole of (a, b)is covered. There is obviously more work involved in ensuring that function evaluations are re-used, butadaptive schemes are generally more efficient for difficult problems.

Exercise 3a

If implemented naively the adaptive Simpson method might fail. Why? Suggest a way of ensuringthat the algorithm terminates. Describe circumstances in which you would expect an adaptive schemeto be (a) more efficient, and (b) less efficient than a non-adaptive scheme.

3.5 BLAS: Basic Linear Algebra Subroutines

Finally, we will look briefly at a method currently used to enhance the portability of numerical librarysoftware.

Linear algebra is the name usually given to computations involving matrices, including the solution ofsimultaneous linear equations and eigenvalue problems. Since 1979 attempts have been made to definesubroutine interfaces for basic vector and matrix operations (e.g. to add two vectors, or to multiply a rowof a matrix by a vector) in order that more complex algorithms (e.g. Choleski factorization) can be codedin terms of these building blocks. These have been called by the acronym BLAS: Basic Linear AlgebraSubroutines. There are currently about 100 specified BLAS.

The idea has two significant advantages:

(1) New linear algebra algorithms can be implemented more easily.

(2) On special architectures, such as vector or array processors, algorithms using BLAS can be speededup by re-coding only the BLAS to take advantage of the architecture.

46

There are also two (relatively minor) disadvantages worth mentioning:

(1) In problems with a small matrix size, e.g. 2 or 3-dimensional graphics, the use of BLAS may mean thatsubroutine calls dominate the computation time. Consequently the use of BLAS may be relativelyinefficient. (This comment is obsolescent as processors become faster.)

(2) The detailed specification of BLAS is, to some extent, language-dependent and a de facto standardspecification needs to be established for each appropriate language.

In 1990 the idea of BLAS was extended by providing a portable library of higher level numerical linearalgebra routines, called LAPACK, specifically aimed at high-performance computers.

MRO’D

7.10.2005

47