user916968

Reputation: 315

Adding a bunch of floating-point numbers with JavaScript, what is the error bound on the sum?

When I add a bunch of floating-point numbers with JavaScript, what is the error bound on the sum? What error bound should be used to check if two sums are equal?

In a simple script, I add a bunch of floating-point numbers and compare sums. I notice that sometimes the result is not correct (two sums that should be equal are not). I am pretty weak at numerical analysis, but even after reviewing "Is floating point math broken?", "What Every Computer Scientist Should Know About Floating-Point Arithmetic", and "Comparing Floating Point Numbers, 2012 Edition", I am confused about how best to compare floating-point sums in JavaScript.

First, I was confused by this statement: "The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded" (as if they were computed exactly, then rounded to the nearest floating-point number). If JavaScript is based on the IEEE standard, how can 0.1 + 0.2 != 0.3?

I think I answered this for myself: it's easier for me to think about an example in base 10. If 1/3 is approximated as 0.333...333 and 2/3 is approximated as 0.666...667, then 1/3 + 1/3 = 0.666...666 is exactly rounded (it is the exact sum of the two approximations) but != 0.666...667. The operands of an exactly rounded operation are themselves already rounded, and its result is rounded again, so error can still creep into intermediate results.
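
For example, in JavaScript both roundings are visible: the operands are rounded when they are stored, and the sum is rounded again, landing on a different double than the one 0.3 rounds to.

    console.log(0.1 + 0.2);              // 0.30000000000000004
    console.log(0.1 + 0.2 === 0.3);      // false
    console.log((0.1).toPrecision(21));  // "0.100000000000000005551" -- 0.1 is already an approximation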

How big is machine epsilon? JavaScript numbers are apparently 64-bit IEEE doubles, and the machine epsilon of the double-precision format is apparently about 1e-16?
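
JavaScript does expose this directly as Number.EPSILON, which seems to confirm the value:

    console.log(Number.EPSILON);                      // 2.220446049250313e-16
    console.log(Number.EPSILON === Math.pow(2, -52)); // true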

When I add a bunch (n) of floating-point numbers (naive summation, without pairwise or Kahan summation), what is the error bound on the sum? Intuitively it is proportional to n. The worst-case example I can think of (again in base 10) is 2/3 - 1/3 - 1/3 + 2/3 - 1/3 - 1/3 + etc. I think each iteration will increment the error term by 1 ULP while the sum remains zero, so both the error term and relative error will grow without bound?

In the section "Errors in Summation" Goldberg is more precise (error term is bounded by n * machine epsilon * sum of the absolute values) but also points out that if the sum is being done in an IEEE double precision format, machine epsilon is about 1e-16, so n * machine epsilon will be much less than 1 for any reasonable value of n (n much less than 1e16). How can this error bound be used to check if two floating-point sums are equal? What relationship between the sums, 1, 1e-16, n, etc. must be true if they are equal?
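
To make the question concrete, the check I have in mind would look something like the sketch below, using Goldberg's bound as the tolerance; the helper name is made up, and whether this tolerance is actually valid is exactly what I am asking.

    // Sketch: compare two sums using Goldberg's bound n * eps * sum(|x_i|) as the
    // tolerance for each sum. sumsLookEqual is a made-up helper, not a standard API.
    function sumsLookEqual(xs, ys) {
      const sum    = (a) => a.reduce((s, v) => s + v, 0);
      const sumAbs = (a) => a.reduce((s, v) => s + Math.abs(v), 0);
      const tol = Number.EPSILON * (xs.length * sumAbs(xs) + ys.length * sumAbs(ys));
      return Math.abs(sum(xs) - sum(ys)) <= tol;
    }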

Another intuition: if the bunch of numbers are all positive (mine are), then although the error term can grow without bound, the relative error will not, because the sum must grow at the same time. In base 10, the worst-case example I can think of (in which the error term grows fastest while the sum grows slowest) is if 1.000...005 is approximated as 1.000...000. Repeatedly adding this number will increment the error term by 1/2 ULP (of the summand, 0.000...005) while incrementing the sum by 1 unit in the first place. The worst relative error is 4.5 ULP (0.000...045, when the sum is 9.000...000), which is (base - 1) / 2 ULP, which is 1/2 ULP in base 2?

If two floating-point sums are equal, then their absolute difference must be less than twice the error bound, which is 1 ULP in base 2? So in JavaScript, Math.abs(a - b) < a * 1e-16 + b * 1e-16?

Comparing Floating Point Numbers, 2012 Edition describes another technique for comparing floating-point numbers, also based on relative error. In JavaScript, is it possible to find the number of representable numbers between two floating-point numbers?
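
From what I can tell, the bit pattern of a double can be reinterpreted as a 64-bit integer with typed arrays, and after adjusting negative values so that integer order matches numeric order, the difference of the integers counts the representable doubles between two values. Something like the sketch below (the names are mine, and it assumes finite, non-NaN inputs); is this the right way to do it in JavaScript?

    // Sketch: count the representable doubles between a and b by reinterpreting
    // their bit patterns as 64-bit integers (assumes finite, non-NaN inputs).
    const f64 = new Float64Array(1);
    const i64 = new BigInt64Array(f64.buffer);

    function orderedBits(x) {
      f64[0] = x;
      const bits = i64[0];
      // Remap negative doubles so that integer order matches numeric order
      // (+0 and -0 both map to 0n).
      return bits >= 0n ? bits : -(2n ** 63n) - bits;
    }

    function ulpDistance(a, b) {
      const d = orderedBits(a) - orderedBits(b);
      return d < 0n ? -d : d;
    }

    console.log(ulpDistance(0.1 + 0.2, 0.3)); // 1n -- the two sums are adjacent doubles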

Upvotes: 3

Views: 1105

Answers (1)

Eric Postpischil

Reputation: 222900

The maximum possible error in the sum of n numbers added consecutively is proportional to n², not to n.

The key reason for this is that each addition may have some error proportional to its sum, and those sums keep growing as more additions are made. In the worst case, the sums grow in proportion to n (if you add n x’s together, you get nx). So, in the end, there are n sums that have grown in proportion to n, yielding a total possible error proportional to n².

JavaScript is specified by the ECMAScript Language Specification (ECMA-262), which says that IEEE-754 64-bit binary floating-point is used and round-to-nearest mode is used. I do not see any provision allowing extra precision, as some languages do.

Suppose all numbers have magnitude at most b, where b is some representable value. If your numbers have a distribution that can be characterized more specifically, then an error bound tighter than described below might be derived.

When the exact mathematical result of an operation is y, and there is no overflow, then the maximum error in IEEE-754 binary floating-point with round-to-nearest mode is 1/2 ULP(y), where ULP(y) is the distance between the two representable values just above and below y in magnitude (using y itself as the “above” value if it is exactly representable). This is the maximum error because y is always either exactly on the midpoint between two bordering values or is on one side or the other, so the distance from y to one of the bordering values is at most the distance from the midpoint to a bordering value.

(In IEEE-754 64-bit binary, the ULP of all numbers less than 2^-1022 in magnitude is 2^-1074. The ULP of all larger powers of two is 2^-52 times the number; e.g., 2^-52 for 1. The ULP for non-powers of two is the ULP of the largest power of two smaller than the number, e.g., 2^-52 for any number above 1 and below 2.)
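
For concreteness, the ULP function described above might be sketched in JavaScript as below, reading the exponent field from the bit pattern; the name ulp is only illustrative, and NaN and infinities are ignored. The later sketches reuse it.

    // Sketch of ULP(x) for finite doubles, following the rules in the previous
    // paragraph. NaN and Infinity are not handled.
    function ulp(x) {
      const view = new DataView(new ArrayBuffer(8));
      view.setFloat64(0, Math.abs(x));                 // stored big-endian by default
      const biasedExp = view.getUint16(0) >>> 4;       // the 11-bit exponent field
      if (biasedExp === 0) return Math.pow(2, -1074);  // zero and subnormal range
      return Math.pow(2, biasedExp - 1023 - 52);       // ULP of largest power of 2 <= |x|
    }

    console.log(ulp(1));    // 2.220446049250313e-16, i.e. 2^-52
    console.log(ulp(1.5));  // 2.220446049250313e-16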

When the first two numbers in a series are added, the exact result is at most 2b, so the error in this first addition is at most 1/2 ULP(2b). When the third number is added, the result is at most 3b, so the error in this addition is at most 1/2 ULP(3b). The total error so far is at most 1/2 (ULP(2b) + ULP(3b)).

At this point, the addition could round up, so the partial sum so far could be slightly more than 3b, and the next sum could be slightly more than 4b. If we want to compute a strict bound on the error, we could use an algorithm such as:

Let bound = 0.
For i = 2 to n:
    bound += 1/2 ULP(i*b + bound).

That is, for each of the additions that will be performed, add an error bound that is 1/2 the ULP of the largest conceivable result given the actual values added plus all the previous errors. (The pseudo-code above would need to be implemented with extended precision or with rounding upward in order to retain mathematical rigor.)

Thus, given only the number of numbers to be added and a bound on their magnitudes, we can pre-compute an error bound without knowing their specific values in advance. This error bound will grow in proportion to n².
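
As an illustration, that pre-computation might be sketched in JavaScript as follows, reusing the ulp() sketch from earlier and subject to the same caveat: computed in ordinary doubles without rounding upward, it only approximates a rigorous bound.

    // Sketch: pre-compute an error bound for adding n numbers, each of magnitude
    // at most b, one at a time. ulp() is the sketch given earlier. Without upward
    // rounding, this bound is itself only approximate.
    function naiveSumErrorBound(n, b) {
      let bound = 0;
      for (let i = 2; i <= n; i++) {
        bound += 0.5 * ulp(i * b + bound);
      }
      return bound;
    }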

If this potential error is too high, there are ways to reduce it:

  • Instead of adding numbers consecutively, they can be split in half, and the sums of the two halves can be added. Each of the halves can be recursively summed in this way. When this is done, the maximum magnitudes of the partial sums will be smaller, so the bounds on their errors will be smaller. E.g., with consecutive additions of 1, we have sums 2, 3, 4, 5, 6, 7, 8, but, with this splitting, we have parallel sums of 2, 2, 2, 2, then 4, 4, then 8.
  • We can sort the numbers and keep the sums smaller by adding numbers that cancel each other out (complementary positive and negative numbers) or adding smaller numbers first.
  • The Kahan summation algorithm can be employed to get some extended precision without much extra effort. (Sketches of the pairwise splitting and of Kahan summation follow this list.)
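
For concreteness, here are sketches of the pairwise splitting and of Kahan summation; the function names are only illustrative.

    // Pairwise (recursive) summation: split the array in half, sum each half,
    // and add the two partial sums. The partial sums stay smaller, so the ULPs
    // that bound each addition's error stay smaller too.
    function pairwiseSum(xs, lo = 0, hi = xs.length) {
      if (hi - lo === 0) return 0;
      if (hi - lo === 1) return xs[lo];
      const mid = lo + ((hi - lo) >> 1);
      return pairwiseSum(xs, lo, mid) + pairwiseSum(xs, mid, hi);
    }

    // Kahan (compensated) summation: carry the low-order bits lost by each
    // addition in a separate compensation term.
    function kahanSum(xs) {
      let sum = 0;
      let c = 0;                 // running compensation
      for (const x of xs) {
        const y = x - c;         // apply the previous correction
        const t = sum + y;       // low-order bits of y may be lost here
        c = (t - sum) - y;       // recover the lost bits
        sum = t;
      }
      return sum;
    }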

Considering one particular case:

Consider adding n non-negative numbers, producing a calculated sum s. Then the error in s is at most (n-1)/2 • ULP(s).

Proof: Each addition has error at most 1/2 ULP(x), where x is the calculated value of that addition. Since we are adding non-negative values, the accumulating sum never decreases, so it is never more than s, and its ULP is at most the ULP of s. So the n-1 additions produce n-1 errors, each at most ULP(s)/2, for a total of at most (n-1)/2 • ULP(s).
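
To connect this to the original question: if two calculated sums s1 and s2 each add n non-negative numbers and their exact sums are equal, the calculated sums can differ by at most the two error bounds combined. A sketch of that check, reusing the ulp() sketch above (the helper name is only illustrative):

    // Sketch: could two sums of n non-negative numbers be equal in exact arithmetic?
    // Each calculated sum is within (n-1)/2 * ULP(s) of its exact value, so equal
    // exact sums can differ, as calculated, by at most the two bounds added together.
    function sumsOfNonNegativesLookEqual(s1, s2, n) {
      const tol = ((n - 1) / 2) * (ulp(s1) + ulp(s2));
      return Math.abs(s1 - s2) <= tol;
    }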

Upvotes: 4
