Reputation: 1439
Is there a sub-O(n) way of replicating repeated summation of floating-point values in IEEE 754 arithmetic?
That is, is there a faster way of exactly replicating the result of the following pseudocode:
f64 notQuiteFMA(f64 acc, f64 f, bigint n) {
    for (bigint i = 0; i < n; i++) {
        acc = acc + f;
    }
    return acc;
}
?
In real arithmetic the answer is straightforward: return acc + f*n suffices. However, in finite precision the answer is far more complex. As one example of this, note that notQuiteFMA(0, 1, 1 << 256) is somewhere around 2^53, not the 2^256 that one would expect in infinite precision. (This example also shows why speeding this calculation up could be desirable: counting to 2^256 one addition at a time is rather impractical.)
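For illustration (my sketch, not part of the original question), the same saturation can be observed cheaply in single precision, where x + 1 == x first occurs at 2^24:

#include <stdio.h>

int main(void)
{
    // Adding 1 stops changing a float accumulator once it reaches 2^24:
    // there the real sum lands halfway between representable values and
    // ties round back down to even. The f64 loop saturates near 2^53 the
    // same way.
    float acc = 0;
    for (long i = 0; i < 20000000; i++)
        acc += 1.0f;
    printf("%.1f\n", acc); // prints 16777216.0, not 20000000.0
}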
x + 1 == x first occurs around 2^53, when the addition is insufficient to increment the mantissa and the result is rounded down. The exact value depends on the rounding mode, I believe.
Upvotes: 3
Views: 177
Reputation: 223693
We can implement an O(log n) solution by stepping through binades. This is enabled by the fact that all additions within a binade, possibly except the first, change the accumulating sum by the same amount, and therefore we can easily calculate exactly what sum would be produced by repeated additions until the end of the binade.
A binade is an interval [2^e, 2^(e+1)) or (−2^(e+1), −2^e] for some integer e. For purposes of this answer, the entire minimum exponent range (including the minimum-exponent normals and all the subnormals) will be considered a single binade.
For a floating-point number x (a number representable in the floating-point format), define E(x) to be the exponent used in the representation of x. Given the definition of binade used in this answer, E(x) identifies the binade x is in, except for its sign.
Expressions in code format represent floating-point arithmetic: plain x+y is the real-number-arithmetic result of adding x and y, while `x+y` is the floating-point result of adding x and y. The plain and code forms, x and `x`, are used for the same value as appropriate.
Floating-point addition is weakly monotonic for finite values: y < z implies `x+y` <= `x+z`, in any rounding mode, except when x is an infinity and y or z is an infinity of the opposite sign, resulting in a NaN.
Let A0, A1, and A2 be the values of A before, between, and after two executions of `A += f`. Consider the case when E(A0) ≥ E(A1) = E(A2). Let g be A2−A1. g is representable: by Sterbenz’ Lemma if A1 is normal, or, if A1 is subnormal, because A1, A2, and f are all subnormal (or zero) and there are no bits below the ULP of A2 to be lost in the addition. (Since g is representable, `A2-A1` calculates it with no rounding error.)
We will show that all subsequent additions of f that produce a value of A with E(A) = E(A1) increase A by the same amount, g.
Let u be the ULP of A1. Let r be the remainder of |f| modulo u. Let s be +1 if f is positive or −1 if f is negative. (The case when f is zero is trivial.) Using a directed rounding mode (toward −∞, toward zero, or toward +∞), the rounding error in any subsequent addition that does not change E(A) is either always −r or always u−r, according to the rounding method and the sign of f. Using round to nearest with ties to away, the rounding error is always −sr if r < ½u and always s(u−r) otherwise. With round to nearest with ties to even, if r ≠ ½u, the rounding error is always −sr or always s(u−r), as with ties to away.
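As a quick empirical check of the directed-rounding case (my sketch, with arbitrarily chosen A and f, assuming the implementation honors FENV_ACCESS), every addition below incurs the same rounding error:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    // With rounding toward -infinity and positive operands, each addition
    // of f that keeps A in the binade [16, 32) has rounding error -r.
    fesetround(FE_DOWNWARD);
    float A = 16.0f, f = 0.3f;
    for (int i = 0; i < 6; i++)
    {
        float before = A;
        A += f;
        // double holds the sum of two floats of these magnitudes exactly,
        // so this difference is the exact rounding error of the addition.
        double error = (double) A - ((double) before + (double) f);
        printf("error = %a\n", error);
    }
}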
That leaves round to nearest with ties to even with r = ½u. Let b be the bit of f in the u position. In this case, the low bit of A1 must be even, because it was the result of adding f, which has remainder modulo u of ½u, to A0 (which is necessarily 0 modulo u since E(A0) ≥ E(A1)) to produce A1. In that addition, the real-number result was exactly midway between two representable values, so it was rounded to an even low digit (a multiple of 2u).
Given the low bit of A1 is even, we have two cases:
- b is 0, so the addition of f has a real-number-arithmetic result that is 0u + ½u modulo 2u ≡ ½u, and it is rounded to an even low digit, so the rounding error is −½u if the result is positive or +½u if it is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).
- b is 1, so the addition of f has a real-number-arithmetic result that is 1u + ½u modulo 2u ≡ 1½u, and it is rounded to an even low digit, so the rounding error is +½u if the result is positive and −½u if the result is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).
(To see why we go to the second addition within a binade before asserting all subsequent additions add the same amount, observe the addition that produced A0 may have involved bits from a prior A
with a lower ULP, in which case the real-number-arithmetic result might not have been exactly midway between
representable values, so A0 could have an odd low digit even using round to nearest with ties to even, and A2−A1 would not equal A1−A0.)
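A similar sketch (again with arbitrary values of my own choosing, default round-to-nearest) shows the increments settling to the constant g the argument establishes:

#include <stdio.h>

int main(void)
{
    // Print the exact change produced by each addition. Once the sum is
    // inside the binade [16, 32), all additions after the first change A
    // by the same representable amount g.
    float A = 16.0f, f = 0.3f;
    for (int i = 0; i < 8; i++)
    {
        float before = A;
        A += f;
        // A - before is exact by Sterbenz' Lemma, since A and before are
        // within a factor of two of each other.
        printf("A = %a, change = %a\n", A, A - before);
    }
}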
Given the above, we can calculate how many additions of g, say t, can be performed until just before the next binade is reached. Then we can calculate the exact value of A just before exiting the current binade, A+tg. Then a single addition gives us the first A in that binade, and another addition either takes us into a new binade or gives us the information needed to perform the calculation for the current binade. Then we can repeat this for each binade until the desired number of additions is reached.
Below is code demonstrating the concept. This code needs some improvement, per comments. I expect to work on it when time permits, which may be more than a few weeks from now. The E function in the code takes the exponent from frexp, not clamping it to the minimum exponent as described in the discussion above. This may slow the code but should not make it incorrect. I expect the code handles subnormals. Infinities and NaNs could be added via ordinary tests.
Calculation of t, the number of additions that can be performed before exiting the current binade, could use more discussion about the precision required and how rounding might affect it.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
static float Sum0(float A, float f, int n)
{
    if (n < 0) { f = -f, n = -n; }
    while (n--) A += f;
    return A;
}
// Return the exponent of x, designating which binade it is in.
static int E(float x)
{
    int e;
    frexp(x, &e);
    return e;
}
static float Sum1(float A, float f, int n)
{
    if (n < 0) { f = -f, n = -n; }
    if (n == 0) return A;
    while (1)
    {
        // Record initial exponent.
        int e0 = E(A);
        A += f;
        if (--n == 0) return A;
        // If exponent changed, continue to do another step.
        if (E(A) != e0) continue;
        /* Note it is possible that A crossed zero here and is in the "same"
           binade with a different sign. With rounding modes toward zero or
           round to nearest with ties to away, the rounding caused by adding f
           may differ. However, if we have crossed zero, the next addition
           will put A in a new binade, and the loop will exit this iteration
           and start a new step.
        */
        float A0 = A;
        A += f;
        if (--n == 0) return A;
        // If exponent changed, continue to do another step.
        if (E(A) != e0) continue;
        // Calculate exact change from one addition.
        float g = A - A0;
        // If g is zero, no future additions will change the sum.
        if (g == 0) return A;
        /* Calculate how many additions can be done until just before the
           next binade.
           If A and f have the same sign, A is increasing in magnitude, and
           the next binade is at the next higher exponent. Otherwise, the
           next binade is at the bottom of the current binade.
           We use copysign to get the right sign for the target. .5 is used
           with ldexpf because C's specifications for frexp and ldexpf
           normalize the significand to [.5, 1) instead of IEEE-754's [1, 2).
        */
        int e1 = e0 + (signbit(A) == signbit(f));
        double t = floor((ldexpf(copysignf(.5, A), e1) - A) / (double) g);
        /* Temporary workaround to avoid incorrectly stepping to the edge of
           a lower binade:
        */
        if (t == 0) continue;
        --t;
        if (n <= t) return A + n*g;
        A += t*g;
        n -= t;
    }
}
static int Check(float A, float f, int n)
{
    float A0 = Sum0(A, f, n);
    float A1 = Sum1(A, f, n);
    return A0 != A1;
}

static void Test(float A, float f, int n)
{
    float A0 = Sum0(A, f, n);
    float A1 = Sum1(A, f, n);
    if (A0 != A1)
    {
        printf("Adding %a = %.99g to %a = %.99g %d times:\n", f, f, A, A, n);
        printf("\tSum0 -> %a = %.99g.\n", A0, A0);
        printf("\tSum1 -> %a = %.99g.\n", A1, A1);
        exit(EXIT_FAILURE);
    }
}
int main(void)
{
    srand(1);
    for (int i = 0; i < 1000000; ++i)
    {
        if (i % 10000 == 0) printf("Tested %d cases.\n", i);
        Test(rand() - RAND_MAX/2, rand() - RAND_MAX/2, rand() % (1<<20));
    }
}
Upvotes: 5
Reputation: 9512
This is a tough problem...
At least we know that up to n=5, the result is always n*f in round-to-nearest-even mode.
See for example Is 3*x+x always exact?
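A brute-force spot check of that claim (my own sketch, in C; a random sample, not a proof):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // For n <= 5 in round-to-nearest-even, summing f n times should equal
    // the correctly rounded product n*f.
    srand(1);
    for (int trial = 0; trial < 1000000; trial++)
    {
        double f = rand() / (double) RAND_MAX;
        for (int n = 1; n <= 5; n++)
        {
            double acc = 0;
            for (int i = 0; i < n; i++) acc += f;
            if (acc != n*f)
            {
                printf("mismatch: f=%a, n=%d\n", f, n);
                return 1;
            }
        }
    }
    printf("no mismatches in 1000000 trials\n");
    return 0;
}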
But then we gradually lose precision... At which pace?
I think that we can predict the case of overflow, so I will not bother with that. We can thus ignore the exponent part and focus on the significand.
The case of gradual underflow is also not very interesting, because addition will be exact until the biased exponent reaches 1.
So we can focus on significands in the (1,2) interval, 1 itself being uninteresting.
I tried the following Smalltalk snippet in order to get an idea of error bounds:
sample := Array new: 60000.
x := 1.0.
00001 to: 10000 do: [:i | sample at: i put: (x := x successor)].
x := 2.0.
10001 to: 20000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.5.
20001 to: 30000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.5.
30001 to: 40000 do: [:i | sample at: i put: (x := x successor)].
x := 1.125.
40001 to: 50000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.125.
50001 to: 60000 do: [:i | sample at: i put: (x := x successor)].
(3 to: 10) collect: [:i |
    | setOfErrors |
    setOfErrors := sample collect: [:f | ((1 to: 1<<i) inject: 0 into: [:acc :ignored | acc+f]) - (1<<i*f) / f ulp] as: Set.
    {i. setOfErrors size. setOfErrors max log2}].
1<<i is a left shift, so equivalent to 2^i.
So the error bounds for the case when n is a power of two, for the above significand sample, are:
#(#(3 5 4.0) #(4 9 6.0) #(5 17 8.0) #(6 33 10.0) #(7 65 12.0) #(8 129 14.0) #(9 257 16.0) #(10 513 18.0))
That is a maximum error of 2^(2*(i-1))*ulp(f) when n=2^i, with all the 2^(i-1)+1 possible combinations of k*2^i*ulp(f) errors, k=-2^(i-2)...2^(i-2).
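For readers who do not speak Smalltalk, here is a rough C rendition of the experiment (my sketch; it samples only 10000 successors of 1.0 rather than the full 60000-point sample, and uses fma to evaluate n*f - acc with a single rounding):

#include <math.h>
#include <stdio.h>

// ULP of a finite nonzero double x.
static double Ulp(double x)
{
    int e;
    frexp(x, &e);              // |x| = m * 2^e with m in [0.5, 1).
    return ldexp(1.0, e - 53); // binary64 has a 53-bit significand.
}

int main(void)
{
    // For n = 2^i, measure the worst error of repeated addition relative
    // to the exact product n*f, in ULPs of the final sum.
    for (int i = 3; i <= 10; i++)
    {
        long n = 1L << i;
        double worst = 0;
        double f = 1.0;
        for (int k = 0; k < 10000; k++)
        {
            f = nextafter(f, 2.0);
            double acc = 0;
            for (long j = 0; j < n; j++)
                acc += f;
            // fma computes n*f - acc with a single rounding.
            double error = fabs(fma((double) n, f, -acc)) / Ulp(acc);
            if (worst < error)
                worst = error;
        }
        printf("i=%d: worst error = %g ULP (log2 ~ %g)\n",
               i, worst, log2(worst));
    }
}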
That's many different possibilities until we reach 2 raised to the number of significand bits... So I doubt that you can find any easy function of the bit pattern, just for the default rounding mode.
EDIT
With the ArbitraryPrecisionFloat package for Squeak Smalltalk, I tried to generalize the formulation for Floats with p significand bits.
For example, with an 11-bit significand:
p := 11.
(3 to: p + 1) collect: [:power |
    | n setOfErrors |
    n := 1 << power.
    setOfErrors := (1 << p to: 2<<p - 1) collect: [:significand |
        | f acc |
        f := (significand asArbitraryPrecisionFloatNumBits: p) timesTwoPower: 0 - p.
        acc := 0. n timesRepeat: [acc := acc + f].
        (n*f-acc/(acc ulp)) asFraction] as: Set.
    {power. setOfErrors max log2 printShowingDecimalPlaces: 2. setOfErrors size}].
We get these results:
#(#(3 '1.00' 5) #(4 '2.00' 9) #(5 '3.00' 17) #(6 '3.91' 32) #(7 '4.91' 61) #(8 '5.88' 120) #(9 '6.77' 220) #(10 '7.41' 354) #(11 '7.41' 512) #(12 '10.00' 1024))
Note that past 2^(p+1) operands, the sum would not change anymore.
Up to power=floor(p/2), there are 2^(power-1)+1 different results, as shown by the previous sample. Then the number of different results increases more slowly, but at the end it reaches 2^(p-1), that is, 1 different result for every 2 different significands.
This indicates that notQuiteFMA is going to be rather complex.
Inquiring about the pattern of (n*f-acc/(acc ulp)), we see that it seems to depend on the ceil(log2(n)) last bits of the f significand, until the sum reaches a new binade...
So, maybe you could inquire: how does the error pattern of n*(f+k*ulp(f)) compare with that of n*f? Does it follow a predictable pattern for the successors of f? If you can answer positively, then there are some chances that you can devise an accelerated notQuiteFMA function...
Upvotes: 0
Reputation: 54733
I can answer this question, although I don't think you will like it.
The value will stop incrementing as soon as the sum would need more than 53 significant bits. We can compute that. Here is a script in Python that shows the point at which adding 1 to a float no longer changes the float:
import struct

j = 1
for i in range(1, 58):
    j = 2*j
    r = float(j)
    s = r + 1
    if r == s:
        pat = struct.pack('>d', r)
        print(pat.hex())
        print(i, j, r, s)
        break
Output:
4340000000000000
53 9007199254740992 9007199254740992.0 9007199254740992.0
The first line is the hex representation of that value: the exponent field is 0x434 = 1076, an unbiased exponent of 53, and the fraction field is all zeros. So, it is the hidden 1 with an exponent of 53.
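A small C decode of that bit pattern (my sketch; field layout per IEEE-754 binary64) confirms the reading:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    // Decode 2^53 (hex 4340000000000000): the fraction field is zero, so
    // the significand is just the hidden leading 1, scaled by 2^53.
    double r = 9007199254740992.0; // 2^53
    uint64_t bits;
    memcpy(&bits, &r, sizeof bits);
    printf("sign = %llu\n", (unsigned long long) (bits >> 63));
    printf("unbiased exponent = %lld\n",
           (long long) ((bits >> 52) & 0x7FF) - 1023);
    printf("fraction = %013llx\n",
           (unsigned long long) (bits & 0xFFFFFFFFFFFFFull));
}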
Upvotes: -3