Reputation: 1439
Is there a sub-O(n) way of replicating repeated summation of floating-point values in IEEE 754 arithmetic?
That is, is there a faster way of exactly replicating the result of the following pseudocode:
f64 notQuiteFMA(f64 acc, f64 f, bigint n) {
    for (bigint i = 0; i < n; i++) {
        acc = acc + f;
    }
    return acc;
}
?
In real arithmetic the answer is straightforward: return acc + f*n suffices. However, in finite precision the answer is far more complex. As one example of this, note that notQuiteFMA(0, 1, 1 << 256) is somewhere around 2^53, not the 2^256 that one would expect in infinite precision. (This example also shows why speeding this calculation up could be desirable: counting to 2^256 one addition at a time is rather impractical.)
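For illustration (my sketch, not part of the original question), the same saturation can be observed cheaply in single precision, where x + 1 == x first occurs at 2^24:

#include <stdio.h>

int main(void)
{
    // Adding 1 stops changing a float accumulator once it reaches 2^24:
    // there the real sum lands halfway between representable values and
    // ties round back down to even. The f64 loop saturates near 2^53 the
    // same way.
    float acc = 0;
    for (long i = 0; i < 20000000; i++)
        acc += 1.0f;
    printf("%.1f\n", acc); // prints 16777216.0, not 20000000.0
}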
x + 1 == x first occurs around 2^53, when the addition is insufficient to increment the mantissa and the result is rounded down. The exact value depends on the rounding mode, I believe.
Upvotes: 3
Views: 177
Reputation: 223693
We can implement an O(log n) solution by stepping through binades. This is enabled by the fact that all additions within a binade, possibly except the first, change the accumulating sum by the same amount, and therefore we can easily calculate exactly what sum would be produced by repeated additions until the end of the binade.
A binade is an interval [2^e, 2^(e+1)) or (−2^(e+1), −2^e] for some integer e. For purposes of this answer, the entire minimum exponent range (including the minimum-exponent normals and all the subnormals) will be considered a single binade.
For a floating-point number x (a number representable in the floating-point format), define E(x) to be the exponent used in the representation of x. Given the definition of binade used in this answer, E(x) identifies the binade x is in, except for its sign.
Expressions in code format represent floating-point arithmetic: plain x+y is the real-number-arithmetic result of adding x and y, while `x+y` is the floating-point result of adding x and y. The plain and code forms, x and `x`, are used for the same value as appropriate.
Floating-point addition is weakly monotonic for finite values: y < z implies `x+y` <= `x+z`, in any rounding mode, except when x is an infinity and y or z is an infinity of the opposite sign, resulting in a NaN.
Let A0, A1, and A2 be the values of A before, between, and after two executions of `A += f`. Consider the case when E(A0) ≥ E(A1) = E(A2). Let g be A2−A1. g is representable: by Sterbenz’ Lemma if A1 is normal, or, if A1 is subnormal, because A1, A2, and f are all subnormal (or zero) and there are no bits below the ULP of A2 to be lost in the addition. (Since g is representable, `A2-A1` calculates it with no rounding error.)
We will show that all subsequent additions of f that produce a value of A with E(A) = E(A1) increase A by the same amount, g.
Let u be the ULP of A1. Let r be the remainder of |f| modulo u. Let s be +1 if f is positive or −1 if f is negative. (The case when f is zero is trivial.) Using a directed rounding mode (toward −∞, toward zero, or toward +∞), the rounding error in any subsequent addition that does not change E(A) is either always −r or always u−r, according to the rounding method and the sign of f. Using round to nearest with ties to away, the rounding error is always −sr if r < ½u and always s(u−r) otherwise. With round to nearest with ties to even, if r ≠ ½u, the rounding error is always −sr or always s(u−r), as with ties to away.
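As a quick empirical check of the directed-rounding case (my sketch, with arbitrarily chosen A and f, assuming the implementation honors FENV_ACCESS), every addition below incurs the same rounding error:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    // With rounding toward -infinity and positive operands, each addition
    // of f that keeps A in the binade [16, 32) has rounding error -r.
    fesetround(FE_DOWNWARD);
    float A = 16.0f, f = 0.3f;
    for (int i = 0; i < 6; i++)
    {
        float before = A;
        A += f;
        // double holds the sum of two floats of these magnitudes exactly,
        // so this difference is the exact rounding error of the addition.
        double error = (double) A - ((double) before + (double) f);
        printf("error = %a\n", error);
    }
}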
That leaves round to nearest with ties to even with r = ½u. Let b be the bit of f in the u position. In this case, the low bit of A1 must be even, because it was the result of adding f, which has remainder modulo u of ½u, to A0 (which is necessarily 0 modulo u since E(A0) ≥ E(A1)) to produce A1. In that addition, the real-number result was exactly midway between two representable values, so it was rounded to an even low digit (a multiple of 2u).
Given the low bit of A1 is even, we have two cases:
- b is 0, so the addition of f has a real-number-arithmetic result that is 0u + ½u modulo 2u ≡ ½u, and it is rounded to an even low digit, so the rounding error is −½u if the result is positive or +½u if it is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).
- b is 1, so the addition of f has a real-number-arithmetic result that is 1u + ½u modulo 2u ≡ 1½u, and it is rounded to an even low digit, so the rounding error is +½u if the result is positive and −½u if the result is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).
(To see why we go to the second addition within a binade before asserting all subsequent additions add the same amount, observe the addition that produced A0 may have involved bits from a prior A
with a lower ULP, in which case the real-number-arithmetic result might not have been exactly midway between
representable values, so A0 could have an odd low digit even using round to nearest with ties to even, and A2−A1 would not equal A1−A0.)
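A similar sketch (again with arbitrary values of my own choosing, default round-to-nearest) shows the increments settling to the constant g the argument establishes:

#include <stdio.h>

int main(void)
{
    // Print the exact change produced by each addition. Once the sum is
    // inside the binade [16, 32), all additions after the first change A
    // by the same representable amount g.
    float A = 16.0f, f = 0.3f;
    for (int i = 0; i < 8; i++)
    {
        float before = A;
        A += f;
        // A - before is exact by Sterbenz' Lemma, since A and before are
        // within a factor of two of each other.
        printf("A = %a, change = %a\n", A, A - before);
    }
}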
Given the above, we can calculate how many additions of g, say t, can be performed until just before the next binade is reached. Then we can calculate the exact value of A just before exiting the current binade, A+tg. Then a single addition gives us the first A in that binade, and another addition either takes us into a new binade or gives us the information needed to perform the calculation for the current binade. Then we can repeat this for each binade until the desired number of additions is reached.
Below is code demonstrating the concept. This code needs some improvement, per comments. I expect to work on it when time permits, which may be more than a few weeks from now. The E function in the code takes the exponent from frexp, not clamping it to the minimum exponent as described in the discussion above. This may slow the code but should not make it incorrect. I expect the code handles subnormals. Infinities and NaNs could be added via ordinary tests.
Calculation of t, the number of additions that can be performed before exiting the current binade, could use more discussion about the precision required and how rounding might affect it.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
static float Sum0(float A, float f, int n)
{
    if (n < 0) { f = -f, n = -n; }
    while (n--) A += f;
    return A;
}
// Return the exponent of x, designating which binade it is in.
static int E(float x)
{
    int e;
    frexp(x, &e);
    return e;
}
static float Sum1(float A, float f, int n)
{
    if (n < 0) { f = -f, n = -n; }
    if (n == 0) return A;
    while (1)
    {
        // Record initial exponent.
        int e0 = E(A);
        A += f;
        if (--n == 0) return A;
        // If exponent changed, continue to do another step.
        if (E(A) != e0) continue;
        /* Note it is possible that A crossed zero here and is in the "same"
           binade with a different sign. With rounding modes toward zero or
           round to nearest with ties to away, the rounding caused by adding f
           may differ. However, if we have crossed zero, the next addition
           will put A in a new binade, and the loop will exit this iteration
           and start a new step.
        */
        float A0 = A;
        A += f;
        if (--n == 0) return A;
        // If exponent changed, continue to do another step.
        if (E(A) != e0) continue;
        // Calculate exact change from one addition.
        float g = A - A0;
        // If g is zero, no future additions will change the sum.
        if (g == 0) return A;
        /* Calculate how many additions can be done until just before the
           next binade.
           If A and f have the same sign, A is increasing in magnitude, and
           the next binade is at the next higher exponent. Otherwise, the
           next binade is at the bottom of the current binade.
           We use copysign to get the right sign for the target. .5 is used
           with ldexpf because C's specifications for frexp and ldexpf
           normalize the significand to [.5, 1) instead of IEEE-754's [1, 2).
        */
        int e1 = e0 + (signbit(A) == signbit(f));
        double t = floor((ldexpf(copysignf(.5, A), e1) - A) / (double) g);
        /* Temporary workaround to avoid incorrectly stepping to the edge of
           a lower binade:
        */
        if (t == 0) continue;
        --t;
        if (n <= t) return A + n*g;
        A += t*g;
        n -= t;
    }
}
static int Check(float A, float f, int n)
{
    float A0 = Sum0(A, f, n);
    float A1 = Sum1(A, f, n);
    return A0 != A1;
}

static void Test(float A, float f, int n)
{
    float A0 = Sum0(A, f, n);
    float A1 = Sum1(A, f, n);
    if (A0 != A1)
    {
        printf("Adding %a = %.99g to %a = %.99g %d times:\n", f, f, A, A, n);
        printf("\tSum0 -> %a = %.99g.\n", A0, A0);
        printf("\tSum1 -> %a = %.99g.\n", A1, A1);
        exit(EXIT_FAILURE);
    }
}
int main(void)
{
    srand(1);
    for (int i = 0; i < 1000000; ++i)
    {
        if (i % 10000 == 0) printf("Tested %d cases.\n", i);
        Test(rand() - RAND_MAX/2, rand() - RAND_MAX/2, rand() % (1<<20));
    }
}
Upvotes: 5
Reputation: 9512
This is a tough problem...
At least we know that up to n=5, the result is always n*f in round-to-nearest-even mode.
See for example Is 3*x+x always exact?
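A brute-force spot check of that claim (my own sketch, in C; a random sample, not a proof):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // For n <= 5 in round-to-nearest-even, summing f n times should equal
    // the correctly rounded product n*f.
    srand(1);
    for (int trial = 0; trial < 1000000; trial++)
    {
        double f = rand() / (double) RAND_MAX;
        for (int n = 1; n <= 5; n++)
        {
            double acc = 0;
            for (int i = 0; i < n; i++) acc += f;
            if (acc != n*f)
            {
                printf("mismatch: f=%a, n=%d\n", f, n);
                return 1;
            }
        }
    }
    printf("no mismatches in 1000000 trials\n");
    return 0;
}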
But then we gradually lose precision... At which pace?
I think that we can predict the case of overflow, so I will not bother with that. We can thus ignore the exponent part and focus on the significand.
The case of gradual underflow is also not very interesting, because addition will be exact until the biased exponent reaches 1.
So we can focus on significands in the (1,2) interval, 1 itself being uninteresting.
I tried the following Smalltalk snippet in order to get an idea of error bounds:
sample := Array new: 60000.
x := 1.0.
00001 to: 10000 do: [:i | sample at: i put: (x := x successor)].
x := 2.0.
10001 to: 20000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.5.
20001 to: 30000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.5.
30001 to: 40000 do: [:i | sample at: i put: (x := x successor)].
x := 1.125.
40001 to: 50000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.125.
50001 to: 60000 do: [:i | sample at: i put: (x := x successor)].
(3 to: 10) collect: [:i |
    | setOfErrors |
    setOfErrors := sample collect: [:f | ((1 to: 1<<i) inject: 0 into: [:acc :ignored | acc+f]) - (1<<i*f) / f ulp] as: Set.
    {i. setOfErrors size. setOfErrors max log2}].
1<<i is a left shift, so equivalent to 2^i.
So the error bounds for the case when n is a power of two, for the above significand sample, are:
#(#(3 5 4.0) #(4 9 6.0) #(5 17 8.0) #(6 33 10.0) #(7 65 12.0) #(8 129 14.0) #(9 257 16.0) #(10 513 18.0))
That is a maximum error of 2^(2*(i-1))*ulp(f) when n=2^i, with all the 2^(i-1)+1 possible combinations of k*2^i*ulp(f) errors, k=-2^(i-2)...2^(i-2).
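For readers who do not speak Smalltalk, here is a rough C rendition of the experiment (my sketch; it samples only 10000 successors of 1.0 rather than the full 60000-point sample, and uses fma to evaluate n*f - acc with a single rounding):

#include <math.h>
#include <stdio.h>

// ULP of a finite nonzero double x.
static double Ulp(double x)
{
    int e;
    frexp(x, &e);              // |x| = m * 2^e with m in [0.5, 1).
    return ldexp(1.0, e - 53); // binary64 has a 53-bit significand.
}

int main(void)
{
    // For n = 2^i, measure the worst error of repeated addition relative
    // to the exact product n*f, in ULPs of the final sum.
    for (int i = 3; i <= 10; i++)
    {
        long n = 1L << i;
        double worst = 0;
        double f = 1.0;
        for (int k = 0; k < 10000; k++)
        {
            f = nextafter(f, 2.0);
            double acc = 0;
            for (long j = 0; j < n; j++)
                acc += f;
            // fma computes n*f - acc with a single rounding.
            double error = fabs(fma((double) n, f, -acc)) / Ulp(acc);
            if (worst < error)
                worst = error;
        }
        printf("i=%d: worst error = %g ULP (log2 ~ %g)\n",
               i, worst, log2(worst));
    }
}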
That's many different possibilities until we reach 2 raised to the number of significand bits... So I doubt that you can find any easy function of the bit pattern, just for the default rounding mode.
EDIT
With the ArbitraryPrecisionFloat package for Squeak Smalltalk, I tried to generalize the formulation for Floats with p significand bits.
For example, with an 11-bit significand:
p := 11.
(3 to: p + 1) collect: [:power |
    | n setOfErrors |
    n := 1 << power.
    setOfErrors := (1 << p to: 2<<p - 1) collect: [:significand |
        | f acc |
        f := (significand asArbitraryPrecisionFloatNumBits: p) timesTwoPower: 0 - p.
        acc := 0. n timesRepeat: [acc := acc + f].
        (n*f-acc/(acc ulp)) asFraction] as: Set.
    {power. setOfErrors max log2 printShowingDecimalPlaces: 2. setOfErrors size}].
We get these results:
#(#(3 '1.00' 5) #(4 '2.00' 9) #(5 '3.00' 17) #(6 '3.91' 32) #(7 '4.91' 61) #(8 '5.88' 120) #(9 '6.77' 220) #(10 '7.41' 354) #(11 '7.41' 512) #(12 '10.00' 1024))
Note that past 2^(p+1) operands, the sum would not change anymore.
Up to power=floor(p/2), there are 2^(power-1)+1 different results, as shown by the previous sample. Then the number of different results increases more slowly, but at the end it reaches 2^(p-1), that is, 1 different result for every 2 different significands.
This indicates that notQuiteFMA is going to be rather complex.
Inquiring about the pattern of (n*f-acc/(acc ulp)), we see that it seems to depend on the ceil(log2(n)) last bits of the f significand, until the sum reaches a new binade...
So, maybe you could inquire: how does the error pattern of n*(f+k*ulp(f)) compare with that of n*f? Does it follow a predictable pattern for the successors of f? If you can answer positively, then there are some chances that you can devise an accelerated notQuiteFMA function...
Upvotes: 0
Reputation: 54733
I can answer this question, although I don't think you will like it.
The value will stop incrementing as soon as the sum would need more than 53 significant bits. We can compute that. Here is a script in Python that shows the point at which adding 1 to a float no longer changes the float:
import struct

j = 1
for i in range(1, 58):
    j = 2*j
    r = float(j)
    s = r + 1
    if r == s:
        pat = struct.pack('>d', r)
        print(pat.hex())
        print(i, j, r, s)
        break
Output:
4340000000000000
53 9007199254740992 9007199254740992.0 9007199254740992.0
The first line is the hex representation of that value: the exponent field is 0x434 = 1076, an unbiased exponent of 53, and the fraction field is all zeros. So, it is the hidden 1 with an exponent of 53.
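A small C decode of that bit pattern (my sketch; field layout per IEEE-754 binary64) confirms the reading:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    // Decode 2^53 (hex 4340000000000000): the fraction field is zero, so
    // the significand is just the hidden leading 1, scaled by 2^53.
    double r = 9007199254740992.0; // 2^53
    uint64_t bits;
    memcpy(&bits, &r, sizeof bits);
    printf("sign = %llu\n", (unsigned long long) (bits >> 63));
    printf("unbiased exponent = %lld\n",
           (long long) ((bits >> 52) & 0x7FF) - 1023);
    printf("fraction = %013llx\n",
           (unsigned long long) (bits & 0xFFFFFFFFFFFFFull));
}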
Upvotes: -3