Reputation: 61
There is a sequence:
x(n+2)=9/4*x(n+1)-1/2*x(n)
x(1)=1/3,x(2)=1/12
The exact result is x(n)=4^(1-n)/3
I wish to show the rounding error of x(60) in calculation.
My code is
#include <stdio.h>
#include <math.h>
int main(void)
{
float x[60];
x[0] = 1./3;
x[1] = 1./12;
for (int i = 2; i < 60; i++) {
x[i] = 9./4*x[i-1]-1./2*x[i-2];
}
double y[60];
y[0] = 1./3;
y[1] = 1./12;
for (int i = 2; i < 60; i++) {
y[i] = 9./4*y[i-1]-1./2*y[i-2];
}
printf("single:%g, double:%g, exact:%g\n", x[59], y[59], pow(4,-59)/3);
return 0;
}
I compile it with gcc:
gcc seq.c
The output is:
single:1.00309e-36, double:1.71429, exact:1.00309e-36
If I change the code above like this:
#include <stdio.h>
#include <math.h>
int main(void)
{
float x[60];
x[0] = 1.f/3;
x[1] = 1.f/12;
for (int i = 2; i < 60; i++) {
x[i] = 9.f/4*x[i-1]-1.f/2*x[i-2];
}
double y[60];
y[0] = 1./3;
y[1] = 1./12;
for (int i = 2; i < 60; i++) {
y[i] = 9./4*y[i-1]-1./2*y[i-2];
}
printf("single:%g, double:%g, exact:%g\n", x[59], y[59], pow(4,-59)/3);
return 0;
}
where 'f' is added after constant floating-point number for calculation of x-array.
The output seems normal:
single:-9.2035e+08, double:1.71429, exact:1.00309e-36
My question is:
Why does the result of float data type equal to the exact result in the first situation?
What does the compiler do?
Upvotes: 5
Views: 318
Reputation: 44340
It seems to me that you are well aware that floating point rounding errors can lead to completely wrong results. Actually, it seems that you are more surprised by getting the "correct" result in example 1 than you are surprised by the wrong result in example 2.
Well, rounding errors may lead to extremely wrong results but you can't assume that roundings errors will always lead to extremely wrong results. Sometimes the rounding errors will only cause a minor error and at other times the rounding errors will cause the whole calculation to be unstable and lead to extreme errors.
The answer from @user2357112 https://stackoverflow.com/a/55194247/4386427 gives a good description of your specific case.
But one part of your question is still unanswered:
What does the compiler do?
I assume that you are asking why this code
a) x[i] = 9./4*x[i-1]-1./2*x[i-2];
gives different results than this code
b) x[i] = 9.f/4*x[i-1]-1.f/2*x[i-2];
^ ^
The answer is that case a) requires that the calculations is carried out with double precision as 9.
has the type double while case b) allows that the calculation is carried out with single precision as all types are floats.
If the compiler decides to use single precision operations instead of double the rounding errors will be different in case a) and case b). Different rounding errors (may) lead to different results as discussed above.
There is no single answer explaining what the compiler will do as different compilers may give different result. Below is one example generated using https://godbolt.org/ and x86-64 gcc 8.3 and -O0.
For simplicity it only covers 9./4*x[i-1]
versus 9.f/4*x[i-1]
and I have only copied the lines that differs.
Case 9./4*x[i-1]:
movss xmm0, DWORD PTR [rax]
cvtss2sd xmm1, xmm0
movsd xmm0, QWORD PTR .LC0[rip]
mulsd xmm0, xmm1
and
Case 9.f/4*x[i-1]:
movss xmm1, DWORD PTR [rax]
movss xmm0, DWORD PTR .LC0[rip]
mulss xmm0, xmm1
cvtss2sd xmm0, xmm0
As it can be seen the difference is use of single precision (mulss
) and double precision (mulsd
).
So to conclude:
What does the compiler do?
It uses different floating point precision for the calculation leading to different rounding errors which again leads to different results.
Upvotes: 0
Reputation: 281842
float
is not more precise than double
, and your float
computation has not given you the exact result of pow(4,-59)/3
.
What's going on is that your recurrence is designed to take a tiny rounding error and amplify it every iteration. In exact math, each value should be exactly one quarter of the previous value, but if it's not exactly a quarter due to rounding error, the difference gets magnified on every step.
Since a quarter of a representable value is always representable (until you hit subnormal numbers and underflow issues), the recurrence has an additional property: if the computation is performed in a precision sufficiently in excess of the precision with which the results are stored, then rounding the results to lower precision for storage will round to exactly a quarter of the previous value. (The choice of 9./4
and 1./2
factors give the recurrence an even stronger version of this property, where the result is exactly a quarter of the old value even before rounding for storage.)
With doubles, with the compiler and compiler settings you're using, the rounding error occurs and gets amplified. With floats, the computations are performed in double precision, eliminating rounding error in the recurrence steps due to the properties described above, so there is nothing to amplify. If the computation for doubles had been performed at long double precision, the same thing would have happened.
Let's take a closer look at the exact values produced, by using the %a
format specifier to print floating-point numbers in hexadecimal notation. That looks like 0x1.5555555555558p-6
, where the part between 0x
and p
is a hexadecimal number, and the part after the p
is a decimal number representing a power of two to multiply the hexadecimal number by. Here, 0x1.5555555555558p-6
represents 0x1.5555555555558 times 2^-6. %a
format always prints the exact value of a float or double, unlike %g
, which rounds.
We'll also show a third computation, storing results as doubles, but doing the math in long double precision.
Our altered program looks like this:
#include <stdio.h>
#include <math.h>
int main(void)
{
float x[60];
x[0] = 1./3;
x[1] = 1./12;
for (int i = 2; i < 60; i++) {
x[i] = 9./4*x[i-1]-1./2*x[i-2];
}
double y[60];
y[0] = 1./3;
y[1] = 1./12;
for (int i = 2; i < 60; i++) {
y[i] = 9./4*y[i-1]-1./2*y[i-2];
}
double z[60];
z[0] = 1./3;
z[1] = 1./12;
for (int i = 2; i < 60; i++) {
z[i] = (long double) 9./4*z[i-1] - (long double) 1./2*z[i-2];
}
printf("float:%a, double:%a, double2:%a, formula:%a\n", x[59], y[59], z[59], pow(4,-59)/3);
for (int i = 0; i < 60; i++) {
printf("%d %a %a %a\n", i, x[i], y[i], z[i]);
}
return 0;
}
And here's the output. I was going to abridge this, but it turns out it's hard to do that without obscuring interesting parts of the pattern:
float:0x1.555556p-120, double:0x1.b6db6db6db6dap+0, double2:0x1.5555555555555p-120, formula:0x1.5555555555555p-120
0 0x1.555556p-2 0x1.5555555555555p-2 0x1.5555555555555p-2
1 0x1.555556p-4 0x1.5555555555555p-4 0x1.5555555555555p-4
2 0x1.555556p-6 0x1.5555555555558p-6 0x1.5555555555555p-6
3 0x1.555556p-8 0x1.555555555557p-8 0x1.5555555555555p-8
4 0x1.555556p-10 0x1.555555555563p-10 0x1.5555555555555p-10
5 0x1.555556p-12 0x1.5555555555c3p-12 0x1.5555555555555p-12
6 0x1.555556p-14 0x1.5555555558c3p-14 0x1.5555555555555p-14
7 0x1.555556p-16 0x1.5555555570c3p-16 0x1.5555555555555p-16
8 0x1.555556p-18 0x1.5555555630c3p-18 0x1.5555555555555p-18
9 0x1.555556p-20 0x1.5555555c30c3p-20 0x1.5555555555555p-20
10 0x1.555556p-22 0x1.5555558c30c3p-22 0x1.5555555555555p-22
11 0x1.555556p-24 0x1.5555570c30c3p-24 0x1.5555555555555p-24
12 0x1.555556p-26 0x1.5555630c30c3p-26 0x1.5555555555555p-26
13 0x1.555556p-28 0x1.5555c30c30c3p-28 0x1.5555555555555p-28
14 0x1.555556p-30 0x1.5558c30c30c3p-30 0x1.5555555555555p-30
15 0x1.555556p-32 0x1.5570c30c30c3p-32 0x1.5555555555555p-32
16 0x1.555556p-34 0x1.5630c30c30c3p-34 0x1.5555555555555p-34
17 0x1.555556p-36 0x1.5c30c30c30c3p-36 0x1.5555555555555p-36
18 0x1.555556p-38 0x1.8c30c30c30c3p-38 0x1.5555555555555p-38
19 0x1.555556p-40 0x1.8618618618618p-39 0x1.5555555555555p-40
20 0x1.555556p-42 0x1.e186186186186p-39 0x1.5555555555555p-42
21 0x1.555556p-44 0x1.bc30c30c30c3p-38 0x1.5555555555555p-44
22 0x1.555556p-46 0x1.b786186186185p-37 0x1.5555555555555p-46
23 0x1.555556p-48 0x1.b6f0c30c30c3p-36 0x1.5555555555555p-48
24 0x1.555556p-50 0x1.b6de186186185p-35 0x1.5555555555555p-50
25 0x1.555556p-52 0x1.b6dbc30c30c3p-34 0x1.5555555555555p-52
26 0x1.555556p-54 0x1.b6db786186185p-33 0x1.5555555555555p-54
27 0x1.555556p-56 0x1.b6db6f0c30c3p-32 0x1.5555555555555p-56
28 0x1.555556p-58 0x1.b6db6de186185p-31 0x1.5555555555555p-58
29 0x1.555556p-60 0x1.b6db6dbc30c3p-30 0x1.5555555555555p-60
30 0x1.555556p-62 0x1.b6db6db786185p-29 0x1.5555555555555p-62
31 0x1.555556p-64 0x1.b6db6db6f0c3p-28 0x1.5555555555555p-64
32 0x1.555556p-66 0x1.b6db6db6de185p-27 0x1.5555555555555p-66
33 0x1.555556p-68 0x1.b6db6db6dbc3p-26 0x1.5555555555555p-68
34 0x1.555556p-70 0x1.b6db6db6db785p-25 0x1.5555555555555p-70
35 0x1.555556p-72 0x1.b6db6db6db6fp-24 0x1.5555555555555p-72
36 0x1.555556p-74 0x1.b6db6db6db6ddp-23 0x1.5555555555555p-74
37 0x1.555556p-76 0x1.b6db6db6db6dbp-22 0x1.5555555555555p-76
38 0x1.555556p-78 0x1.b6db6db6db6dap-21 0x1.5555555555555p-78
39 0x1.555556p-80 0x1.b6db6db6db6dap-20 0x1.5555555555555p-80
40 0x1.555556p-82 0x1.b6db6db6db6dap-19 0x1.5555555555555p-82
41 0x1.555556p-84 0x1.b6db6db6db6dap-18 0x1.5555555555555p-84
42 0x1.555556p-86 0x1.b6db6db6db6dap-17 0x1.5555555555555p-86
43 0x1.555556p-88 0x1.b6db6db6db6dap-16 0x1.5555555555555p-88
44 0x1.555556p-90 0x1.b6db6db6db6dap-15 0x1.5555555555555p-90
45 0x1.555556p-92 0x1.b6db6db6db6dap-14 0x1.5555555555555p-92
46 0x1.555556p-94 0x1.b6db6db6db6dap-13 0x1.5555555555555p-94
47 0x1.555556p-96 0x1.b6db6db6db6dap-12 0x1.5555555555555p-96
48 0x1.555556p-98 0x1.b6db6db6db6dap-11 0x1.5555555555555p-98
49 0x1.555556p-100 0x1.b6db6db6db6dap-10 0x1.5555555555555p-100
50 0x1.555556p-102 0x1.b6db6db6db6dap-9 0x1.5555555555555p-102
51 0x1.555556p-104 0x1.b6db6db6db6dap-8 0x1.5555555555555p-104
52 0x1.555556p-106 0x1.b6db6db6db6dap-7 0x1.5555555555555p-106
53 0x1.555556p-108 0x1.b6db6db6db6dap-6 0x1.5555555555555p-108
54 0x1.555556p-110 0x1.b6db6db6db6dap-5 0x1.5555555555555p-110
55 0x1.555556p-112 0x1.b6db6db6db6dap-4 0x1.5555555555555p-112
56 0x1.555556p-114 0x1.b6db6db6db6dap-3 0x1.5555555555555p-114
57 0x1.555556p-116 0x1.b6db6db6db6dap-2 0x1.5555555555555p-116
58 0x1.555556p-118 0x1.b6db6db6db6dap-1 0x1.5555555555555p-118
59 0x1.555556p-120 0x1.b6db6db6db6dap+0 0x1.5555555555555p-120
Here, we see first that the float
computation didn't produce the exact value the pow
formula gave (it doesn't have enough precision for that), but it was close enough that the difference was hidden by %g
's rounding. We also see that the float
values are decreasing by exactly a factor of 4 each time, as are the values from the altered double
computation. The double
values from the original double
version start out almost doing that and then diverge once the amplified error overwhelms the computation. The values eventually start increasing by a factor of 2 instead of decreasing by a factor of 4.
Upvotes: 6
Reputation: 3029
This is a calculation, that cannot be done using floating points. You are adding large and small numbers and the rounding error is too large for this kind of calculation.
And the 1/3 and 1/12 are just lucky initial start for float calculation. For other initial values, the both calculations are giving almost the same results and both are usually way wrong.
Upvotes: 0