zdfzr
zdfzr

Reputation: 61

Float is more precise than double?

There is a sequence:

x(n+2)=9/4*x(n+1)-1/2*x(n)

x(1)=1/3,x(2)=1/12

The exact result is x(n)=4^(1-n)/3

I wish to show the rounding error of x(60) in calculation.

My code is

#include <stdio.h>
#include <math.h>
int main(void)
{
    float x[60];
    x[0] = 1./3;
    x[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        x[i] = 9./4*x[i-1]-1./2*x[i-2];
    }
    double y[60];
    y[0] = 1./3;
    y[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        y[i] = 9./4*y[i-1]-1./2*y[i-2];
    }
    printf("single:%g, double:%g, exact:%g\n", x[59], y[59], pow(4,-59)/3);
    return 0;
}

I compile it with gcc:

gcc seq.c

The output is:

single:1.00309e-36, double:1.71429, exact:1.00309e-36

If I change the code above like this:

#include <stdio.h>
#include <math.h>
int main(void)
{
    float x[60];
    x[0] = 1.f/3;
    x[1] = 1.f/12;
    for (int i = 2; i < 60; i++) {
        x[i] = 9.f/4*x[i-1]-1.f/2*x[i-2];
    }
    double y[60];
    y[0] = 1./3;
    y[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        y[i] = 9./4*y[i-1]-1./2*y[i-2];
    }
    printf("single:%g, double:%g, exact:%g\n", x[59], y[59], pow(4,-59)/3);
    return 0;
}

where 'f' is added after constant floating-point number for calculation of x-array.

The output seems normal:

single:-9.2035e+08, double:1.71429, exact:1.00309e-36

My question is:

Why does the result of float data type equal to the exact result in the first situation?

What does the compiler do?

Upvotes: 5

Views: 318

Answers (3)

4386427
4386427

Reputation: 44340

It seems to me that you are well aware that floating point rounding errors can lead to completely wrong results. Actually, it seems that you are more surprised by getting the "correct" result in example 1 than you are surprised by the wrong result in example 2.

Well, rounding errors may lead to extremely wrong results but you can't assume that roundings errors will always lead to extremely wrong results. Sometimes the rounding errors will only cause a minor error and at other times the rounding errors will cause the whole calculation to be unstable and lead to extreme errors.

The answer from @user2357112 https://stackoverflow.com/a/55194247/4386427 gives a good description of your specific case.

But one part of your question is still unanswered:

What does the compiler do?

I assume that you are asking why this code

a) x[i] = 9./4*x[i-1]-1./2*x[i-2];

gives different results than this code

b) x[i] = 9.f/4*x[i-1]-1.f/2*x[i-2];
            ^            ^

The answer is that case a) requires that the calculations is carried out with double precision as 9. has the type double while case b) allows that the calculation is carried out with single precision as all types are floats.

If the compiler decides to use single precision operations instead of double the rounding errors will be different in case a) and case b). Different rounding errors (may) lead to different results as discussed above.

There is no single answer explaining what the compiler will do as different compilers may give different result. Below is one example generated using https://godbolt.org/ and x86-64 gcc 8.3 and -O0.

For simplicity it only covers 9./4*x[i-1] versus 9.f/4*x[i-1] and I have only copied the lines that differs.

   Case 9./4*x[i-1]:

    movss   xmm0, DWORD PTR [rax]
    cvtss2sd        xmm1, xmm0
    movsd   xmm0, QWORD PTR .LC0[rip]
    mulsd   xmm0, xmm1

and

   Case 9.f/4*x[i-1]:

    movss   xmm1, DWORD PTR [rax]
    movss   xmm0, DWORD PTR .LC0[rip]
    mulss   xmm0, xmm1
    cvtss2sd        xmm0, xmm0

As it can be seen the difference is use of single precision (mulss) and double precision (mulsd).

So to conclude:

What does the compiler do?

It uses different floating point precision for the calculation leading to different rounding errors which again leads to different results.

Upvotes: 0

user2357112
user2357112

Reputation: 281842

float is not more precise than double, and your float computation has not given you the exact result of pow(4,-59)/3.

What's going on is that your recurrence is designed to take a tiny rounding error and amplify it every iteration. In exact math, each value should be exactly one quarter of the previous value, but if it's not exactly a quarter due to rounding error, the difference gets magnified on every step.

Since a quarter of a representable value is always representable (until you hit subnormal numbers and underflow issues), the recurrence has an additional property: if the computation is performed in a precision sufficiently in excess of the precision with which the results are stored, then rounding the results to lower precision for storage will round to exactly a quarter of the previous value. (The choice of 9./4 and 1./2 factors give the recurrence an even stronger version of this property, where the result is exactly a quarter of the old value even before rounding for storage.)


With doubles, with the compiler and compiler settings you're using, the rounding error occurs and gets amplified. With floats, the computations are performed in double precision, eliminating rounding error in the recurrence steps due to the properties described above, so there is nothing to amplify. If the computation for doubles had been performed at long double precision, the same thing would have happened.


Let's take a closer look at the exact values produced, by using the %a format specifier to print floating-point numbers in hexadecimal notation. That looks like 0x1.5555555555558p-6, where the part between 0x and p is a hexadecimal number, and the part after the p is a decimal number representing a power of two to multiply the hexadecimal number by. Here, 0x1.5555555555558p-6 represents 0x1.5555555555558 times 2^-6. %a format always prints the exact value of a float or double, unlike %g, which rounds.

We'll also show a third computation, storing results as doubles, but doing the math in long double precision.

Our altered program looks like this:

#include <stdio.h>
#include <math.h>
int main(void)
{
    float x[60];
    x[0] = 1./3;
    x[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        x[i] = 9./4*x[i-1]-1./2*x[i-2];
    }
    double y[60];
    y[0] = 1./3;
    y[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        y[i] = 9./4*y[i-1]-1./2*y[i-2];
    }
    double z[60];
    z[0] = 1./3;
    z[1] = 1./12;
    for (int i = 2; i < 60; i++) {
        z[i] = (long double) 9./4*z[i-1] - (long double) 1./2*z[i-2];
    }
    printf("float:%a, double:%a, double2:%a, formula:%a\n", x[59], y[59], z[59], pow(4,-59)/3);
    for (int i = 0; i < 60; i++) {
        printf("%d %a %a %a\n", i, x[i], y[i], z[i]);
    }
    return 0;
}

And here's the output. I was going to abridge this, but it turns out it's hard to do that without obscuring interesting parts of the pattern:

float:0x1.555556p-120, double:0x1.b6db6db6db6dap+0, double2:0x1.5555555555555p-120, formula:0x1.5555555555555p-120
0 0x1.555556p-2 0x1.5555555555555p-2 0x1.5555555555555p-2
1 0x1.555556p-4 0x1.5555555555555p-4 0x1.5555555555555p-4
2 0x1.555556p-6 0x1.5555555555558p-6 0x1.5555555555555p-6
3 0x1.555556p-8 0x1.555555555557p-8 0x1.5555555555555p-8
4 0x1.555556p-10 0x1.555555555563p-10 0x1.5555555555555p-10
5 0x1.555556p-12 0x1.5555555555c3p-12 0x1.5555555555555p-12
6 0x1.555556p-14 0x1.5555555558c3p-14 0x1.5555555555555p-14
7 0x1.555556p-16 0x1.5555555570c3p-16 0x1.5555555555555p-16
8 0x1.555556p-18 0x1.5555555630c3p-18 0x1.5555555555555p-18
9 0x1.555556p-20 0x1.5555555c30c3p-20 0x1.5555555555555p-20
10 0x1.555556p-22 0x1.5555558c30c3p-22 0x1.5555555555555p-22
11 0x1.555556p-24 0x1.5555570c30c3p-24 0x1.5555555555555p-24
12 0x1.555556p-26 0x1.5555630c30c3p-26 0x1.5555555555555p-26
13 0x1.555556p-28 0x1.5555c30c30c3p-28 0x1.5555555555555p-28
14 0x1.555556p-30 0x1.5558c30c30c3p-30 0x1.5555555555555p-30
15 0x1.555556p-32 0x1.5570c30c30c3p-32 0x1.5555555555555p-32
16 0x1.555556p-34 0x1.5630c30c30c3p-34 0x1.5555555555555p-34
17 0x1.555556p-36 0x1.5c30c30c30c3p-36 0x1.5555555555555p-36
18 0x1.555556p-38 0x1.8c30c30c30c3p-38 0x1.5555555555555p-38
19 0x1.555556p-40 0x1.8618618618618p-39 0x1.5555555555555p-40
20 0x1.555556p-42 0x1.e186186186186p-39 0x1.5555555555555p-42
21 0x1.555556p-44 0x1.bc30c30c30c3p-38 0x1.5555555555555p-44
22 0x1.555556p-46 0x1.b786186186185p-37 0x1.5555555555555p-46
23 0x1.555556p-48 0x1.b6f0c30c30c3p-36 0x1.5555555555555p-48
24 0x1.555556p-50 0x1.b6de186186185p-35 0x1.5555555555555p-50
25 0x1.555556p-52 0x1.b6dbc30c30c3p-34 0x1.5555555555555p-52
26 0x1.555556p-54 0x1.b6db786186185p-33 0x1.5555555555555p-54
27 0x1.555556p-56 0x1.b6db6f0c30c3p-32 0x1.5555555555555p-56
28 0x1.555556p-58 0x1.b6db6de186185p-31 0x1.5555555555555p-58
29 0x1.555556p-60 0x1.b6db6dbc30c3p-30 0x1.5555555555555p-60
30 0x1.555556p-62 0x1.b6db6db786185p-29 0x1.5555555555555p-62
31 0x1.555556p-64 0x1.b6db6db6f0c3p-28 0x1.5555555555555p-64
32 0x1.555556p-66 0x1.b6db6db6de185p-27 0x1.5555555555555p-66
33 0x1.555556p-68 0x1.b6db6db6dbc3p-26 0x1.5555555555555p-68
34 0x1.555556p-70 0x1.b6db6db6db785p-25 0x1.5555555555555p-70
35 0x1.555556p-72 0x1.b6db6db6db6fp-24 0x1.5555555555555p-72
36 0x1.555556p-74 0x1.b6db6db6db6ddp-23 0x1.5555555555555p-74
37 0x1.555556p-76 0x1.b6db6db6db6dbp-22 0x1.5555555555555p-76
38 0x1.555556p-78 0x1.b6db6db6db6dap-21 0x1.5555555555555p-78
39 0x1.555556p-80 0x1.b6db6db6db6dap-20 0x1.5555555555555p-80
40 0x1.555556p-82 0x1.b6db6db6db6dap-19 0x1.5555555555555p-82
41 0x1.555556p-84 0x1.b6db6db6db6dap-18 0x1.5555555555555p-84
42 0x1.555556p-86 0x1.b6db6db6db6dap-17 0x1.5555555555555p-86
43 0x1.555556p-88 0x1.b6db6db6db6dap-16 0x1.5555555555555p-88
44 0x1.555556p-90 0x1.b6db6db6db6dap-15 0x1.5555555555555p-90
45 0x1.555556p-92 0x1.b6db6db6db6dap-14 0x1.5555555555555p-92
46 0x1.555556p-94 0x1.b6db6db6db6dap-13 0x1.5555555555555p-94
47 0x1.555556p-96 0x1.b6db6db6db6dap-12 0x1.5555555555555p-96
48 0x1.555556p-98 0x1.b6db6db6db6dap-11 0x1.5555555555555p-98
49 0x1.555556p-100 0x1.b6db6db6db6dap-10 0x1.5555555555555p-100
50 0x1.555556p-102 0x1.b6db6db6db6dap-9 0x1.5555555555555p-102
51 0x1.555556p-104 0x1.b6db6db6db6dap-8 0x1.5555555555555p-104
52 0x1.555556p-106 0x1.b6db6db6db6dap-7 0x1.5555555555555p-106
53 0x1.555556p-108 0x1.b6db6db6db6dap-6 0x1.5555555555555p-108
54 0x1.555556p-110 0x1.b6db6db6db6dap-5 0x1.5555555555555p-110
55 0x1.555556p-112 0x1.b6db6db6db6dap-4 0x1.5555555555555p-112
56 0x1.555556p-114 0x1.b6db6db6db6dap-3 0x1.5555555555555p-114
57 0x1.555556p-116 0x1.b6db6db6db6dap-2 0x1.5555555555555p-116
58 0x1.555556p-118 0x1.b6db6db6db6dap-1 0x1.5555555555555p-118
59 0x1.555556p-120 0x1.b6db6db6db6dap+0 0x1.5555555555555p-120

Here, we see first that the float computation didn't produce the exact value the pow formula gave (it doesn't have enough precision for that), but it was close enough that the difference was hidden by %g's rounding. We also see that the float values are decreasing by exactly a factor of 4 each time, as are the values from the altered double computation. The double values from the original double version start out almost doing that and then diverge once the amplified error overwhelms the computation. The values eventually start increasing by a factor of 2 instead of decreasing by a factor of 4.

Upvotes: 6

V-X
V-X

Reputation: 3029

This is a calculation, that cannot be done using floating points. You are adding large and small numbers and the rounding error is too large for this kind of calculation.

And the 1/3 and 1/12 are just lucky initial start for float calculation. For other initial values, the both calculations are giving almost the same results and both are usually way wrong.

Upvotes: 0

Related Questions