LolloFake

Reputation: 65

Can double and float have the same precision?

I have to implement a program that calculates the machine epsilon for float and double.
I wrote these functions:

int feps(){
    //machine epsilon for float
    float tmp=1;
    int d=0;
    while(1+(tmp=tmp/2)>1.0f)d++;
    return d;
}

int deps(){
    //machine epsilon for double
    double tmp=1;
    int d=0;
    while(1+(tmp=tmp/2)>1.0)d++;
    return d;
}


Note:
64-bit machine: compiler gcc 4.9.1, target x86_64-linux-gnu
32-bit machine: compiler gcc 4.8.2, target i686-linux-gnu

I tried it on a 64-bit machine and the results were:
Float 23
Double 52
That is what I expected. Then I tried it in a 32-bit virtual machine and the results were very strange:
Float 63
Double 63
I also tried compiling my program with -mpc32, -mpc64 and -mpc80, and these are the results:
-mpc32 Float 23, Double 23
-mpc64 Float 52, Double 52
-mpc80 Float 63, Double 63
I also tried these compilation options on the 64-bit machine, but the results were always 23 and 52.
I know that float is single precision and double is double precision, but is it possible that the compiler on my 32-bit virtual machine uses a binary80 format for both float and double?

I'm quite sure my code is correct, so I think the problem is related to the compiler or to something more subtle.
I've spent the entire day searching for information about floating point. I've read something about MMX/SSE instructions, which I didn't understand very well, and something about the x87 FPU, which may cause problems.


Update:
I want to thank everyone who helped me. I managed to get the real epsilon values for float and double in the 32-bit virtual machine. This is the code:

int feps(){
    float tmp=1;
    int d=0;
    float tmp2=1;
    do{
        tmp2=1+(tmp=tmp/2);
        d++;
    }while(tmp2>1.0f);
    return d-1;
}

int deps(){
    double tmp=1;
    int d=0;
    double tmp2=1;
    do{
        tmp2=1+(tmp=tmp/2);
        d++;
    }while(tmp2>1.0);
    return d-1;
}

As you can see, we need to store the intermediate result in a variable. That prevents 1+(tmp=tmp/2) from being evaluated as a long double in the loop test.
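A variant of the same idea, as a rough sketch rather than a definitive fix: the sum is stored in a volatile variable, on the assumption that this makes the compiler write it back at the declared type on every pass even when optimizations or excess precision are in play. The names feps_bits, deps_bits and check_against_float_h are made up for this illustration. The result is cross-checked against FLT_MANT_DIG and DBL_MANT_DIG from <float.h>.

```c
#include <assert.h>
#include <float.h>

/* Counts how many times 1.0f can be halved while 1 + tmp is still
   greater than 1 after rounding to float. */
int feps_bits(void) {
    float tmp = 1.0f;
    int d = 0;
    for (;;) {
        tmp /= 2;                        /* exact: only the exponent changes */
        volatile float sum = 1.0f + tmp; /* assumption: the volatile store rounds to float */
        if (!(sum > 1.0f)) break;
        d++;
    }
    return d;
}

/* Same loop for double. */
int deps_bits(void) {
    double tmp = 1.0;
    int d = 0;
    for (;;) {
        tmp /= 2;
        volatile double sum = 1.0 + tmp; /* assumption: the volatile store rounds to double */
        if (!(sum > 1.0)) break;
        d++;
    }
    return d;
}

/* The counts should match the number of explicit mantissa bits
   reported by <float.h>. */
void check_against_float_h(void) {
    assert(feps_bits() == FLT_MANT_DIG - 1);
    assert(deps_bits() == DBL_MANT_DIG - 1);
}
```

Calling check_against_float_h() from main should pass silently on platforms where float and double are IEEE 754 binary32 and binary64, where the two counts are 23 and 52.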

Upvotes: 3

Views: 356

Answers (1)

Pascal Cuoq

Reputation: 80255

On the 32-bit platform, ABI constraints make it simpler to use historical floating-point registers; as a consequence, the compiler defines FLT_EVAL_METHOD as 2. This is how you get:

Float 63
Double 63

In short, when FLT_EVAL_METHOD is defined to 2 by the compiler, as is the case on your 32-bit virtual machine, floating-point expressions and constants are evaluated to the precision of long double, regardless of their types. Only assignments to lvalues and explicit casts round the computed values from long double to the actual floating-point type. There are no such constructs at the top level of the expression 1+(tmp=tmp/2), so the addition is evaluated to the precision of long double.
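To see which mode a given compiler uses, the macro can be read directly from <float.h>. The sketch below is only an illustration with made-up function names; it compares a float sum used directly in a comparison with the same sum stored in a float variable first. The variables are volatile to keep the compiler from folding the arithmetic at compile time and, by assumption, to force the store to round.

```c
#include <assert.h>
#include <float.h>

/* Value of FLT_EVAL_METHOD as seen by this translation unit:
   0 = evaluate in the declared type, 1 = at least double,
   2 = long double, negative = indeterminable. */
int eval_method(void) {
    return (int)FLT_EVAL_METHOD;
}

/* 1 + 2^-30 is not representable in float. With FLT_EVAL_METHOD 0 the sum
   rounds to 1 and this returns 0; with wider evaluation it may return 1. */
int raw_sum_exceeds_one(void) {
    volatile float tiny = 1.0f / (1 << 30);
    return 1.0f + tiny > 1.0f;
}

/* Storing the sum in a float variable rounds it to float first,
   so the comparison should be false in every evaluation mode. */
int stored_sum_exceeds_one(void) {
    volatile float tiny = 1.0f / (1 << 30);
    volatile float sum = 1.0f + tiny;
    return sum > 1.0f;
}

/* Invariant that should hold regardless of FLT_EVAL_METHOD. */
void check_rounding(void) {
    assert(stored_sum_exceeds_one() == 0);
}
```

Printing eval_method() and raw_sum_exceeds_one() from main on both machines should show where the two setups differ, while stored_sum_exceeds_one() is expected to stay 0 on both.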

This two-post series shows some examples, in addition to yours, in which FLT_EVAL_METHOD makes a difference. GCC's behavior is deterministic and follows the explanation laid out by J. S. Myers. Clang's behavior is nondeterministic (then and now), and its developers have little interest in improving this mode of their compiler.

Upvotes: 3
