Reputation: 1
I have these variables:
double a = 5;
double b = 1E20;
double c = -b;
and I have summed them like this:
double temp = a+b;
double result = temp+c;
The result equals 0, as expected: 'b' is so large compared to 'a' that a+b doesn't differ from b at all, and adding 'c', which is the same as 'b' but negative, gives us 0. However, if I try it this way:
double result = (a+b)+c;
The result is actually 8. Why is that?
Upvotes: 0
Views: 75
Reputation: 222753
Presumably you are executing this program on an Intel processor. (When asking questions like this, you should always state which compiler you are using, including the version and the command-line switches, and which system you are running the program on.) Intel processors have an 80-bit floating-point format which has 64-bit significands. (A significand is the fraction portion of a floating-point number.)
It appears your compiler is using the processor’s 80-bit floating-point format for intermediate calculations, and it is probably using the IEEE-754 basic 64-bit binary format for double. The C standard allows C implementations to evaluate floating-point expressions with more range and precision than the nominal type. That means, when the compiler is evaluating (or generating code to evaluate) a double expression, it is allowed to use the 80-bit type.
When a floating-point expression is assigned to an object or there is an explicit cast to a floating-point type, the C standard requires the C implementation to “discard” the excess precision.
The above allows us to see what happened. 1e20 represents 10^20, which is a number between 2^66 and 2^67. Written in binary, its leading bit is in the position for value 2^66. Since the 80-bit format has 64-bit significands, the least significant bit that can be represented in the format is at position 2^3 (bits 3 through 66 inclusive make 64 bits). After b = 1e20, when you add 5 to b, the result has to be rounded to fit in bits from 2^66 to 2^3 (which is 8). Since 1e20 is itself a multiple of 8 and 5 is more than half of 8, this results in rounding the number up to the next multiple of 8. Thus, due to rounding, b+5 has the same result as b+8. Then, when you add c, which equals -b, you get 8.
In double temp = a+b;, the assignment forces the C implementation to “discard” the excess precision. Thus, it must convert the result to the double format, which has 53-bit significands. With a leading bit of 2^66, the least significant bit is 2^14. The bits for 2^13 to 2^3 are discarded, and the remaining bits are rounded (which does not cause any change in this case, as the discarded bits amount to 8, which is less than the midpoint of 2^13). Thus, although a+b equals b+8, as we saw above, the result of converting b+8 to double is just b. Then adding c to this produces 0.
Upvotes: 2