Reputation: 1

C - Summing double variables gives different outputs depending on how it's written

I have these variables:

double a = 5;
double b = 1E20;
double c = -b;

and I have summed them like this:

double temp = a+b;
double result = temp+c;

The result is 0, as expected: 'b' is so large compared to 'a' that a+b is no different from b, and adding 'c', which is the same as 'b' but negative, then gives 0. However, if I write it this way:

double result = (a+b)+c;

The result is actually 8. Why is that?

Upvotes: 0

Answers (1)

Eric Postpischil

Reputation: 222753

Presumably you are executing this program on an Intel processor. (When asking questions like this, you should always state which compiler you are using, including the version and the command-line switches, and which system you are running the program on.) Intel processors have an 80-bit floating-point format which has 64-bit significands. (A significand is the fraction portion of a floating-point number.)

It appears your compiler is using the processor’s 80-bit floating-point format for intermediate calculations, and it is probably using the IEEE-754 basic 64-bit binary format for double. The C standard allows C implementations to evaluate floating-point expressions with more range and precision than the nominal type. That means, when the compiler is evaluating (or generating code to evaluate) a double expression, it is allowed to use the 80-bit type.
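One way to check which case applies to a given build is the FLT_EVAL_METHOD macro from <float.h> (C99 and later). The following is only a sketch: the function name is made up for illustration, and it assumes FLT_RADIX is 2, so that the *_MANT_DIG macros count bits.

```c
#include <float.h>

/* Hypothetical helper (not from the question): returns how many
   significand bits the implementation uses when it evaluates an
   expression of type double, or -1 if FLT_EVAL_METHOD does not say.
   Assumes FLT_RADIX == 2, so *_MANT_DIG counts bits. */
int double_expression_significand_bits(void)
{
    switch (FLT_EVAL_METHOD) {
    case 0:  /* each type is evaluated in its own range and precision */
    case 1:  /* float is evaluated as double; double stays double */
        return DBL_MANT_DIG;
    case 2:  /* everything is evaluated as long double */
        return LDBL_MANT_DIG;
    default: /* -1 (indeterminable) or an implementation-defined value */
        return -1;
    }
}
```

On a typical x86-64 build that does double arithmetic with SSE2, I would expect this to return 53. A build that uses the x87 unit for double arithmetic would normally report FLT_EVAL_METHOD as 2, and the function would return 64.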

When a floating-point expression is assigned to an object or there is an explicit cast to a floating-point type, the C standard requires the C implementation to “discard” the excess precision.
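As a sketch of the two ways to ask for that rounding, here are the assignment and the cast side by side. The function names are hypothetical and only for illustration.

```c
/* Both helpers are hypothetical names used only for illustration. */

/* Assignment: the stored value must be rounded to double, so any
   excess precision in a + b is dropped before c is added. */
double sum_via_assignment(double a, double b, double c)
{
    double temp = a + b;
    return temp + c;
}

/* Cast: an explicit cast to double is specified to have the same
   effect, without naming a temporary. */
double sum_via_cast(double a, double b, double c)
{
    return (double)(a + b) + c;
}
```

With a = 5, b = 1e20 and c = -b, both should return 0 on a conforming implementation. Some compilers only perform this rounding reliably in a standards-conforming mode (GCC documents -fexcess-precision=standard for this), so check your switches if you still see 8.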

The above allows us to see what happened. 1e20 represents 10²⁰, which is a number between 2⁶⁶ and 2⁶⁷. Written in binary, its leading bit is in the position for value 2⁶⁶. Since the 80-bit format has 64-bit significands, the least significant bit that can be represented in the format is at position 2³ (bits 3 through 66 make 64 bits). After b = 1e20, when you add 5 to b, the result has to be rounded to fit in bits from 2⁶⁶ to 2³ (which is 8), that is, to a multiple of 8. b is itself a multiple of 8, and 5 is closer to 8 than to 0, so the sum is rounded up to the next multiple of 8. Thus, due to rounding, b+5 has the same result as b+8. Then, when you add c, which equals -b, you get 8.
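To see this rounding without depending on what the compiler does with double expressions, you can do the arithmetic explicitly in long double. This is a sketch under an assumption: it only reproduces the 8 where long double is the 80-bit format (LDBL_MANT_DIG is 64) and the processor is left in its default extended-precision mode. The function name is hypothetical.

```c
#include <float.h>

/* Hypothetical helper: evaluates (a + b) + c with every intermediate
   held in long double, then rounds the final result to double.
   Assumption: where LDBL_MANT_DIG is 64 this mimics the 80-bit
   intermediate evaluation described above. */
double sum_in_long_double(double a, double b, double c)
{
    long double wide = (long double)a + (long double)b; /* rounded to long double */
    return (double)(wide + (long double)c);
}
```

With a = 5, b = 1e20 and c = -b, this should give 8 where long double has a 64-bit significand. Where long double is the same as double it gives 0, and where long double has a 113-bit significand b+5 is exact, so it gives 5.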

In double temp = a+b;, the assignment forces the C implementation to “discard” the excess precision. Thus, it must convert the result to the double format, which has 53-bit significands. With a leading bit of 2⁶⁶, the least significant bit is 2¹⁴. The bits for 2¹³ to 2³ are discarded, and the remaining bits are rounded (which does not cause any change in this case, as the discarded bits happen to be less than the midpoint). Thus, although a+b equals b+8, as we saw above, the result of converting b+8 to double is just b. Then adding c to this produces 0.
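The 2¹⁴ figure can be checked directly by measuring the distance from b to the next larger double. The sketch below does that by stepping the bit pattern; the standard nextafter function from <math.h> is the usual library way to do the same thing. The helper name is made up, and the code assumes double is the IEEE-754 64-bit format and that doubles and 64-bit integers share a byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance from x to the next larger double.
   Assumes x is positive, finite and below DBL_MAX, that double is
   IEEE-754 binary64, and that doubles and 64-bit integers use the
   same byte order. */
double gap_to_next_double(double x)
{
    uint64_t bits;
    double next;

    memcpy(&bits, &x, sizeof bits);    /* view the double as an integer */
    bits += 1;                         /* bit pattern of the next double up */
    memcpy(&next, &bits, sizeof next);
    return next - x;                   /* adjacent doubles subtract exactly */
}
```

gap_to_next_double(1e20) comes out as 16384, which is 2¹⁴. Since 5 (or the 8 left over from the 80-bit rounding) is less than half of that, a+b goes back to exactly b once it is stored in a double.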

Upvotes: 2
