C++ cast to a more precise type and lose accuracy?

Question

Consider two ways of computing something:

1. data in double precision
   -> apply a function with double precision temporaries
   -> return result

2. data in double precision
   -> cast to long double
   -> apply a function with long double precision temporaries
   -> cast to double
   -> return result

Can the second approach give a less accurate result than the first, and if so, in what cases?

Eric Postpischil · Accepted Answer

Yes. Proof: Let c = 0x1p-53 + 0x1p-64. Evaluate 1+c-c-1 in double and in long double (of the common Intel format, with a 64-bit significand). In double, the result is 0, which is the mathematically exact answer. In long double, the result is -0x1p-64, which is wrong (and remains wrong when cast to double).

In double, 1+c adds slightly more than half the ULP (unit of least precision) of 1 to 1, so it produces 1 plus an ULP. Subtracting c subtracts slightly more than half an ULP, so the closest representable number (in double) to the result is 1, so 1 is produced. Then subtracting 1 yields 0.

In long double, 1+c adds 0x1p-53 plus half an ULP of 1. (In long double, the ULP of 1 is 0x1p-63.) Since the result is exactly the same distance from the two nearest representable numbers (in long double), the one with the low bit zero is returned, 1+0x1p-53. Then the exact result of subtracting c is 1 - 0x1p-64. This is exactly representable, so it is returned. Finally, subtracting 1 yields -0x1p-64.
