Reputation: 60381
Consider two ways of computing something:
Can the second solution give a less accurate result than the first one, and if so, in what cases?
Upvotes: 4
Views: 555
Reputation: 222724
Yes. Proof: Let c = 0x1p-53 + 0x1p-64. Evaluate 1+c-c-1 in double and in long double (of the common Intel format, with a 64-bit significand). In double, the result is 0, which is the mathematically exact answer. In long double, the result is -0x1p-64, which is wrong (and remains wrong when cast to double).
In double, 1+c adds slightly more than half the ULP (unit of least precision) of 1 to 1, so it produces 1 plus an ULP. Subtracting c subtracts slightly more than half an ULP, so the closest representable number (in double) to the result is 1, so 1 is produced. Then subtracting 1 yields 0.
In long double, 1+c adds 0x1p-53 plus half an ULP of 1. (In long double, the ULP of 1 is 0x1p-63.) Since the result is exactly the same distance from the two nearest representable numbers (in long double), the one with the low bit zero is returned, 1+0x1p-53. Then the exact result of subtracting c is 1 - 0x1p-64. This is exactly representable, so it is returned. Finally, subtracting 1 yields -0x1p-64.
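Here is a minimal C++ sketch of the experiment described above. It assumes a target where double arithmetic is really carried out in double (FLT_EVAL_METHOD == 0, as on typical x86-64 builds) and where long double is the Intel format with a 64-bit significand. The names round_trip and make_c are made up for illustration; volatile is only there to force every intermediate result to be rounded to the type under test.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Evaluates 1 + c - c - 1 from left to right, rounding every
// intermediate result to T. The volatile variable keeps the compiler
// from folding the expression or keeping it in a wider register.
template <typename T>
T round_trip(T c) {
    volatile T x = T(1) + c;
    x = x - c;
    x = x - T(1);
    return x;
}

// c = 2^-53 + 2^-64, i.e. 0x1p-53 + 0x1p-64.
inline double make_c() {
    const double c = std::ldexp(1.0, -53) + std::ldexp(1.0, -64);
    // The 2^-64 term must survive: c is exactly representable in double.
    assert(c > std::ldexp(1.0, -53));
    return c;
}
```

Under those assumptions, round_trip<double>(make_c()) returns 0 and round_trip<long double>(make_c()) returns -0x1p-64. On platforms where long double has another format, or where double expressions are evaluated in extended precision, the results may differ.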
Upvotes: 6
Reputation: 111130
About long double, the draft says:
3.9.1 Fundamental Types
8 There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
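To see what this means on a given implementation, std::numeric_limits can report how many significand digits each type carries. This is only a sketch: the struct Precisions and the function significand_bits are hypothetical names, and the ordering check assumes all three types use the same radix, which is the usual case.

```cpp
#include <cassert>
#include <limits>

// Significand width (in radix digits) of each floating point type.
struct Precisions {
    int float_bits;
    int double_bits;
    int long_double_bits;
};

inline Precisions significand_bits() {
    Precisions p = { std::numeric_limits<float>::digits,
                     std::numeric_limits<double>::digits,
                     std::numeric_limits<long double>::digits };
    // "At least as much precision" implies this ordering.
    assert(p.float_bits <= p.double_bits);
    assert(p.double_bits <= p.long_double_bits);
    return p;
}
```

On implementations where long double has the same representation as double, the last two fields are simply equal.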
As for promotions which is the next most interesting bit:
4.6 Floating point promotion
1 A prvalue of type float can be converted to a prvalue of type double. The value is unchanged.
2 This conversion is called floating point promotion.
Note that nothing is said about double to long double. I'd hazard that this is a slip, though.
Next, conversions, which are what we are interested in when going from long double to double:
4.8 Floating point conversions
1 A prvalue of floating point type can be converted to a prvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.
2 The conversions allowed as floating point promotions are excluded from the set of floating point conversions.
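As a small illustration of the "between two adjacent destination values" case, the sketch below converts a long double lying just above 1 to double. The function name narrow_just_above_one is made up, and the comments assume a long double with more precision than double; if the two types have the same format, the sum is already rounded to 1 before the conversion happens.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Converts 1 + 2^-60 from long double to double.
inline double narrow_just_above_one() {
    // With a 64-bit significand, 1 + 2^-60 is exact in long double but
    // lies strictly between the doubles 1 and 1 + DBL_EPSILON.
    volatile long double wide = 1.0L + std::ldexp(1.0L, -60);
    double narrow = static_cast<double>(wide);
    // The standard only promises one of the two neighbours.
    assert(narrow == 1.0 || narrow == 1.0 + DBL_EPSILON);
    return narrow;
}
```

Which neighbour comes back is the implementation-defined part; code that needs a specific answer should not depend on it.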
Now, let's see the effects of narrowing:
6. A narrowing conversion is an implicit conversion
[...]
- from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly)
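In practice the narrowing rule matters for list-initialization, where narrowing conversions are not allowed. The following sketch shows the exception for constant expressions next to the ordinary case; from_constant and from_variable are hypothetical names used only for this example.

```cpp
#include <cassert>

// Source is a constant expression whose value is within double's range:
// brace-initialization is accepted even though 0.1L need not be exactly
// representable as a double.
inline double from_constant() {
    constexpr long double k = 0.1L;
    double d{k};
    assert(d > 0.09 && d < 0.11);  // in range, possibly rounded
    return d;
}

// Source is not a constant expression: brace-initialization would be a
// narrowing conversion, so an explicit conversion is used instead.
inline double from_variable(long double x) {
    // double d{x};  // ill-formed: narrowing from long double to double
    double d = static_cast<double>(x);
    return d;
}
```

The explicit cast still performs the floating point conversion from 4.8; it only makes the possible loss of precision visible in the source.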
There are two takeaways from all this standardese:
[...] long double are in a range that cannot be represented accurately by a double (high or low), these can accumulate to return a different final result, which you will want to convert back to a double.
As for which is more accurate, I think that depends entirely on your application.
Upvotes: 0