Vincent

Reputation: 60381

C++ cast to a more precise type and lose accuracy?

Consider two ways of computing something:

  1. data in double precision
    -> apply a function with double precision temporaries
    -> return result
  2. data in double precision
    -> cast to long double
    -> apply a function with long double precision temporaries
    -> cast to double
    -> return result

Can the second approach give a less accurate result than the first one, and if so, in what cases?

Upvotes: 4

Views: 555

Answers (2)

Eric Postpischil

Reputation: 222724

Yes. Proof: Let c = 0x1p-53 + 0x1p-64. Evaluate 1+c-c-1 in double and in long double (of the common Intel format, with a 64-bit significand). In double, the result is 0, which is the mathematically exact answer. In long double, the result is -0x1p-64, which is wrong (and remains wrong when cast to double).

In double, 1+c adds slightly more than half the ULP (unit of least precision) of 1 to 1, so it produces 1 plus an ULP. Subtracting c subtracts slightly more than half an ULP, so the closest representable number (in double) to the result is 1, so 1 is produced. Then subtracting 1 yields 0.

In long double, 1+c adds 0x1p-53 plus half an ULP of 1. (In long double, the ULP of 1 is 0x1p-63.) Since the result is exactly the same distance from the two nearest representable numbers (in long double), the one with the low bit zero is returned, 1+0x1p-53. Then the exact result of subtracting c is 1 - 0x1p-64. This is exactly representable, so it is returned. Finally, subtracting 1 yields -0x1p-64.

Upvotes: 6

dirkgently

Reputation: 111130

About long double, the draft standard says:

3.9.1 Fundamental Types

8 There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.

As for promotions, which are the next most interesting bit:

4.6 Floating point promotion

1 A prvalue of type float can be converted to a prvalue of type double. The value is unchanged.

2 This conversion is called floating point promotion.

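Note that nothing is said here about double to long double. I'd hazard that this is a slip, though. Either way, since the values of double are a subset of the values of long double (3.9.1 above), converting a double to long double cannot change the value.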

Next, conversions, which are what we are interested in when going from long double to double:

4.8 Floating point conversions

1 A prvalue of floating point type can be converted to a prvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.

2 The conversions allowed as floating point promotions are excluded from the set of floating point conversions.

Now, let's see the effects of narrowing:

6. A narrowing conversion is an implicit conversion

[...]

  • from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly)

There are two takeaways from all this standardese:

  • Combining the bit about narrowing with the bit about implementation-defined conversions, your results may differ across platforms.
  • If your intermediate long double results take values that a double cannot represent accurately (too large or too small), the differences can accumulate and change the final result that you convert back to double.

As for which is more accurate, I think that depends entirely on your application.

Upvotes: 0
