Bobo Feugo

Reputation: 162

Precision of floating-point data types in C++

Why doesn't the precision of floating-point data types grow in proportion to their size? E.g.:

std::cout << sizeof(float) << "\n";  // this gives 4 on my machine "debian 64 bit" with "gcc 6.3.0"  
std::cout << std::numeric_limits<float>::digits10  << "\n"; // gives 6

std::cout << sizeof(double) << "\n";  // gives 8
std::cout << std::numeric_limits<double>::digits10 <<  "\n"; // gives 15

std::cout << sizeof(long double) <<  "\n";  // gives 16
std::cout << std::numeric_limits<long double>::digits10  << "\n"; // gives 18

As you can see, the precision of double is about twice that of float, which makes sense because double is twice the size of float.

But that is not the case between double and long double: long double is 128 bits, twice the size of the 64-bit double, yet its precision is only three digits more!

I have no idea how floating-point numbers are implemented, but from a rational standpoint does it even make sense to use 64 bits more of memory for only three digits of precision?!

I searched around but was not able to find a simple, straightforward answer. Could someone explain why long double has only three more digits of precision than double, and why this is not the same as between float and double?

I also want to know how I can get better precision without defining my own data type, which is obviously going to come at the expense of performance.

Upvotes: 2

Views: 3910

Answers (3)

phuclv

Reputation: 41962

There are many incorrect assumptions in your question.

First, there's no requirement regarding types' sizes in C++. The standard only mandates a minimum precision of each type and that...

... The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf

Most modern implementations map float and double to the IEEE-754 single- and double-precision formats, though, as hardware support for them is mainstream. However, long double doesn't have such wide support, because few people need higher precision than double and the hardware for it costs a lot more. Therefore some platforms map it to IEEE-754 double precision, i.e. the same as double. Some others map it to the 80-bit IEEE-754 extended-precision format if the underlying hardware supports it. Otherwise long double will be represented by double-double arithmetic or IEEE-754 quadruple precision.

Moreover, precision doesn't scale linearly with the number of bits in the type. It's easy to see that double is more than twice as precise as float and has an exponent range 8 times wider, despite only twice the storage, because it has 53 bits of significand compared to 24 in float, and 3 more exponent bits. Types can also have trap representations or padding bits, so different types may have different ranges even though they have the same size and belong to the same category (integral or floating-point).

So the important thing here is std::numeric_limits<long double>::digits. If you print that you'll see that long double has 64 bits of significand, which is just 11 bits more than double. See it live. That means your compiler uses the 80-bit extended-precision format for long double; the rest is just padding bytes to maintain alignment. In fact, gcc has various options that will change your output:

  • -malign-double and -mno-align-double for controlling the alignment of long double
  • -m96bit-long-double and -m128bit-long-double for changing the padding size
  • -mlong-double-64, -mlong-double-80 and -mlong-double-128 for controlling the underlying long double implementation

By changing the options you'll get the below results for long double

You'll get size = 10 if you disable padding, but that comes at a performance cost due to misalignment. See more demos on Compiler Explorer.

On PowerPC you can see the same phenomenon when changing the floating-point format. With -mabi=ibmlongdouble (double-double arithmetic, which is the default) you'll get (size, digits10, digits2) = (16, 31, 106), but with -mabi=ieeelongdouble the tuple becomes (16, 33, 113).

For more information you should read https://en.wikipedia.org/wiki/Long_double

I also want to know how I can get better precision without defining my own data type

The keyword to search for is arbitrary-precision arithmetic. There are various libraries for that, which you can find in the List of arbitrary-precision arithmetic software. You can find more information under the related tags.

Upvotes: 2

Eric Postpischil

Reputation: 224052

The C++ standard does not set fixed requirements for floating-point types, aside from some minimum levels they must meet.

Likely the C++ implementation you are using targets an Intel processor. Aside from the common IEEE-754 basic 32-bit and 64-bit binary floating-point formats, Intel has an 80-bit format. Your C++ implementation is probably using that for long double.

Intel’s 80-bit format has 11 more bits for the significand than the 64-bit double format does. (It actually uses 64 where the double format uses 52, but one of them is reserved for an explicit leading 1.) 11 more bits means 2^11 = 2048 times as many significand values, which is about three more decimal digits.

The 80-bit format (which is ten bytes) is preferentially aligned to multiples of 16 bytes, so six bytes of padding are included to make the long double size a multiple of 16 bytes.

Upvotes: 2

Daniel Jour

Reputation: 16156

"Precision" is not all there is to a floating-point value. It's also about "magnitude" (not sure if that term is correct though!): how big (or small) can the represented values become?

For that, try printing also the max_exponent of each type:

std::cout << "float: " << sizeof(float) << "\n";
std::cout << std::numeric_limits<float>::digits << "\n";
std::cout << std::numeric_limits<float>::max_exponent << "\n";

std::cout << "double: " << sizeof(double) << "\n";
std::cout << std::numeric_limits<double>::digits << "\n";
std::cout << std::numeric_limits<double>::max_exponent << "\n";

std::cout << "long double: " <<  sizeof(long double) << "\n";
std::cout << std::numeric_limits<long double>::digits << "\n";
std::cout << std::numeric_limits<long double>::max_exponent << "\n";

Output on ideone:

float: 4
24
128
double: 8
53
1024
long double: 16
64
16384

So the extra bits are not all used to represent more digits (precision); some of them allow the exponent to be larger. Using the wording from IEEE 754, long double mostly increases the exponent range rather than the precision.

The format shown by my ideone sample above is (probably) the "x86 extended precision format", which assigns 1 bit to the sign, 1 bit to the integer part, 63 bits to the fraction part (together 64 binary digits of significand) and 15 bits to the exponent (2^(15-1) = 16384; the exponent is stored with a bias, so roughly half of its range covers negative exponents).

Note that the C++ standard only requires long double to be at least as precise as double, so long double could be a synonym for double, the x86 extended precision format shown above (most likely), or something better (AFAIK only GCC on PowerPC).

I also want to know how I can get better precision without defining my own data type, which is obviously going to come at the expense of performance.

You need to either write it on your own (surely a learning experience, best not to do for production code) or use a library, like GNU MPFR or Boost.Multiprecision.

Upvotes: 2
