Reputation: 162
Why doesn't the precision of floating-point data types grow in proportion to their size? E.g.:
std::cout << sizeof(float) << "\n"; // this gives 4 on my machine "debian 64 bit" with "gcc 6.3.0"
std::cout << std::numeric_limits<float>::digits10 << "\n"; // gives 6
std::cout << sizeof(double) << "\n"; // gives 8
std::cout << std::numeric_limits<double>::digits10 << "\n"; // gives 15
std::cout << sizeof(long double) << "\n"; // gives 16
std::cout << std::numeric_limits<long double>::digits10 << "\n"; // gives 18
As you can see, the precision of `double` is about twice the precision of `float`, and that makes sense since the size of `double` is twice the size of `float`.
But that's not the case between `double` and `long double`: the size of `long double` is 128 bits, twice that of the 64-bit `double`, yet its precision is only three decimal digits more!
I have no idea how floating-point numbers are implemented, but from a rational standpoint, does it even make sense to use 64 more bits of memory for only three extra digits of precision?
I searched around but was not able to find a simple, straightforward answer.
If someone could explain why the precision of `long double` is only three digits more than that of `double`, could you also explain why the same doesn't happen between `double` and `float`?
I also want to know how I can get better precision without defining my own data type, which would obviously come at the expense of performance.
Upvotes: 2
Views: 3910
Reputation: 41962
There are many incorrect assumptions in your question.

First, there's no requirement regarding types' sizes in C++. The standard only mandates a minimum precision for each type, and that...
> ... The type `double` provides at least as much precision as `float`, and the type `long double` provides at least as much precision as `double`. The set of values of the type `float` is a subset of the set of values of the type `double`; the set of values of the type `double` is a subset of the set of values of the type `long double`. The value representation of floating-point types is implementation-defined.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf
Most modern implementations map `float` and `double` to the IEEE-754 single- and double-precision formats, though, as hardware support for them is mainstream. However, `long double` doesn't have such wide support, because few people need more precision than `double`, and hardware for it costs a lot more. Therefore some platforms map it to IEEE-754 double precision, i.e. the same as `double`. Others map it to the 80-bit IEEE 754 extended-precision format if the underlying hardware supports it. Otherwise `long double` is represented by double-double arithmetic or IEEE-754 quadruple precision.
Moreover, precision doesn't scale linearly with the number of bits in the type. It's easy to see that `double` is more than twice as precise as `float` and has 8 times the exponent range despite only twice the storage, because it has 53 bits of significand compared to `float`'s 24, plus 3 more exponent bits. Types can also have trap representations or padding bits, so different types may have different ranges even though they have the same size and belong to the same category (integral or floating-point).
So the important thing here is `std::numeric_limits<long double>::digits`. If you print that, you'll see that `long double` has 64 bits of significand, which is just 11 bits more than `double`. See it live. That means your compiler uses the 80-bit extended-precision format for `long double`; the rest is just padding bytes to keep the alignment. In fact, gcc has various options that will change your output:
- `-malign-double` and `-mno-align-double` for controlling the alignment of `long double`
- `-m96bit-long-double` and `-m128bit-long-double` for changing the padding size
- `-mlong-double-64`, `-mlong-double-80` and `-mlong-double-128` for controlling the underlying `long double` implementation

By changing the options you'll get the below results for `long double`:

- `-mlong-double-128`: size = 16, digits10 = 33, digits2 = 113
- `-m96bit-long-double`: size = 12, digits10 = 18, digits2 = 64
- `-mlong-double-64`: size = 8, digits10 = 15, digits2 = 53

You'll get size = 10 if you disable padding, but that'll come at a performance cost due to misalignment. See more demos on Compiler Explorer.
On PowerPC you can also see the same phenomenon when changing the floating-point format. With `-mabi=ibmlongdouble` (double-double arithmetic, which is the default) you'll get (size, digits10, digits2) = (16, 31, 106), but with `-mabi=ieeelongdouble` the tuple becomes (16, 33, 113).
For more information you should read https://en.wikipedia.org/wiki/Long_double
> And I also want to know how can I get better precision, without defining my own data type
The keyword to search for is arbitrary-precision arithmetic. There are various libraries for it, which you can find in the List of arbitrary-precision arithmetic software. You can find more information under the tags bigint, biginteger or arbitrary-precision.
Upvotes: 2
Reputation: 224052
The C++ standard does not set fixed requirements for floating-point types, aside from some minimum levels they must meet.
Likely the C++ implementation you are using targets an Intel processor. Aside from the common IEEE-754 basic 32-bit and 64-bit binary floating-point formats, Intel has an 80-bit format. Your C++ implementation is probably using that for `long double`.
Intel’s 80-bit format has 11 more bits for the significand than the 64-bit `double` format does. (It actually uses 64 bits where the `double` format uses 52, but one of them is reserved for an explicit leading 1.) 11 more bits means 2^11 = 2048 times as many significand values, which is about three more decimal digits.
The 80-bit format (which is ten bytes) is preferentially aligned to multiples of 16 bytes, so six bytes of padding are included to make the `long double` size a multiple of 16 bytes.
Upvotes: 2
Reputation: 16156
"Precision" is not all there is to a floating-point value. It's also about "magnitude" (not sure if that term is correct, though!): how big (or small) can the represented values become?
For that, also try printing the `max_exponent` of each type:
std::cout << "float: " << sizeof(float) << "\n";
std::cout << std::numeric_limits<float>::digits << "\n";
std::cout << std::numeric_limits<float>::max_exponent << "\n";
std::cout << "double: " << sizeof(double) << "\n";
std::cout << std::numeric_limits<double>::digits << "\n";
std::cout << std::numeric_limits<double>::max_exponent << "\n";
std::cout << "long double: " << sizeof(long double) << "\n";
std::cout << std::numeric_limits<long double>::digits << "\n";
std::cout << std::numeric_limits<long double>::max_exponent << "\n";
Output on ideone:
float: 4
24
128
double: 8
53
1024
long double: 16
64
16384
So the extra bits are not all used to represent more digits (precision); they also allow the exponent to be larger. Using the wording from IEEE 754, `long double` mostly increases the exponent range rather than the precision.
The format shown by my ideone sample above is (probably) the "x86 extended precision format", which assigns 1 bit to the integer part and 63 bits to the fraction part (together 64 significand bits), and 15 bits to the biased exponent (hence max_exponent = 2^(15-1) = 16384).
Note that the C++ standard only requires `long double` to be at least as precise as `double`, so `long double` could be a synonym for `double`, the x86 extended precision format shown above (most likely), or something better (AFAIK only GCC on PowerPC).
> And I also want to know how can I get better precision, without defining my own data type which obviously going to be at expense of performance?
You need to either write it yourself (surely a learning experience, but best not done for production code) or use a library, like GNU MPFR or Boost.Multiprecision.
Upvotes: 2