mathguy

Reputation: 1518

How many digits can float8, float16, float32, float64, and float128 contain?

Numpy's dtype documentation only shows "x bits exponent, y bits mantissa" for each float type, but I couldn't translate that to exactly how many digits before/after the decimal point. Is there any simple formula/table to look this up in?

Upvotes: 23

Views: 42578

Answers (2)

Netch

Reputation: 4562

This is not as simple as one might expect. For the precision of the mantissa, there are generally two relevant values:

  1. Given a value in decimal representation, how many decimal digits can be guaranteed to be preserved if converted from a decimal to a selected binary format and back (with default rounding).

  2. Given a value in binary format, how many decimal digits are needed if the value is converted to decimal format and back to the original binary format (again, with default rounding) to get the original value unchanged.

In both cases, the decimal representation is treated as independent of the exponent, without leading and trailing zeros (for example, all of 0.0123e4, 1.23e2, 1.2300e2, 123, 123.0, 123000.000e-3 are 3 digits).

For 32-bit binary floats, these two values are 6 and 9 decimal digits, respectively. In C's <float.h>, these are FLT_DIG and FLT_DECIMAL_DIG. (Oddly, a 32-bit float preserves 7 decimal digits for most values, but there are exceptions, so only 6 are guaranteed.) In C++, look at std::numeric_limits<float>::digits10 and std::numeric_limits<float>::max_digits10, respectively.

For 64-bit binary floats, these are 15 and 17 (DBL_DIG and DBL_DECIMAL_DIG, respectively; and std::numeric_limits<double>::{digits10, max_digits10}).

General formulas for them (thanks to @MarkDickinson):

  • ${format}_DIG (digits10): floor((p-1)*log10(2))
  • ${format}_DECIMAL_DIG (max_digits10): ceil(1+p*log10(2))

where p is the number of binary digits in the mantissa (including the hidden one in the normalized IEEE 754 case).
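As a minimal sketch, the two formulas above can be evaluated directly in Python. The function names and the list of p values are my own choices for illustration (p = 11, 24, 53, 64 and 113 correspond to IEEE 754 half, single, double, the x86 80-bit extended format and IEEE 754 quad, respectively):

```python
import math

def digits10(p):
    """Decimal digits guaranteed to survive decimal -> binary -> decimal."""
    return math.floor((p - 1) * math.log10(2))

def max_digits10(p):
    """Decimal digits needed so that binary -> decimal -> binary is exact."""
    return math.ceil(1 + p * math.log10(2))

# p counts significand bits including the hidden bit.
formats = [
    ("float16", 11),
    ("float32", 24),
    ("float64", 53),
    ("x86 extended (80-bit)", 64),
    ("IEEE 754 binary128", 113),
]

for name, p in formats:
    print(f"{name:24s} p={p:3d}  digits10={digits10(p):2d}  "
          f"max_digits10={max_digits10(p):2d}")
```

For p = 24 and p = 53 this should reproduce the 6/9 and 15/17 pairs quoted above.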

Also, the C++ numeric limits page gives some mathematical explanation:

The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
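The roundtrip failure quoted above can be checked with a short Python sketch. It assumes that packing with the standard struct module's "f" format rounds a Python float to the nearest 32-bit float; the helper name to_float32 is mine:

```python
import struct

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

original = "8.589973e+09"               # 7 significant decimal digits
stored = to_float32(float(original))    # what a 32-bit float actually holds

seven_digits = format(stored, ".6e")    # print back with 7 significant digits
nine_digits = format(stored, ".8e")     # print back with 9 significant digits

print(original, "->", seven_digits)
# Reading the 9-digit string back recovers the very same 32-bit value.
print(to_float32(float(nine_digits)) == stored)
```

The 7-digit string does not come back unchanged, while 9 digits are enough to recover the stored binary value, which matches the two numbers (6 guaranteed, 9 needed) given for 32-bit floats.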

Look in the comments for the values for 16- and 128-bit floats (but see below for what a 128-bit float really is).

For the exponent, this is simpler because each of the border values (minimum normalized, minimum denormalized, maximum representable) is exact and can be easily obtained and printed.
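A rough sketch of how those border values follow from the bit layout, assuming an IEEE 754 binary format with the given number of exponent bits and a precision p that includes the hidden bit. The function name border_values is hypothetical, and because the results are held in Python floats the sketch only works for formats no wider than float64:

```python
import math
import sys

def border_values(exponent_bits, p):
    """Return (min subnormal, min normal, max finite) for an IEEE 754
    binary format; p is the precision including the hidden bit."""
    emax = 2 ** (exponent_bits - 1) - 1      # largest unbiased exponent
    emin = 1 - emax                          # smallest normal exponent
    min_subnormal = math.ldexp(1.0, emin - (p - 1))
    min_normal = math.ldexp(1.0, emin)
    # Largest significand is 2 - 2**(1 - p), i.e. all significand bits set.
    max_finite = math.ldexp(2.0 - math.ldexp(1.0, 1 - p), emax)
    return min_subnormal, min_normal, max_finite

for name, e, p in [("float16", 5, 11), ("float32", 8, 24), ("float64", 11, 53)]:
    print(name, border_values(e, p))

# Cross-check against what the interpreter reports for its own float (float64).
print(border_values(11, 53)[1:] == (sys.float_info.min, sys.float_info.max))
```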

@PaulPanzer suggested numpy.finfo. It gives the first of these values (${format}_DIG); maybe it is what you are looking for:

>>> numpy.finfo(numpy.float16).precision
3
>>> numpy.finfo(numpy.float32).precision
6
>>> numpy.finfo(numpy.float64).precision
15
>>> numpy.finfo(numpy.float128).precision
18

but on most systems (mine was Ubuntu 18.04 on x86-64) the value is confusing for float128: it is really the 80-bit x86 "extended" float with a 64-bit significand. A real IEEE 754 float128 has 112 stored significand bits, so the real value would be 33, but numpy presents another type under this name. See here for details: in general, float128 is a misnomer in numpy.

UPD3: you mentioned float8, but there is no such type in the IEEE 754 set. One could imagine such a type for some very specific purposes, but its range would be too narrow for general use.

Upvotes: 22

SREERAG R NANDAN

Reputation: 703

To keep it simple.

The number of significant decimal digits stays roughly constant across magnitudes; what changes is how many of them fall after the decimal point, which shrinks as the magnitude of the value grows.

Generally,

Data-Type | Precision
----------------------
float16   | 3 to 4
float32   | 6 to 9
float64   | 15 to 17
float128  | 18 to 34

if you understood don't forget to upvote the answer

Bitwise properties:

float16 : 1 sign bit, 5 exponent bits, and 10 fraction bits.

float32 : 1 sign bit, 8 exponent bits, and 23 fraction bits.

float64 : 1 sign bit, 11 exponent bits, and 52 fraction bits.

float128 : 1 sign bit, 15 exponent bits, and 112 fraction bits.
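To make the layouts above concrete, here is a small sketch that splits a value into its sign, exponent and fraction fields using the standard struct module. The helper split_fields and the LAYOUTS table are hypothetical names of mine; float128 is left out because struct has no format code for it:

```python
import struct

# struct format code -> (total bits, exponent bits, fraction bits, same-size uint code)
LAYOUTS = {
    "e": (16, 5, 10, "H"),   # float16
    "f": (32, 8, 23, "I"),   # float32
    "d": (64, 11, 52, "Q"),  # float64
}

def split_fields(value, fmt):
    """Return the raw (sign, biased exponent, fraction) bit fields of value."""
    total_bits, exp_bits, frac_bits, int_fmt = LAYOUTS[fmt]
    # Reinterpret the float's bytes as an unsigned integer of the same width.
    (raw,) = struct.unpack("<" + int_fmt, struct.pack("<" + fmt, value))
    sign = raw >> (total_bits - 1)
    exponent = (raw >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = raw & ((1 << frac_bits) - 1)
    return sign, exponent, fraction

for fmt in ("e", "f", "d"):
    print(fmt, split_fields(-1.5, fmt))
```

The exponent field is biased (by 15, 127 and 1023 for the three formats), and the fraction field excludes the hidden leading bit, which is why the precision p used elsewhere is one more than the fraction width.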

Upvotes: 21
