Reputation: 155
I am trying to figure out exactly how big a number I can store in a float and in a double, but the values are not stored the way I expected, except for the integer. A double should hold 8 bytes of information, which is enough to hold the variable a, but it does not hold it correctly: it shows 1234567890123456768, in which the last 2 digits are different. And when I stored 2147483648, or changed its last digit to any other digit, in the float variable b, it shows the same value, 2147483648, which is supposed to be the limit. So what's going on?
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a;
    float b;
    int c;

    a = 1234567890123456789;
    b = 2147483648;
    c = 2147483647;

    /* sizeof yields a size_t, which is printed with %zu rather than %d */
    printf("Bytes of double: %zu\n", sizeof(double));
    printf("Bytes of integer: %zu\n", sizeof(int));
    printf("Bytes of float: %zu\n", sizeof(float));
    printf("\n");
    printf("You can count up to %.0f in 4 bytes\n", pow(2,32));
    printf("You can count up to %.0f with + or - sign in 4 bytes\n", pow(2,31));
    printf("You can count up to %.0f in 8 bytes\n", pow(2,64));
    printf("You can count up to %.0f with + or - sign in 8 bytes\n", pow(2,63));
    printf("\n");
    printf("double number: %.0f\n", a);
    printf("floating point: %.0f\n", b);
    printf("integer: %d\n", c);
    return 0;
}
Upvotes: 6
Views: 22585
Reputation: 42149
The answer to the question of what is the largest (finite) number that can be stored in a floating point type would be FLT_MAX or DBL_MAX for float and double, respectively.
However, that doesn't mean that the type can precisely represent every smaller number or integer (in fact, not even close).
First you need to understand that not all bits of a floating point number are “equal”. A floating point number has an exponent (8 bits in an IEEE-754 standard float, 11 bits in a double) and a mantissa (23 and 52 bits in float and double, respectively). The number is obtained by multiplying the mantissa (which has an implied leading 1-bit and binary point) by 2^exponent (after adjusting the stored exponent by its bias; its binary value is not used directly). There is also a separate sign bit, so the following applies to negative numbers as well.
As the exponent changes, the distance between consecutive values of the mantissa changes as well, i.e., the greater the exponent, the further apart consecutive representable values of the floating point number are. Thus you may be able to store one number of a given magnitude precisely, but not the “next” number. One should also remember that some seemingly simple fractions cannot be represented precisely with any number of binary digits (e.g., 1/10, one tenth, is an infinitely repeating sequence in binary, like 1/3, one third, is in decimal).
When it comes to integers, you can precisely represent every integer up to a magnitude of 2^(mantissa_bits + 1). Thus an IEEE-754 float can represent all integers up to 2^24 and a double up to 2^53 (in the last half of these ranges the consecutive floating point values are exactly one integer apart, since the entire mantissa is used for the integer part only). There are individual larger integers that can be represented, but they are spaced more than one integer apart, i.e., you can represent some integers greater than 2^(mantissa_bits + 1), but every integer only up to that magnitude.
For example:
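#include <math.h>
#include <stdio.h>
int main(void) {
    /* 2^24 and 2^53 are the last points up to which every integer is representable */
    float f = powf(2.0f, 24.0f);
    float f1 = f + 1.0f, f2 = f + 2.0f;
    double d = pow(2.0, 53.0);
    double d1 = d + 1.0, d2 = d + 2.0;
    (void) printf("2**24 float = %.0f, +1 = %.0f, +2 = %.0f\n", f, f1, f2);
    (void) printf("2**53 double = %.0f, +1 = %.0f, +2 = %.0f\n", d, d1, d2);
    return 0;
}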
Outputs:
2**24 float = 16777216, +1 = 16777216, +2 = 16777218
2**53 double = 9007199254740992, +1 = 9007199254740992, +2 = 9007199254740994
As you can see, adding 1 to 2^(mantissa_bits + 1) makes no difference since the result is not representable, but adding 2 does produce the correct answer (as it happens, at this magnitude the representable numbers are two integers apart since the multiplier has doubled).
TL;DR An IEEE-754 float can precisely represent all integers up to 2^24 and double up to 2^53, but only some integers of greater magnitude (the spacing of representable values depends on the magnitude).
Upvotes: 13
Reputation: 12263
sizeof an object only reports the memory space it occupies. It does not show the valid range. It would be entirely possible to have an unsigned int with, e.g., 2**16 (65536) possible values occupy 32 bits in memory.
For floating point objects, it is more difficult. They consist of (simplified) two fields, an integer mantissa and an exponent (see details in the linked article), both with a fixed width.
As the mantissa only has a limited range, trailing bits are truncated or rounded and the exponent is corrected, if required. This is one reason one should never use floating point types to store precise values like currency.
In decimal (note: computers use binary representation) with 4 digit mantissa:
1000 --> 1.000e3
12345678 --> 1.234e7
The parameters for your implementation are defined in float.h, similar to limits.h, which provides parameters for integers.
Upvotes: 0
Reputation: 1468
You can print the actual limits of the standard POD types by printing the limits defined in the 'limits.h' header file for integer types and in 'float.h' for floating point types (for C++ the equivalent is 'std::numeric_limits').
Hardware cannot represent real numbers exactly; it represents a floating point value in a fixed number of bits. Since a floating point type does not have infinite length, a double variable can only be stored and shown to a specific precision. Most hardware uses the IEEE-754 standard to represent floating point types.
To get more precision you could try 'long double' (depending on the hardware this may offer more precision than double, up to quadruple precision), AVX or SSE registers, big-num libraries, or you could do it yourself.
Upvotes: 0
Reputation: 122493
sizeof(double) is 8, true, but double needs some bits to store the exponent part as well.
Assuming IEEE-754 is used, double can represent every integer precisely only up to 2^53, which is less than 1234567890123456789.
See also Double-precision floating-point format.
Upvotes: 7
Reputation: 8279
On Linux, #include <values.h>
On Windows, #include <float.h> (this header is standard C, so it is available on Linux as well)
There is a fairly comprehensive list of defines.
Upvotes: -2
Reputation: 56552
You can use these constants to find out what the limits are:
FLT_MAX
DBL_MAX
LDBL_MAX
Upvotes: 3