Reputation: 494
I'm new to programming and recently came up with this simple question. The float type has 32 bits, of which 8 bits are for the whole-number part (the mantissa). So my question is: can the float type hold numbers bigger than 255.9999?
I would also appreciate it if someone could tell me why this code behaves unexpectedly. Is it a related issue?
#include <stdio.h>

int main(void){
    float a=123456789.1;
    printf("%lf",a);
    return 0;
}
for which the output is:
123456792.000000
Upvotes: 2
Views: 31345
Reputation: 11
I noticed nobody answered the second part of your question, about why the code does not work as expected, so here is an answer for posterity's sake.
The answer to that is hidden in Eric's answer: a floating-point number is stored as a binary significand scaled by a power of two, so most decimal values (e.g. 9.2) can only be stored as the nearest representable value. Floating-point numbers are essentially an estimation of a number, not a precisely defined representation of the number.
For integers, we are simply counting up. A 3 follows a 2, which follows a 1, and that took 2 bits to store. To go to 4, however, we need a third bit. Floating point, by contrast, uses a binary form of scientific notation to store the number, which is how it is able to represent fractional values.
To borrow from this answer for a similar question, in double-precision floating-point notation the number "9.2" is actually this fraction:
5179139571476070 * 2⁻⁴⁹
That's not a perfect representation of 9.2, so depending on the size of the float it will be off by some small fraction.
Accuracy varies along the floating-point "number line": representable values are spaced closely near zero and farther apart at large magnitudes. If a value happens to fall exactly on one of those representable points (0.5, for example), the floating-point representation is exact.
If you need the exact number, you should use an integer. If you need an exact number that has decimals, one common option is to split out the decimal portion and use another integer for that portion of the number.
Upvotes: 1
Reputation: 222039
The most common 32-bit floating-point format, IEEE-754 binary32, does not have eight bits for the whole number part. It has one bit for a sign, eight bits for an exponent field, and 23 bits for a significand field (a fraction part).
The sign bit determines whether the number is positive (0) or negative (1).
The exponent field, e, has several uses. If it is 11111111 (in binary), and the significand field, f, is zero, the floating-point value represents infinity. If e is 11111111, and the significand field is not zero, it represents a special Not-a-Number “value”.
If the exponent is not 11111111 and is not zero, the floating-point value represents $2^{e-127}\cdot(1+f/2^{23})$, with the sign applied. Note that the fraction portion is formed by adding 1 to the contents of the significand field. That is often called an implicit 1, so the mathematical significand is 24 bits: 1 bit from the leading 1, 23 bits from the significand field.
If the exponent is zero, the floating-point value represents $2^{1-127}\cdot(0+f/2^{23})$, or the negative of that if the sign bit is 1. Note that the leading bit is 0. These are called subnormal numbers. They are included in the format to make some mathematical properties work in floating-point arithmetic.
The largest finite value represented is when the exponent is 11111110 (254) and the significand field is all ones (f is $2^{23}-1$), so the number represented is $2^{254-127}\cdot(1+(2^{23}-1)/2^{23}) = 2^{127}\cdot(2-2^{-23}) = 2^{128}-2^{104} = 340282346638528859811704183484516925440$.
In float a=123456789.1;, the float type does not have enough precision to represent 123456789.1. (In fact, a decimal fraction .1 can never be represented exactly with a binary floating-point format.) When we have only 24 bits for the significand, representable values between 67108864 and 134217728 are 8 apart, so the closest numbers to 123456789.1 that we can represent are 123456784 and 123456792. The latter is nearer, so it is the value that gets stored and printed.
Upvotes: 7
Reputation: 70213
<float.h> -- Numeric limits of floating point types has your answers, specifically...
- FLT_MAX
- DBL_MAX
- LDBL_MAX
maximum finite value of float, double and long double respectively
...and...
- FLT_DIG
- DBL_DIG
- LDBL_DIG
number of decimal digits that are guaranteed to be preserved in a text -> float/double/long double -> text roundtrip without change due to rounding or overflow
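That last part is meant to say that a decimal value with more significant digits than FLT_DIG is no longer guaranteed to survive conversion to float and back to text unchanged. The value 123456789.1 in the question has ten significant digits, which is more than FLT_DIG on common implementations.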
Upvotes: 5