Reputation: 497
Consider a 32-bit floating-point number (IEEE 754) that uses bits 0-22 for the mantissa (23 bits), bits 23-30 for the exponent (8 bits), and bit 31 for the sign (1 bit).
I want to find out the smallest positive number that can be stored.
I have been told the answer is 1.18 × 10^-38, which is approximately 2^-126.
My analysis is as follows:
If we put all zeroes in the mantissa and all ones in the exponent, then the decimal equivalent would be
1.0 × 2^-128 = 2.93 × 10^-39
Where am I going wrong?
Thanks
Upvotes: 4
Views: 18585
Reputation: 47952
I think of IEEE-754 numbers as being divided into three main categories: specials, normals, and subnormals. These categories are based on the value of the exponent, and there's also some substructure within each category.
Specials are those with the maximum exponent value, subnormals have an exponent that's the minimum, and normals are everything in between. We can summarize things in a table (with the specific values here being those for single-precision `float`, as you asked about):
exponent | significand | category | adjusted significand | adj. exp. |
---|---|---|---|---|
FF | nonzero | NaN | * | n/a |
FF | 0 | infinity | n/a | n/a |
01 – FE | anything | normals | (1)000000 – (1)7fffff | -126 – +127 |
00 | nonzero | subnormals | 000001 – 7fffff | -126 |
00 | 0 | zero | 0 | n/a |
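The categories in the table can be sketched as a small classifier over the raw 32-bit pattern. This is only an illustration; the function name `classify` and the sample bit patterns are my own choices, not anything defined by the standard.

```python
def classify(bits):
    """Classify a raw 32-bit single-precision pattern by its fields."""
    exponent = (bits >> 23) & 0xFF      # bits 23-30: raw (biased) exponent
    significand = bits & 0x7FFFFF       # bits 0-22: stored significand
    if exponent == 0xFF:                # maximum exponent: the specials
        return "NaN" if significand else "infinity"
    if exponent == 0x00:                # minimum exponent: subnormals or zero
        return "subnormal" if significand else "zero"
    return "normal"                     # everything in between


# A few sample patterns, one per row of the table.
for bits in (0x7FC00000, 0x7F800000, 0x3F800000, 0x00000001, 0x00000000):
    print(f"0x{bits:08x} -> {classify(bits)}")
```

The sign bit (bit 31) is ignored here because it does not affect which category a value falls into.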
The key is that the actual exponent for the normals runs from -126 to +127 (the raw exponent values `0x01` to `0xfe`, or 1 to 254, minus the bias of 127).

Now, you might think that for the subnormals, since the raw exponent is 0 and the exponent bias is 127, the actual exponent should be -127. (That's what I thought for a long time, too.) But that would leave a gap between the subnormals and the normals. So the exponent for the subnormals is -126, one higher than you might have expected, and it ends up matching the exponent for the smallest of the normals.
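To make the exponent rule concrete, here is a minimal sketch that decodes a bit pattern by hand and compares the result with what the `struct` module produces for the same bits. The helper names `decode` and `hardware_decode` are hypothetical, and the sketch assumes `struct`'s `"<f"` format is IEEE 754 single precision (which it is for standard-size packing).

```python
import struct

BIAS = 127


def decode(bits):
    """Compute the value of a finite single-precision pattern from its fields."""
    sign = -1.0 if bits >> 31 else 1.0
    raw_exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if raw_exp == 0xFF:
        raise ValueError("infinity and NaN have no finite value")
    if raw_exp == 0:
        # Subnormal (or zero): no implicit 1 bit, exponent fixed at 1 - BIAS = -126.
        return sign * (frac / 2**23) * 2.0 ** (1 - BIAS)
    # Normal: implicit leading 1, exponent is the raw value minus the bias.
    return sign * (1 + frac / 2**23) * 2.0 ** (raw_exp - BIAS)


def hardware_decode(bits):
    """Reinterpret the same 32 bits as a float via struct."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for bits in (0x00000001, 0x007FFFFF, 0x00800000, 0x3F800000, 0x7F7FFFFF):
    print(f"0x{bits:08x}: {decode(bits)!r} vs {hardware_decode(bits)!r}")
```

If the subnormal branch used `0 - BIAS` instead of `1 - BIAS`, the hand-decoded subnormals would come out at half the values `struct` reports.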
So what do these ranges work out to?
For normals, the maximum raw significand is `0x7fffff`, or `0xffffff` with the implicit 1 bit added, which as a fraction is `0x1.fffffe`, or 1.99999988079071044921875. The minimum raw significand is `0x000000`, or `0x800000` with the implicit 1 bit added, which is `0x1.000000`, or 1.0.
For subnormals, the maximum raw significand is `0x7fffff`, which as a fraction is `0x0.fffffe`, or 0.99999988079071044921875. The minimum raw significand is `0x000001`, which is `0x0.000002`, or 0.00000011920928955078125.
Putting this all together with the maximum and minimum exponent values, we have:
threshold | derivation | decimal | hex |
---|---|---|---|
max normal | 1.99999988 × 2^127 | 3.4028234663852885981e+38 | `0xf.fffff0E+31` |
min normal | 1.0 × 2^-126 | 1.175494350822287508e-38 | `0x4.000000E-32` |
max subnormal | 0.99999988 × 2^-126 | 1.175494210692441075e-38 | `0x3.fffff8E-32` |
min subnormal | 0.000000119 × 2^-126 | 1.401298464324817071e-45 | `0x8.000000E-38` |
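These four thresholds can be reproduced from their bit patterns. The following is a sketch rather than anything authoritative: `from_bits` is a hypothetical helper, and the four hex patterns are the ones implied by the exponent and significand ranges above.

```python
import struct


def from_bits(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Raw patterns: sign 0, then the extreme exponent/significand combinations.
thresholds = {
    "max normal":    0x7F7FFFFF,  # exponent FE, significand 7fffff
    "min normal":    0x00800000,  # exponent 01, significand 000000
    "max subnormal": 0x007FFFFF,  # exponent 00, significand 7fffff
    "min subnormal": 0x00000001,  # exponent 00, significand 000001
}

for name, bits in thresholds.items():
    print(f"{name:14s} 0x{bits:08x}  {from_bits(bits):.19e}")
```

Python floats are doubles, so every single-precision value is held exactly and the printed digits are not affected by a second rounding.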
So when you heard that the smallest `float` was 1.18 × 10^-38, someone was evidently talking about the smallest normal number and ignoring the existence of the subnormals. As you can see, the smallest of the subnormals is quite a bit smaller.
In this table we can also see why the exponent for the subnormals has to be -126, not -127. The subnormals are supposed to cover the range between the smallest normal and zero. With an exponent of -126, they do that uniformly and well. If the exponent for the subnormals were -127, on the other hand, the largest subnormal would be 0.9999998 × 2^-127 = 5.877471053462205377e-39 or `0x1.fffffcE-32`, which is already halfway down the slope to zero (so to speak), with the rest of the subnormals jammed in below that, leaving a "big" gap between 1.175e-38 and 5.877e-39. Wikipedia's "subnormal number" page has a nice picture illustrating the way the subnormal numbers fill the gap near 0.
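The size of that hypothetical gap can be checked with plain arithmetic. In this sketch the variable names are mine, and `max_sub_hypothetical` models the -127 exponent that IEEE 754 does not actually use.

```python
ULP = 2.0 ** -23  # weight of the lowest significand bit

min_normal = 1.0 * 2.0 ** -126
max_sub_actual = (1 - ULP) * 2.0 ** -126        # what IEEE 754 really does
max_sub_hypothetical = (1 - ULP) * 2.0 ** -127  # NOT IEEE 754: exponent -127

step = ULP * 2.0 ** -126                         # spacing between adjacent subnormals
gap_actual = min_normal - max_sub_actual
gap_hypothetical = min_normal - max_sub_hypothetical

print(f"actual gap:       {gap_actual:.6e} ({gap_actual / step} steps)")
print(f"hypothetical gap: {gap_hypothetical:.6e} ({gap_hypothetical / step} steps)")
```

With the real exponent of -126, the distance from the largest subnormal to the smallest normal is exactly one step, the same spacing the subnormals have among themselves. With -127 it would be a jump of more than four million of those steps.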
See also this question for more on how IEEE-754 floating-point values are constructed.
Footnote: Where I've used a notation like `0x1.fffffe` in this answer, that's a base-16 fraction, which of course is not something your C compiler would accept. And then `0xf.fffffE+31` is hexadecimal scientific notation, where the exponent is a power of 16, and the `E` is not a hexadecimal digit that's part of the significand. This is sort of like the `printf`/`scanf` format `%a`, although `%a` uses `p` to mark its exponent, which is a power of 2.
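For comparison, Python's `float.fromhex` and `float.hex` use the same `p` convention as `%a`, so the thresholds can be written with a power-of-2 exponent. This is just a sketch: Python floats are doubles, so `hex()` prints 13 fraction digits, and `to_single` is a hypothetical helper that round-trips a value through single precision.

```python
import struct


def to_single(x):
    """Round x to single precision and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


max_normal = float.fromhex("0x1.fffffep127")   # 1.fffffe (hex) times 2**127
min_normal = float.fromhex("0x1p-126")
min_subnormal = float.fromhex("0x1p-149")

for value in (max_normal, min_normal, min_subnormal):
    # Each value survives the round trip, so it is exactly representable as a float32.
    print(value.hex(), to_single(value) == value)
```

Note that `0x1p-149` is written with a leading 1 here because it is a perfectly normal number as a double; it is only subnormal in single precision.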
Upvotes: 6
Reputation: 2547
Although an 8-bit exponent with a bias of 127 nominally spans -127 to +128, two raw values are reserved for special cases (see here), so the most negative exponent is -126.
BTW, the exponent field of IEEE 754 is stored in biased (excess-127) form, not two's complement, so an all-zero field nominally corresponds to -127, never -128.
Upvotes: -2
Reputation: 4189
If you put all ones in the exponent you will get `NaN` if the mantissa is nonzero, or infinity if the mantissa is 0 (see Wikipedia, IEEE 754). Also, your minimal value lies in the denormal (subnormal) range, where the exponent field is all zeros.
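A quick way to see this is to build the bit patterns and let the machine interpret them. The sketch below assumes `struct`'s `"<f"` format is IEEE 754 single precision; `from_bits` and the constant names are hypothetical.

```python
import math
import struct


def from_bits(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


ALL_ONES_EXP_ZERO_MANTISSA = 0x7F800000     # the pattern described in the question
ALL_ONES_EXP_NONZERO_MANTISSA = 0x7FC00000  # same exponent, top mantissa bit set
ZERO_EXP_SMALLEST_MANTISSA = 0x00000001     # exponent field 0, mantissa 1

print(from_bits(ALL_ONES_EXP_ZERO_MANTISSA))     # inf
print(from_bits(ALL_ONES_EXP_NONZERO_MANTISSA))  # nan
print(from_bits(ZERO_EXP_SMALLEST_MANTISSA))     # a tiny positive denormal
```

So the all-ones exponent never yields a small number; the smallest positive values come from the all-zeros exponent instead.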
Upvotes: 2