Reputation: 6276
Can someone please clarify:
Why exactly is the format of subnormal numbers ±(0.F) × 2^-126 and not ±(1.F) × 2^-127?
Why exactly is the format of normal numbers ±(1.F) × 2^exp and not, say, ±(11.F) × 2^exp, or, say, ±(10.F) × 2^exp?
Upvotes: 0
Views: 351
Reputation: 6276
I checked the properties of both formats using a simplified example. For simplicity, I use the formats 0.F × 10^-2 and 1.F × 10^-3, where F has 2 decimal digits and there is no ± sign.
Min (non-zero) / max values:

| Format | Min value (non-zero) | Max value |
|---|---|---|
| 0.F × 10^-2 | 0.01 × 10^-2 = 0.0001 | 0.99 × 10^-2 = 0.0099 |
| 1.F × 10^-3 | 1.00 × 10^-3 = 0.001 | 9.99 × 10^-3 = 0.00999 |
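The min/max values in the table can be checked by brute force. This is only a minimal sketch: the helper `toy_values` is a hypothetical name, and it assumes that "1.F" in the decimal toy format means any non-zero leading digit (1 to 9), which is what the table's max value 9.99 × 10^-3 implies.

```python
from fractions import Fraction

def toy_values(leading_digits, exponent):
    """All non-zero values d.FF x 10^exponent for the given leading digits d."""
    values = []
    for d in leading_digits:
        for ff in range(100):                      # FF = 00 .. 99
            significand = Fraction(d) + Fraction(ff, 100)
            value = significand * Fraction(10) ** exponent
            if value != 0:                         # the table lists non-zero minima
                values.append(value)
    return values

zero_lead = toy_values([0], -2)                    # 0.F x 10^-2
one_lead = toy_values(range(1, 10), -3)            # 1.F x 10^-3, leading digit 1..9

print(float(min(zero_lead)), float(max(zero_lead)))
print(float(min(one_lead)), float(max(one_lead)))
```

Exact fractions are used so that the comparison is not disturbed by binary rounding of decimal values.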
Here is the graphical representation:
Here we see that the format 1.F × 10^-3 cannot represent non-zero values smaller than 0.001, whereas the format 0.F × 10^-2 can represent smaller values, down to 0.0001. Here is the zoomed-in version:
Conclusion: from the graphical representation we see that the properties of the format 0.F × 10^-2 compared with the format 1.F × 10^-3 are:
more dynamic range, log10(max_real / min_real): 1.99 vs 0.99
less precision, number of representable values: 100 vs 900
It seems that for subnormals IEEE 754 preferred more dynamic range at the cost of less precision. That is why the format of subnormal numbers is ±(0.F) × 2^-126 and not ±(1.F) × 2^-127.
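The two comparison figures can be recomputed directly. This is a rough sketch, assuming that "100 vs 900" counts the distinct significands of each toy format (0.00 to 0.99 and 1.00 to 9.99); the variable names are mine.

```python
import math

# Toy formats from the example: F has 2 decimal digits, no sign.
min_zero_lead, max_zero_lead = 0.01e-2, 0.99e-2    # 0.F x 10^-2
min_one_lead, max_one_lead = 1.00e-3, 9.99e-3      # 1.F x 10^-3

# Dynamic range as log10(max / min) over the non-zero values.
range_zero_lead = math.log10(max_zero_lead / min_zero_lead)
range_one_lead = math.log10(max_one_lead / min_one_lead)

# Number of distinct significands (assumed meaning of "100 vs 900").
count_zero_lead = 100       # 0.00 .. 0.99, includes zero
count_one_lead = 9 * 100    # 1.00 .. 9.99

print(f"{range_zero_lead:.4f} vs {range_one_lead:.4f}")
print(count_zero_lead, "vs", count_one_lead)
```

The logarithms are log10(99) and log10(9.99), so the figures quoted in the post look truncated rather than rounded.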
Upvotes: 0
Reputation: 222901
A floating-point format represents numbers using a sign (− or +), an exponent (an integer in some range $e_{\min}$ to $e_{\max}$, inclusive), and a significand that is a numeral of $p$ digits in base $b$, where $b$ is the fixed base for the format and $p$ is called the precision. We will consider a binary format, in which $b$ is two.
Let the digits of the significand be $f_0, f_{-1}, f_{-2}, \ldots, f_{1-p}$, so the significand is $\sum_{-p<i\le 0} f_i \cdot b^i$, and the value represented is $(-1)^s \cdot b^e \cdot \sum_{-p<i\le 0} f_i \cdot b^i$, where $s$ is a bit for the sign and $e$ is the exponent.
If $f_0$ is zero, we can omit it from the sum, and the value represented equals $(-1)^s \cdot b^e \cdot \sum_{-p<i\le -1} f_i \cdot b^i = (-1)^s \cdot b^{e-1} \cdot \sum_{-p<i\le -1} f_i \cdot b^{i+1} = (-1)^s \cdot b^{e-1} \cdot \sum_{1-p<i\le 0} f_{i-1} \cdot b^i$. Therefore, when $f_0$ is zero and $e$ is not $e_{\min}$, there are two representations of the number. Encoding both of them would be wasteful, so we desire an encoding scheme that does not encode both representations.
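The duplicate representation is easy to see with a small numeric check. This is a sketch only: the helper `value` and the 4-digit example are illustrative choices, not part of any standard.

```python
from fractions import Fraction

def value(sign, exponent, digits, base=2):
    """(-1)^sign * base^exponent * sum of digits[k] * base^(-k), k = 0..p-1.

    digits[0] is f_0, digits[1] is f_{-1}, and so on.
    """
    significand = sum(
        Fraction(d) * Fraction(base) ** (-k) for k, d in enumerate(digits)
    )
    return (-1) ** sign * Fraction(base) ** exponent * significand

# p = 4 digits with f_0 = 0: 0.101 (binary) x 2^3
a = value(0, 3, [0, 1, 0, 1])
# Digits shifted left one place, exponent lowered by one: 1.010 (binary) x 2^2
b = value(0, 2, [1, 0, 1, 0])

print(a, b, a == b)
```

Both calls describe the same number, which is the redundancy an encoding scheme wants to avoid.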
We accomplish this:
This representation and this encoding scheme answer the questions:
Why exactly the format of subnormal numbers is ±(0.F) × 2^-126 and not ±(1.F) × 2^-127?
Subnormal numbers of the form ±(1.F) × 2^-127 would fail to include zero and would include numbers outside the format's set of represented numbers, because they would have non-zero digits below the lowest digit position of the chosen set. (The lowest digit of the form described in the first paragraph corresponds to $b^{e_{\min}+(1-p)}$, whereas numbers in the form ±(1.F) × 2^-127 would have a lowest digit corresponding to $b^{e_{\min}-1+(1-p)}$.)
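For IEEE 754 binary32 (p = 24, minimum exponent −126), both points can be checked by decoding bit patterns. This is a hedged sketch: `decode_binary32`, `subnormal_value`, and `hypothetical_value` are names introduced here, and the 1.F × 2^-127 decoder is purely hypothetical.

```python
import math
import struct

def decode_binary32(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def subnormal_value(fraction_field):
    """Value of 0.F x 2^-126, where F is the 23-bit fraction field."""
    return math.ldexp(fraction_field, -23 - 126)

def hypothetical_value(fraction_field):
    """Value of the hypothetical form 1.F x 2^-127 (not what IEEE 754 uses)."""
    return math.ldexp((1 << 23) + fraction_field, -23 - 127)

for f in (0, 1, 0x400000, 0x7FFFFF):
    # Sign 0 and exponent field 0: the bit pattern is just the fraction field.
    assert decode_binary32(f) == subnormal_value(f)

print(subnormal_value(0))       # the 0.F form includes zero
print(hypothetical_value(0))    # the 1.F form cannot produce zero

# 1.F x 2^-127 with F = 1 needs a digit at 2^-150, one place below
# the lowest digit position 2^-149 of the format.
spacing = math.ldexp(1.0, -149)
print((hypothetical_value(1) / spacing).is_integer())
```

All the values involved are exactly representable as Python floats (binary64), so the equality checks are exact.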
Why exactly the format of normal numbers is: ±(1.F) × 2^exp and not, say, ±(11.F) × 2^exp, or, say, ±(10.F) × 2^exp?
Where the decimal point (or “radix point”) lies in the significand is irrelevant, as long as it is fixed. A representation described using the decimal point just after the first digit, as used herein, is equivalent to a representation using decimal point after the last digit or at any other position, with a suitable adjustment to the exponent bounds: The same set of numbers is represented and the arithmetic properties are identical. So, in considering the difference between 1.F and 11.F, we do not care where the decimal point lies. However, we do care about how many digits are represented. A floating-point format uses a representation with a fixed number of digits. 11.F has one more digit than 1.F, and we have no reason to encode that.
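The claim that the position of the radix point does not matter can be checked on a toy format. A minimal sketch, assuming a hypothetical precision of 3 binary digits and exponents from −2 to 2 (values chosen only to keep the sets small):

```python
from fractions import Fraction

P = 3                   # precision in binary digits (hypothetical toy value)
E_MIN, E_MAX = -2, 2    # hypothetical toy exponent range

# Radix point after the first digit: d.dd x 2^e, significand = m / 2^(P-1)
point_after_first = {
    Fraction(m, 2 ** (P - 1)) * Fraction(2) ** e
    for m in range(2 ** P)
    for e in range(E_MIN, E_MAX + 1)
}

# Radix point after the last digit: ddd. x 2^q, exponent bounds shifted by P-1
point_after_last = {
    Fraction(m) * Fraction(2) ** q
    for m in range(2 ** P)
    for q in range(E_MIN - (P - 1), E_MAX - (P - 1) + 1)
}

print(point_after_first == point_after_last)
```

Only the exponent bounds change between the two descriptions; the set of represented numbers is the same.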
As for the difference between 11.F and 10.F, the normal/subnormal distinction exists because arithmetically there are two representations of the same number if the first digit is zero and the exponent is not at the minimum. Specifying one form as the normal form allows us to eliminate these duplicate representations. However, 11.F and 10.F represent different numbers, so there is no duplicate to eliminate and no reason to say one of these is normal and the other is not.
Upvotes: 0