pmor

Reputation: 6276

IEEE 754: rationale for format: subnormal and normal numbers

Can someone please clarify:

  1. Why exactly is the format of subnormal numbers ±(0.F) × 2^-126 and not ±(1.F) × 2^-127?
  2. Why exactly is the format of normal numbers ±(1.F) × 2^exp and not, say, ±(11.F) × 2^exp or ±(10.F) × 2^exp?

Upvotes: 0

Views: 351

Answers (2)

pmor

Reputation: 6276

I checked the properties of both formats using a simplified example. For simplicity, I use the formats 0.F × 10^-2 and 1.F × 10^-3, where F has 2 decimal digits and the ± sign is omitted.

Min (non-zero) / max values:

Format          Min value (non-zero)           Max value
0.F × 10^-2     0.01 × 10^-2 = 0.0001          0.99 × 10^-2 = 0.0099
1.F × 10^-3     1.00 × 10^-3 = 0.001           9.99 × 10^-3 = 0.00999

Here is the graphical representation:

[Figure: graphical representation of the values representable in each format]

Here we see that the format 1.F × 10^-3 cannot represent any non-zero value smaller than 0.001, whereas the format 0.F × 10^-2 can. Here is the zoomed-in version:

[Figure: zoomed-in view of the same graphical representation]

Conclusion: from the graphical representation we see that, compared with the format 1.F × 10^-3, the format 0.F × 10^-2:

  1. gives more dynamic range, measured as log10(max_real / min_real): 1.99 vs 0.99
  2. gives less precision, since fewer values can be represented: 100 vs 900

It seems that for subnormals IEEE 754 preferred more dynamic range at the cost of less precision. That would be why the format of subnormal numbers is ±(0.F) × 2^-126 and not ±(1.F) × 2^-127.
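The numbers in the table and in the conclusion can be reproduced with a short enumeration. This is only a sketch of the toy decimal formats above; the helper names `toy_values` and `summarize` are made up for illustration, and "1.F" is taken to mean a leading digit from 1 to 9, as the max value 9.99 × 10^-3 in the table implies.

```python
from fractions import Fraction
from math import log10

def toy_values(leading_digits, exponent):
    """All non-zero values d.FF x 10^exponent for the allowed leading digits d."""
    scale = Fraction(10) ** exponent
    values = set()
    for d in leading_digits:
        for frac in range(100):                  # F = 00 .. 99
            v = (d + Fraction(frac, 100)) * scale
            if v != 0:
                values.add(v)
    return sorted(values)

def summarize(values):
    """Return (min, max, count, dynamic range in decades)."""
    lo, hi = values[0], values[-1]
    return lo, hi, len(values), log10(float(hi / lo))

subnormal_like = summarize(toy_values([0], -2))        # 0.F x 10^-2
normal_like = summarize(toy_values(range(1, 10), -3))  # 1.F x 10^-3

for name, (lo, hi, count, decades) in [("0.F x 10^-2", subnormal_like),
                                       ("1.F x 10^-3", normal_like)]:
    print(f"{name}: min={float(lo)} max={float(hi)} "
          f"non-zero values={count} range={decades:.4f} decades")
```

The count for 0.F × 10^-2 comes out as 99 because the sketch excludes zero; counting zero as well gives the 100 quoted above.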

Upvotes: 0

Eric Postpischil

Reputation: 222901

A floating-point format represents numbers using a sign (− or +), an exponent (an integer in some range e_min to e_max, inclusive), and a significand that is a numeral of p digits in base b, where b is the fixed base for the format and p is called the precision. We will consider a binary format, in which b is two.

Let the digits of the significand be f_0, f_{−1}, f_{−2}, …, f_{1−p}, so the significand is Σ_{−p<i≤0} f_i·b^i, and the value represented is (−1)^s · b^e · Σ_{−p<i≤0} f_i·b^i, where s is a bit for the sign and e is the exponent.

If f_0 is zero, we can omit it from the sum, and the value represented equals (−1)^s · b^e · Σ_{−p<i≤−1} f_i·b^i = (−1)^s · b^{e−1} · Σ_{−p<i≤−1} f_i·b^{i+1} = (−1)^s · b^{e−1} · Σ_{1−p<i≤0} f_{i−1}·b^i. Therefore, when f_0 is zero and e is not e_min, there are two representations of the number. Encoding both of them would be wasteful, so we desire an encoding scheme that does not encode both representations.
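As a quick sanity check of that identity, here is a minimal sketch with a made-up precision of p = 4 binary digits. The function name `value` and the sample digits are assumptions chosen for illustration only.

```python
from fractions import Fraction

def value(sign, e, digits, b=2):
    """(-1)^sign * b^e * sum of f_i * b^i, with digits = [f_0, f_-1, ..., f_(1-p)]."""
    significand = sum(Fraction(f) * Fraction(b) ** (-k)
                      for k, f in enumerate(digits))
    return (-1) ** sign * Fraction(b) ** e * significand

# Leading digit f_0 = 0: 0.101 x 2^3
with_leading_zero = value(0, 3, [0, 1, 0, 1])

# Same digits shifted left by one place, exponent lowered by one: 1.010 x 2^2
shifted = value(0, 2, [1, 0, 1, 0])

print(with_leading_zero, shifted, with_leading_zero == shifted)
```

Both calls evaluate to 5, which is the duplication the encoding scheme has to avoid.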

We accomplish this:

  • Some value E encodes the exponent e. The values of s and f_{−1} to f_{1−p} are stored directly as bits.
  • If E is zero, e is e_min and f_0 is zero.
  • If E is not zero, e is E − bias and f_0 is one, where bias is 1 − e_min.
  • (A special value of E may be reserved to represent infinities and NaNs, not discussed further here.)
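The scheme above can be sketched for the 32-bit binary format, assuming its usual layout of 1 sign bit, 8 bits for E, and 23 stored significand bits (so p = 24 and e_min = −126). The function names `decode_binary32` and `float_from_bits` are hypothetical; the second one only asks Python's `struct` module to reinterpret the same bits, as a cross-check.

```python
import struct
from fractions import Fraction

P = 24             # precision: 1 implicit digit f_0 plus 23 stored digits
E_MIN = -126
BIAS = 1 - E_MIN   # 127

def decode_binary32(bits):
    """Decode a 32-bit pattern by the rules above; returns an exact Fraction."""
    s = bits >> 31
    E = (bits >> 23) & 0xFF
    F = bits & 0x7FFFFF                    # stored digits f_-1 .. f_(1-p)
    if E == 0xFF:
        raise ValueError("infinity or NaN, not handled in this sketch")
    if E == 0:
        e, f0 = E_MIN, 0                   # subnormal (or zero)
    else:
        e, f0 = E - BIAS, 1                # normal
    significand = f0 + Fraction(F, 1 << (P - 1))
    return (-1) ** s * significand * Fraction(2) ** e

def float_from_bits(bits):
    """Reinterpret the same 32 bits as a float, for comparison."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for bits in (0x00000001, 0x007FFFFF, 0x00800000, 0x3F800000, 0xC0490FDB):
    print(hex(bits), decode_binary32(bits) == Fraction(float_from_bits(bits)))
```

Note that E = 0 and E = 1 share the exponent e_min; only f_0 differs, which is why the subnormals continue the normal range downward without a gap.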

This representation and this encoding scheme answer the questions:

Why exactly is the format of subnormal numbers ±(0.F) × 2^-126 and not ±(1.F) × 2^-127?

Subnormal numbers of the form ±(1.F) × 2^-127 would fail to include zero and would include numbers outside the set the format represents, since some of them would have non-zero digits below the lowest digit position of that set. (The lowest digit of the form described in the first paragraph corresponds to b^{e_min+(1−p)}, whereas numbers in the form ±(1.F) × 2^-127 would have a lowest digit corresponding to b^{e_min−1+(1−p)}.)
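Both points can be checked with exact arithmetic. The sketch below assumes the 32-bit binary parameters (p = 24, e_min = −126); `hypothetical_subnormal` models the rejected ±(1.F) × 2^-127 form and is not anything IEEE 754 defines.

```python
from fractions import Fraction

P, E_MIN = 24, -126
FRACTION_BITS = P - 1

def actual_subnormal(F):
    """0.F x 2^e_min, the form IEEE 754 uses."""
    return Fraction(F, 2 ** FRACTION_BITS) * Fraction(2) ** E_MIN

def hypothetical_subnormal(F):
    """1.F x 2^(e_min - 1), the rejected alternative (illustration only)."""
    return (1 + Fraction(F, 2 ** FRACTION_BITS)) * Fraction(2) ** (E_MIN - 1)

# Weight of the lowest digit position in the format: 2^(e_min + 1 - p) = 2^-149
lowest_digit = Fraction(2) ** (E_MIN + 1 - P)

zero_is_representable = actual_subnormal(0) == 0
smallest_hypothetical = hypothetical_subnormal(0)   # 2^-127, so zero is lost
off_grid = hypothetical_subnormal(1)                # 2^-127 + 2^-150

print(zero_is_representable)
print(smallest_hypothetical == Fraction(1, 2 ** 127))
print(off_grid / lowest_digit)   # not an integer: needs a digit below 2^-149
```

The last quotient has a fractional part of one half, meaning the hypothetical value sits halfway between two points of the 2^-149 grid.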

Why exactly is the format of normal numbers ±(1.F) × 2^exp and not, say, ±(11.F) × 2^exp or ±(10.F) × 2^exp?

Where the decimal point (or “radix point”) lies in the significand is irrelevant, as long as it is fixed. A representation described using the decimal point just after the first digit, as used herein, is equivalent to a representation using the decimal point after the last digit or at any other position, with a suitable adjustment to the exponent bounds: the same set of numbers is represented and the arithmetic properties are identical. So, in considering the difference between 1.F and 11.F, we do not care where the decimal point lies. However, we do care about how many digits are represented. A floating-point format uses a representation with a fixed number of digits. 11.F has one more digit than 1.F, and we have no reason to encode that.
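The equivalence is easy to confirm by brute force on a small format. In this sketch the precision p = 4 and the exponent range −3 to 3 are arbitrary assumptions, and both function names are made up; the only point is that moving the radix point p − 1 places is absorbed by shifting the exponent bounds by p − 1.

```python
from fractions import Fraction

def values_point_after_first(p, e_min, e_max):
    """All values d.ddd x 2^e with p binary digits, point after the first digit."""
    return {Fraction(m, 2 ** (p - 1)) * Fraction(2) ** e
            for e in range(e_min, e_max + 1)
            for m in range(2 ** p)}

def values_point_after_last(p, q_min, q_max):
    """All values M x 2^q with M a p-digit binary integer (point after last digit)."""
    return {m * Fraction(2) ** q
            for q in range(q_min, q_max + 1)
            for m in range(2 ** p)}

p, e_min, e_max = 4, -3, 3
first = values_point_after_first(p, e_min, e_max)
last = values_point_after_last(p, e_min - (p - 1), e_max - (p - 1))
same_set = first == last

print(same_set, len(first))
```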

As for the difference between 11.F and 10.F, the normal/subnormal distinction exists because, arithmetically, there are two representations of the same number if the first digit is zero and the exponent is not at the minimum. Specifying one form as the normal form allows us to eliminate these duplicate representations. However, 11.F and 10.F represent different numbers, so there is no duplicate to eliminate and no reason to say one of these is normal and the other is not.

Upvotes: 0
