Patrick

Reputation: 33

On round half to even between the largest denormalized number and the smallest normal number in IEEE 754

This is a very simple question, but it needs an expert to answer.

As we know, with subnormals in floating point, we get rid of the gap between 2^emin and 0.

Under the round half to even (RTE) mode, should we round the infinite-precision result as (0.1111...1|r)*2^emin or as (1.1111...1|r')*2^(emin-1)? The digit to the left of the point is the implicit bit.

For the first case: in the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, Figure 2 shows that the spacing to the left of 2^emin is the same as the spacing to its right. So, straightforwardly, the numerical value of the number just to the left is 2^emin * (1 - 2^(1-p)) (p = 24 in binary32). If we do RTE rounding, it seems we should use the bit after the 24-bit significand as the rounding bit, i.e. (0.111_1111_1111_1111_1111_1111|r)*2^emin. See

--|---+---+....+-?-|---+---+....+---|-------+........
  0             2^emin           2^(emin+1)

I use a question mark (?) on the axis to mark the halfway point.
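If this equal-spacing picture is right, the exact midpoint between the largest subnormal and 2^emin is representable in double precision, and converting it to binary32 shows which neighbor round half to even picks. A minimal sketch using the standard-library struct module (the helper name f32_bits is my own):

```python
import struct

def f32_bits(x):
    # Round the double x to binary32 and return its raw bit pattern.
    return hex(struct.unpack('<I', struct.pack('<f', x))[0])

min_normal = 2.0**-126                        # smallest positive normal binary32
max_subnormal = min_normal * (1 - 2**-23)     # largest subnormal binary32
midpoint = (min_normal + max_subnormal) / 2   # exact in double precision

print(f32_bits(max_subnormal))  # 0x7fffff
print(f32_bits(midpoint))       # 0x800000
print(f32_bits(min_normal))     # 0x800000
```

The tie lands on 0x800000 because the significand of 2^emin is all zeros, i.e. even, while 0x7fffff ends in an odd bit.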

For the second case: in the IEEE standard, subnormal detection is specified in terms of the result computed as if the exponent range were unbounded, before rounding. So if we may assume an unbounded exponent, we can shift the exact result to (1.1111...1|r')*2^(emin-1). In this case, the space just to the left of 2^emin is half-sized. This matches all the other adjacent spacings around each 2^e, but as we approach 0 on the axis the number of such spaces becomes infinite. See

--|...++++|-+-+-+...|---+---+....+-?-|-------+........
  0            2^(emin-1)         2^emin
     (the spacing near 0 keeps shrinking)

In this case, it seems we should round the exact result as

(1.111_1111_1111_1111_1111_1111|r')*2^(emin-1)

This is the case-1 result shifted left by one bit, which means the guard bit matters in this case.
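Note that the two normalizations in case 1 and case 2 denote the same real number, just written against different exponents, so the exact value being rounded is identical; only the bookkeeping of the guard/round bits differs. A quick check with plain Python doubles, which represent these values exactly:

```python
# Same exact value, two normalizations:
# (0.111...1|1) * 2^-126, with 24 ones after the point, versus
# (1.111...1|1) * 2^-127, with the leading 1 made explicit.
a = (1 - 2**-24) * 2.0**-126   # case 1 form: 0.111...1|1 * 2^emin
b = (2 - 2**-23) * 2.0**-127   # case 2 form: 1.111...1|1 * 2^(emin-1)
assert a == b
```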

In these two cases we have different rounding bits, so we may get different results. Which case should we follow? I couldn't find any document or paper that discusses this topic clearly.

Upvotes: 1

Views: 655

Answers (3)

Patrick

Reputation: 33

I found the following (since I'm not a Python expert, I'm not sure whether this is authoritative enough):

import numpy as np
import struct

def float_to_hex(f):
    # Reinterpret the bits of a binary32 value as an unsigned integer.
    return hex(struct.unpack('<I', struct.pack('<f', f))[0])

if __name__ == '__main__':
    min_normal = np.float64(2**-126)                     # smallest positive normal binary32
    max_denorm = np.float64(2**-126 - 2**-126 * 2**-23)  # largest subnormal binary32
    # Emin(-126) * (0.111_1111_1111_1111_1111_1111)_11
    n1 = np.float32(max_denorm + min_normal*np.float64(2**-24) + min_normal*np.float64(2**-25))
    # Emin(-126) * (0.111_1111_1111_1111_1111_1111)_10
    n2 = np.float32(max_denorm + min_normal*np.float64(2**-24))
    # Emin(-126) * (0.111_1111_1111_1111_1111_1111)_01
    n3 = np.float32(max_denorm + min_normal*np.float64(2**-25))

    print(float_to_hex(n1))
    print(float_to_hex(n2))
    print(float_to_hex(n3))

The output with Python 2 on x86_64 Linux is: 0x800000 0x800000 0x7fffff

It looks like the conversion to float32 follows round half to even by default, and picks the bit after the 24-bit significand as the rounding bit.

This is what I have found so far; maybe Python or arithmetic experts can give some feedback.

P.S. This feedback is too long for a comment, so I put it here, @PatriciaShanahan.

Thanks Patricia

Upvotes: 0

Patricia Shanahan

Reputation: 26185

The OP needs practical ways to test results. Java has several useful properties. Its rounding, in strictfp mode, matches the IEEE 754 standard, so it can serve as a reference implementation. This program illustrates the rounding of the number exactly halfway between the 32-bit minimum positive normal number and the largest subnormal number.

public strictfp class Test {
  public static void main(String[] args) {
    printIt(Float.MIN_NORMAL);
    printIt(Math.nextDown(Float.MIN_NORMAL));
    double d = ((double)Float.MIN_NORMAL + (double)Math.nextDown(Float.MIN_NORMAL))/2;
    printIt((float)d);
  }
  static void printIt(float f){
    int bits = Float.floatToIntBits(f);
    String s = Integer.toBinaryString(bits);
    while(s.length() < 32){
      s = "0" + s;
    }
    System.out.println(s);
  }
}

Output:

00000000100000000000000000000000
00000000011111111111111111111111
00000000100000000000000000000000

In a comment, the OP asked for analysis of the following cases:

n1=Math.nextDown(Float.MIN_NORMAL) + Float.MIN_NORMAL*(2^-24) + Float.MIN_NORMAL*(2^-25) 

n2=Math.nextDown(Float.MIN_NORMAL) + Float.MIN_NORMAL*(2^-24)

I am assuming the intermediate calculations should be done in real number, not float, arithmetic.

According to my calculations, the real number values are:

n1=0.111_1111_1111_1111_1111_1111_1011_1111_1111_1111_1111_1110_1 * MIN_NORMAL
n2=0.111_1111_1111_1111_1111_1111_0111_1111_1111_1111_1111_1111 * MIN_NORMAL

The mid-point between MIN_NORMAL and the maximum subnormal is:

0.111_1111_1111_1111_1111_1111_1 * MIN_NORMAL

n1 is greater than the mid-point, and therefore must round up to MIN_NORMAL. n2 is less than the mid-point, and therefore must round down to the largest subnormal. Those are the results I get from Java.
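As a sanity check, the midpoint value above can be verified with plain Python doubles, since every quantity involved is exactly representable in binary64:

```python
# Midpoint between MIN_NORMAL and the largest binary32 subnormal:
# 0.111_1111_1111_1111_1111_1111_1 * MIN_NORMAL = (1 - 2^-24) * 2^-126.
min_normal = 2.0**-126
max_subnormal = min_normal * (1 - 2**-23)
assert (min_normal + max_subnormal) / 2 == (1 - 2**-24) * min_normal
```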

Upvotes: 0

user2357112

Reputation: 282104

IEEE 754 rounding is not specified in terms of rounding bits or guard digits. It is specified in terms of the real number the computation would have produced if we could do our math in exact, real-number arithmetic instead of limited-precision floating point.

When the exact value of a computation is exactly halfway between two representable numbers, round-half-to-even rounds to the option whose representation in the current floating-point format would have an even least-significant mantissa digit. This can also be thought of as rounding to the option that would have a higher power of two if both options were written as an odd integer times a power of two (and treating 0 as having a higher power of two than other numbers).
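The same tie-breaking rule can be observed away from the subnormal boundary, for example near 1.0 in binary32. A small sketch using Python's standard-library struct module to perform the correctly rounded double-to-binary32 conversion (the helper name f32 is mine):

```python
import struct

def f32(x):
    # Bit pattern of x after correctly rounded conversion to binary32.
    return struct.unpack('<I', struct.pack('<f', x))[0]

# 1 + 2^-24 is exactly halfway between the binary32 neighbors 1.0 and 1 + 2^-23;
# round half to even picks 1.0, whose last significand bit is 0 (even).
assert f32(1 + 2**-24) == f32(1.0)

# 1 + 3*2^-24 is halfway between 1 + 2^-23 (last bit odd) and 1 + 2^-22 (last bit even);
# the tie goes to the even option, 1 + 2^-22.
assert f32(1 + 3 * 2**-24) == f32(1 + 2**-22)
```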

Guard digits may be involved in an implementation, as long as the implementation produces the specified rounding behavior. IEEE 754 does not mandate any particular rounding implementation, only the behavior of the various rounding modes.

Upvotes: 6
