Reputation: 33
This is a very simple question, but it needs an expert to answer.
As we know, with subnormals in floating point we get rid of the gap between 2^emin and 0.
Under the round-half-to-even (RTE) mode, should we round the infinitely precise result as (0.1111...1|r) * 2^emin or as (1.1111...1|r') * 2^(emin-1)? The digit on the left of the point is the implicit bit.
For the first case:
In the paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic", Figure 2, I see that the space on the left of 2^emin is the same as the space on its right. So, straightforwardly, the numerical value of the number just to its left is (1 - 2^(1-p)) * 2^emin (p = 24 in fp32). If we do RTE rounding, it seems we should use the bit after the 24-bit significand as the rounding bit, i.e. (0.111_1111_1111_1111_1111_1111|r) * 2^emin; see

--|---+---+....+-?-|---+---+....+---|-------+........
--0---------------2^emin---------2^emin+1

I use a question mark (?) on the axis to represent the halfway point.
For the second case: in the IEEE standard, tininess detection "before rounding" is specified as if the exponent range were unbounded. So if we can have an unbounded exponent, we can shift the exact result to (1.1111...1|r') * 2^(emin-1). In this case the space just to the left of 2^emin is half-sized, similar to every other pair of adjacent spaces around a power 2^e, except that approaching 0 on the axis the number of spaces becomes infinite. See

--|...++++|-+-+-+...|---+---+....+-?-|-------+........
--0-....----------2^(emin-1)---------2^emin
(the spacing on the left keeps shrinking)

In this case, it seems we should round the exact result as

(1.111_1111_1111_1111_1111_1111|r') * 2^(emin-1)

which is the case 1 result shifted left by one bit, meaning the guard bit is useful in this case.
In these two cases we have different rounding bits, so we may get different results. Which case should we follow? I couldn't find any document or paper that discusses this topic clearly.
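To make the difference concrete, here is a small sketch in exact rational arithmetic (Python's fractions module; the names EMIN and rne_to_multiple are mine, and the quantum used for each case is my reading of that case, so treat this as an illustration rather than a reference):

from fractions import Fraction

EMIN = -126  # float32: emin = -126, p = 24

# Exact value halfway between the largest subnormal and 2^EMIN:
# 0.111_1111_1111_1111_1111_1111_1 * 2^EMIN (24 significant bits)
half = (1 - Fraction(1, 2**24)) * Fraction(2)**EMIN

def rne_to_multiple(x, quantum):
    # Round exact x to an integer multiple of quantum, ties to the even multiple.
    q = x // quantum     # floor quotient (an int)
    r = x - q * quantum  # exact remainder
    if 2*r > quantum or (2*r == quantum and q % 2):
        q += 1
    return q * quantum

# Case 1: keep 23 fraction bits of 0.xxx * 2^EMIN    -> quantum 2^(EMIN-23)
case1 = rne_to_multiple(half, Fraction(2)**(EMIN - 23))
# Case 2: keep 23 fraction bits of 1.xxx * 2^(EMIN-1) -> quantum 2^(EMIN-24)
case2 = rne_to_multiple(half, Fraction(2)**(EMIN - 24))

print(case1 == Fraction(2)**EMIN)  # True: case 1 rounds the tie up to the minimum normal
print(case2 == half)               # True: case 2 keeps the value unchanged

Note that case 2 never even sees a tie: once normalized to 2^(emin-1), the halfway point fits in 24 bits exactly, yet the resulting value is not representable in float32, which is exactly the conflict described above.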
Upvotes: 1
Views: 655
Reputation: 33
I found this (since I'm not a Python expert, I'm not sure whether this is golden enough):
import numpy as np
import struct

def float_to_hex(f):
    # Bit pattern of f after conversion to a 32-bit float.
    return hex(struct.unpack('<I', struct.pack('<f', f))[0])

if __name__ == '__main__':
    # Smallest positive normal and largest subnormal of float32, held in float64.
    min_normal = np.float64(1 * (2 ** -126))
    max_denorm = np.float64(1 * (2 ** -126) - 1 * (2 ** -126) * (2 ** -23))
    # Emin(-126) * (0.111_1111_1111_1111_1111_1111)_11
    n1 = np.float32(max_denorm + min_normal * np.float64(2 ** -24)
                    + min_normal * np.float64(2 ** -25))
    # Emin(-126) * (0.111_1111_1111_1111_1111_1111)_10
    n2 = np.float32(max_denorm + min_normal * np.float64(2 ** -24))
    # Emin(-126) * (0.111_1111_1111_1111_1111_1111)_01
    n3 = np.float32(max_denorm + min_normal * np.float64(2 ** -25))
    print(float_to_hex(n1))
    print(float_to_hex(n2))
    print(float_to_hex(n3))
Python 2 on x86_64 Linux. The output is:
0x800000
0x800000
0x7fffff
It looks like Python 2 follows round-half-to-even by default and picks the bit after the 24-bit significand as the rounding bit.
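As a cross-check of the script above (my own sketch, not authoritative either), the same three values can be built in exact rational arithmetic with Python's fractions.Fraction and converted to float only at the end. Each exact value happens to fit in a float64 without rounding, so the only rounding is the final float64-to-float32 conversion, which on x86_64 is round-to-nearest, ties-to-even:

from fractions import Fraction
import struct

def float_to_hex(f):
    # Same helper as above: bit pattern of f converted to float32.
    return hex(struct.unpack('<I', struct.pack('<f', f))[0])

min_normal = Fraction(2)**-126
max_denorm = min_normal - Fraction(2)**-149  # largest float32 subnormal

# Tails below the 24-bit significand: |11, |10, |01
for tail in (Fraction(2)**-150 + Fraction(2)**-151,
             Fraction(2)**-150,
             Fraction(2)**-151):
    exact = max_denorm + tail          # exact rational value
    print(float_to_hex(float(exact)))  # float(exact) is exact in float64 here

This prints 0x800000, 0x800000, 0x7fffff, matching the numpy version.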
This is what I found so far; maybe Python or arithmetic experts can give some feedback.
PS: This feedback is too long for a comment, so I put it here, @PatriciaShanahan. Thanks, Patricia.
Upvotes: 0
Reputation: 26185
The OP needs practical ways to test results. Java has several useful properties: its rounding in strictfp mode matches the IEEE 754 standard, so it can serve as a reference implementation. This program illustrates rounding of the number exactly halfway between the 32-bit minimum positive normal and the largest subnormal number.
public strictfp class Test {
    public static void main(String[] args) {
        printIt(Float.MIN_NORMAL);
        printIt(Math.nextDown(Float.MIN_NORMAL));
        double d = ((double) Float.MIN_NORMAL
                + (double) Math.nextDown(Float.MIN_NORMAL)) / 2;
        printIt((float) d);
    }

    static void printIt(float f) {
        int bits = Float.floatToIntBits(f);
        String s = Integer.toBinaryString(bits);
        while (s.length() < 32) {
            s = "0" + s;
        }
        System.out.println(s);
    }
}
Output:
00000000100000000000000000000000
00000000011111111111111111111111
00000000100000000000000000000000
In a comment, the OP asked for analysis of the following cases:
n1 = Math.nextDown(Float.MIN_NORMAL) + Float.MIN_NORMAL*(2^-24) + Float.MIN_NORMAL*(2^-25)
n2 = Math.nextDown(Float.MIN_NORMAL) + Float.MIN_NORMAL*(2^-24)
I am assuming the intermediate calculations should be done in real-number arithmetic, not float arithmetic.
According to my calculations, the real number values are:
n1=0.111_1111_1111_1111_1111_1111_1011_1111_1111_1111_1111_1110_1 * MIN_NORMAL
n2=0.111_1111_1111_1111_1111_1111_0111_1111_1111_1111_1111_1111 * MIN_NORMAL
The mid-point between MIN_NORMAL and the maximum subnormal is:
0.111_1111_1111_1111_1111_1111_1 * MIN_NORMAL
n1 is greater than the mid-point, and therefore must round up to MIN_NORMAL. n2 is less than the mid-point, and therefore must round down to the largest subnormal. Those are the results I get from Java.
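For anyone who wants to re-check the comparison outside Java, here is a quick sketch (mine, in Python, using exact rational arithmetic and reusing the binary strings above):

from fractions import Fraction

def frac_bits(s):
    # Value of the binary fraction 0.b1b2...bn given the digits after the point.
    bits = s.replace('_', '')
    return sum(Fraction(int(b), 2**(i + 1)) for i, b in enumerate(bits))

n1  = frac_bits('111_1111_1111_1111_1111_1111_1011_1111_1111_1111_1111_1110_1')
n2  = frac_bits('111_1111_1111_1111_1111_1111_0111_1111_1111_1111_1111_1111')
mid = frac_bits('111_1111_1111_1111_1111_1111_1')

print(n1 > mid)  # True  -> n1 rounds up to MIN_NORMAL
print(n2 < mid)  # True  -> n2 rounds down to the largest subnormal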
Upvotes: 0
Reputation: 282104
IEEE 754 rounding is not specified in terms of rounding bits or guard digits. It is specified in terms of the real number the computation would have produced if we could do our math in exact, real-number arithmetic instead of limited-precision floating point.
When the exact value of a computation is exactly halfway between two representable numbers, round-half-to-even rounds to the option whose representation in the current floating-point format would have an even least-significant mantissa digit. This can also be thought of as rounding to the option that would have a higher power of two if both options were written as an odd integer times a power of two (and treating 0 as having a higher power of two than other numbers).
Guard digits may be involved in an implementation, as long as the implementation produces the specified rounding behavior. IEEE 754 does not mandate any particular rounding implementation, only the behavior of the various rounding modes.
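To make that concrete, here is a small illustration (a sketch of mine, not anything mandated by the standard): near the subnormal/normal boundary every representable float32 is an integer multiple of 2^-149, so rounding the exact real result just means picking the nearest such multiple, with ties going to the even multiple. Python's fractions module can hold the exact value:

from fractions import Fraction
import struct

Q = Fraction(2)**-149  # spacing of float32 values in [0, 2^-125)

def round_half_even(x):
    # Round exact x (assumed 0 <= x < 2^-125) to the nearest multiple of Q,
    # ties to the even multiple -- the real-number definition of the rounding.
    q = x // Q           # floor quotient (an int)
    r = x - q * Q        # exact remainder
    if 2*r > Q or (2*r == Q and q % 2):
        q += 1
    return q * Q

# Exact real number halfway between the largest subnormal and smallest normal:
half = (1 - Fraction(1, 2**24)) * Fraction(2)**-126

print(round_half_even(half) == Fraction(2)**-126)  # True: the tie goes to the even side

# The hardware double -> float32 conversion agrees (float(half) is exact here):
bits = struct.unpack('<I', struct.pack('<f', float(half)))[0]
print(hex(bits))  # 0x800000, the minimum positive normal

Note that the multiple 2^23 (the minimum normal) is the even option at this tie, which is why both the definition and the hardware round the halfway point up.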
Upvotes: 6