Floating Point operators correctly rounded implementation

Question

I'm Implementing a floating point library in C, basically it's based on the IEEE standard. I've started with the addition.

I've some problem to understand the rounding.

Maybe an example could help

x = -3.652248e-11 
y = 1.263346e-10 
Cz = 8.981213e-11
Mz = 8.981214e-11
Cz = 8.98121e-11
Mz = 8.98121e-11
x = ae20a0a7
y = 2f0ae808
Cf = 2ec57fbc
Mf = 2ec57fbd
Cz = 457fbc
Mz = 457fbd
Ce = 5d
Me = 5d

In the above, x and y are random input value, Cz is the result of the C operator + and finally Mz is the result of my implementation.

Cf and Mf are bit representation of the final result. As you can see the final bit is different and i don't understand why. I've took inspiration for my implementation from an handbook of floating point arithmetic.

What I do not understand, I guess, is how actually the rounding is performed. My addition algorithm is based on the identity

x + y = (-1)^{sx}2^Ex(|x| + (-1)^(sx xor sy) |y| 2^{Ey-Ex})

where if I name the quantity

|z| = (|x| + (-1)^(sx xor sy) |y| 2^{Ey-Ex})

Basically the problem arises when I need to post-normalize the result using left shifting, be careful in my case |z| is always positive. Which rounding technique should be applied here in this case?

Stephen Canon · Accepted Answer

My copy of Muller et al. is on loan to a friend, so I can't double-check the algorithm that you're using specifically, but walking through addition of the values you list:

x = 0xae20a0a7 = -b1.01000001010000010100111 * 2^-35
y = 0x2f0ae808 = +b1.00010101110100000001000 * 2^-33

If we normalize x and y to a common exponent and add, we get the un-normalized infinitely precise result:

  b100.01010111010000000100000 * 2^-35
-   b1.01000001010000010100111 * 2^-35
-------------------------------------
   b11.00010101111111101111001 * 2^-35

Now normalize without rounding yet:

b1.10001010111111110111100 1 * 2^-34
                          ^
                          rounding point

The infinitely-precise result is exactly halfway between the two nearest floating-point numbers, so we choose the even one, and round down to

b1.10001010111111110111100 * 2^-34 = 0x2ec57fbc

Given that this is an exact halfway case, the most likely explanation for why you're not getting the correct answer is that you're not handling this ties to even part of the rounding rule correctly. If you try to round by simply adding half an ulp and truncating, you would get exactly the result you are observing.

Floating Point operators correctly rounded implementation

Answers (1)

Related Questions