Reputation: 99
I am writing code for enumerating floating point addition in C++ using integer addition and shifts for some homework. I have googled the topic and I am able to add floating point numbers by adjusting exponents and then adding. The problem is I could not find the appropriate algorithm for rounding off result. Right now I am using truncation. It shows errors of something like 0.000x magnitude. But when I try to use this adder for complex calculations like fft's, it shows enormous errors. So what I am looking for now is the exact algorithm that is used by my machine for rounding off floating point results. It would be great if someone can post some link for the purpose.
Thanks in advance.
Upvotes: 0
Views: 218
Reputation: 223824
Most commonly, if the bits to be rounded away represent a value less than half that of the smallest bit to be retained, they are rounded downward, the same as truncation. If they represent more than half, they are rounded upward, thus adding one in the position of the smallest retained bit. If they are exactly half, they are rounded downward if the smallest retained bit is zero and upward if the bit is one. This is called “round-to-nearest, ties to even.”
This presumes you have all the bits you are rounding away, that none have been lost yet in the course of doing arithmetic. If you cannot keep all the bits, there are techniques for keeping track of enough information about them to do the correct rounding, such as maintaining three bits called guard, round, and sticky bits.
Upvotes: 2