Floating Point addition using integer operations

Question

I am writing code for enumerating floating point addition in C++ using integer addition and shifts for some homework. I have googled the topic and I am able to add floating point numbers by adjusting exponents and then adding. The problem is I could not find the appropriate algorithm for rounding off result. Right now I am using truncation. It shows errors of something like 0.000x magnitude. But when I try to use this adder for complex calculations like fft's, it shows enormous errors. So what I am looking for now is the exact algorithm that is used by my machine for rounding off floating point results. It would be great if someone can post some link for the purpose.

Thanks in advance.

Floating Point addition using integer operations

Answers (1)

Related Questions