How to add two float numbers resulting in overflow

Question

I have two binary fraction numbers that I want to add:

1.100110011001100110011001100110011001100110011001101 x 2^-4

and

11.001100110011001100110011001100110011001100110011010 x 2^-4

If I simply add them, the result seems to overflow (54 bits):

      1.100110011001100110011001100110011001100110011001101
   + 11.001100110011001100110011001100110011001100110011010
      -----------------------------------------------------
    100.110011001100110011001100110011001100110011001100111

How do I handle that if I still need to store the result as a double-precision number with a 52-bit mantissa?

Patricia Shanahan · Accepted Answer

The next step after the addition is to adjust the exponent so that the leading significant bit is immediately before the binary point. In this case, you will need to add two to the exponent, taking it from -4 to -2.

The new significand is 1.00110011001100110011001100110011001100110011001100111
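As a rough check of the normalization step, here is a minimal Python sketch. It assumes the two significands are held as plain integers scaled by 2^-51 (the 51 bits after the binary point in the question); the variable names are only illustrative.

```python
# Significands from the question as integers scaled by 2**-51.
# The fraction bits are written with string repetition:
# 1001 repeated 12 times then 101, and 0011 repeated 12 times then 010.
FRACTION_BITS = 51
a = int("1" + "1001" * 12 + "101", 2)   # 1.1001...101 x 2^-4
b = int("11" + "0011" * 12 + "010", 2)  # 11.0011...010 x 2^-4
exponent = -4

total = a + b  # exact integer sum, nothing has been rounded yet

# A normalized significand has exactly one bit before the binary point,
# i.e. FRACTION_BITS + 1 bits in total. Every extra bit means moving the
# point one place to the left and adding 1 to the exponent.
shift = total.bit_length() - (FRACTION_BITS + 1)
normalized_exponent = exponent + shift

print(total.bit_length())    # 54
print(shift)                 # 2
print(normalized_exponent)   # -2
```

The integer itself does not change during normalization; only the position of the binary point (and therefore the exponent) does.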

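Now round to 53 bits. The significand above has 54 bits, so this drops the final 1, adjusting according to the rounding mode. Under round-to-nearest (ties to even), the dropped bit is exactly half a unit in the last place and the last kept bit is 1, so you will need to round up.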
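The rounding step can be sketched in Python along the following lines. This assumes round-to-nearest with ties to even (the usual default mode) and ignores special cases such as subnormals, infinities and exponent overflow; `round_to_nearest_even` is a hypothetical helper written for this example, not a library function.

```python
import math

def round_to_nearest_even(significand, keep_bits):
    """Round an integer significand to keep_bits bits, ties to even.

    Returns (rounded, dropped), where dropped is the number of low bits
    removed, so the rounded value stands for rounded * 2**dropped.
    """
    dropped = significand.bit_length() - keep_bits
    if dropped <= 0:
        return significand, 0          # already fits, nothing to round
    kept = significand >> dropped
    remainder = significand & ((1 << dropped) - 1)
    half = 1 << (dropped - 1)
    # Round up when above the halfway point, or exactly halfway with an
    # odd last kept bit (ties to even).
    if remainder > half or (remainder == half and kept & 1):
        kept += 1
    # Rounding up may carry into a new leading bit; renormalize if so.
    if kept.bit_length() > keep_bits:
        kept >>= 1
        dropped += 1
    return kept, dropped

# The 54-bit significand from the answer: 1.0011...00111 (53 fraction bits),
# with exponent -2, so its value is significand * 2**(-2 - 53).
significand = int("1" + "0011" * 12 + "00111", 2)
rounded, dropped = round_to_nearest_even(significand, 53)
result = math.ldexp(rounded, dropped - 55)

print(dropped)               # 1
print(result.hex())          # 0x1.3333333333334p-2
print(result == 0.1 + 0.2)   # True
```

The two inputs in the question are the doubles nearest to 0.1 and 0.2, so the comparison on the last line checks the hand-rounded result against what the hardware addition produces.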
