Max Koretskyi

Reputation: 105497

How to add two float numbers resulting in overflow

I have two binary fraction numbers that I want to add:

1.100110011001100110011001100110011001100110011001101 x 2^-4

and

11.001100110011001100110011001100110011001100110011010 x 2^-4

If I simply add them, the result seems to overflow the significand (54 bits):

      1.100110011001100110011001100110011001100110011001101
   + 11.001100110011001100110011001100110011001100110011010
      -----------------------------------------------------
    100.110011001100110011001100110011001100110011001100111

How do I handle that if I still need to store the result as a double-precision number with a 52-bit mantissa?

Upvotes: 0

Views: 677

Answers (1)

Patricia Shanahan

Reputation: 26185

The next step after the addition is to adjust the exponent so that the leading significant bit is immediately before the binary point. In this case, you will need to add two to the exponent, changing it from -4 to -2.

The new significand is 1.00110011001100110011001100110011001100110011001100111

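Now round to 53 significant bits. The significand above has 54, so one bit, the final 1, is dropped and the result is adjusted according to the rounding mode. The dropped bit is exactly half a unit in the last place, so under round-to-nearest (ties to even) you will need to round up, because the last retained bit is 1. The stored significand becomes 1.0011001100110011001100110011001100110011001100110100.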
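If you want to check the procedure mechanically, here is a minimal Python sketch that does the addition on exact integers, then normalizes and rounds to 53 significant bits. The helper name `round_to_precision` and the (significand, scale) representation are illustrative choices of mine, not a library API, and the sketch assumes round-to-nearest with ties to even. The bit strings are built from the repeating patterns in the question rather than typed out in full.

```python
import math

def round_to_precision(total: int, scale: int, precision: int = 53):
    """Round total * 2**scale to `precision` significant bits, ties to even.

    Returns (significand, scale) with significand.bit_length() <= precision.
    Hypothetical helper for illustration only.
    """
    extra = total.bit_length() - precision      # low bits that do not fit
    if extra <= 0:
        return total, scale                     # already fits, nothing to drop
    kept, dropped = divmod(total, 1 << extra)   # split off the discarded bits
    half = 1 << (extra - 1)                     # exactly half a unit in the last place
    if dropped > half or (dropped == half and kept & 1):
        kept += 1                               # round up (a tie goes to the even neighbour)
        if kept.bit_length() > precision:       # carry out, e.g. 1.11...1 -> 10.0...0
            kept >>= 1
            extra += 1
    return kept, scale + extra

# Bit patterns from the question; both have 51 fractional bits and exponent -4.
FRAC_BITS = 51
a = int("1" + "1001" * 12 + "101", 2)    # 1.1001...1001101
b = int("11" + "0011" * 12 + "010", 2)   # 11.0011...0011010

total = a + b                            # exact sum, 100.1100...111
sig, scale = round_to_precision(total, -4 - FRAC_BITS)

result = math.ldexp(sig, scale)          # significand * 2**scale as a float
print(total.bit_length(), sig.bit_length(), result.hex())
```

The two operands are the doubles nearest 0.1 and 0.2, so `result` should compare equal to `0.1 + 0.2` as computed by the hardware, which is a convenient cross-check of the manual steps.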

Upvotes: 2
