Reputation: 203

floating point addition and subtraction

When performing addition of floating point binary numbers, typically you would change the smaller exponent to match the larger exponent, then adjust the mantissa accordingly. Once the mantissas are aligned they can be added together. The result is then normalised if necessary.

Why do we typically adjust the smaller exponent to match the larger? Why not the other way around? When performing these calculations by hand, the result is the same whichever approach is used.

Upvotes: 3

Answers (2)

chux

Reputation: 154169

Why do we typically adjust the smaller exponent to match the larger?

Adjusting the smaller value may be less work than adjusting the larger one, because simplifications can be made when adjusting the smaller value.

"then adjust the mantissa accordingly" has more to it than only a shift.

Consider the addition/subtraction of normalized a, b with n-bit significands and expo(a) >= expo(b).

All n bits of the significand of a are used.

The exponent of b is made the same as that of the larger a, and the significand of b is shifted right, but not all of it needs to be explicitly remembered. Besides the b bits that remain aligned with a, only 2 shifted-out bits are remembered, plus the “or” of all the other bits shifted out.

Example, b shifted (right) n-6 places.

1.23456789….......n 
a.aaaaaaaa…aaaaaaaa 000
0.00000000…00bbbbbb bbz (z is the “or” of all the less significant bits)

Now the addition/subtraction can be carried out using n+3+1¹ bit math. The 2 shifted out bits and the z are sufficient under all rounding modes to form the expected sum/difference.

¹ +1 for overflow.
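As a rough illustration of the scheme above, here is a minimal Python sketch for n = 53 (binary64). It only handles the addition of two positive normal values under round-to-nearest, ties-to-even, and it assumes no overflow. The names `decompose` and `add_grs` are made up for this example; real hardware does this on raw bit fields rather than through `math.frexp`.

```python
import math

N = 53  # significand width of IEEE 754 binary64 (Python float)

def decompose(x):
    """Split a positive normal float into (sig, exp) with x == sig * 2**exp
    and sig an N-bit integer."""
    m, e = math.frexp(x)             # x == m * 2**e with 0.5 <= m < 1
    return int(m * 2**N), e - N      # scaling by a power of two is exact

def add_grs(a, b):
    """Add two positive normal floats with N+3(+1) bit integer math,
    rounding to nearest, ties to even."""
    (sa, ea), (sb, eb) = decompose(a), decompose(b)
    if ea < eb:                      # make a the operand with the larger exponent
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)
    shift = min(ea - eb, N + 3)      # beyond N+3 places only z can be non-zero
    sa <<= 3                         # append 3 zero bits to a
    sb <<= 3                         # room for 2 shifted-out bits and z
    lost = sb & ((1 << shift) - 1)   # bits that fall off the right end
    sb = (sb >> shift) | (lost != 0) # z = "or" of every lost bit
    total = sa + sb                  # needs at most N+3+1 bits
    if total >> (N + 3):             # carry out: renormalise, keeping z sticky
        total = (total >> 1) | (total & 1)
        ea += 1
    sig, low = total >> 3, total & 0b111
    if low > 0b100 or (low == 0b100 and sig & 1):
        sig += 1                     # round up (ties go to the even significand)
    return math.ldexp(sig, ea)
```

For inputs within those assumptions, `add_grs(a, b)` should agree with the built-in `a + b`, even though no intermediate value is ever wider than n+3+1 bits.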

Without this simplification, integer math much wider than n+3 bits is needed, perhaps even hundreds of bits.

Example, a shifted (left) n-6 places.

aa aaaaaaaa a.aaaaaa00…00000000
            b.bbbbbbbb…bbbbbbbb
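To get a feel for how wide that gets, here is a small Python sketch for n = 53 (binary64). The helper name `bits_if_larger_shifted` is invented for this illustration, and the count ignores the extra carry bit.

```python
import math

N = 53                            # significand width of binary64
BITS_IF_SMALLER_SHIFTED = N + 3 + 1   # fixed width, independent of the operands

def bits_if_larger_shifted(a, b):
    """Bits needed to hold the larger significand once it has been shifted
    left to line up with the smaller exponent (carry bit not counted)."""
    ea, eb = math.frexp(a)[1], math.frexp(b)[1]
    return N + abs(ea - eb)
```

With exponents 100 apart, such as `bits_if_larger_shifted(2.0**100, 1.0)`, the width is already 153 bits, while shifting the smaller operand always fits in the fixed 57 bits.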

Upvotes: 2

Eric Postpischil

Reputation: 223804

When adding numbers with the same sign (or subtracting numbers with opposite signs), the result has the same exponent as the greater operand or one more (according to whether a carry occurred or not). So there is less shifting to do if the smaller number is adjusted to match the larger.
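A quick way to see this in Python is to compare binary exponents with `math.frexp`. This is only a sketch; `exponent_growth` is a name made up for the example.

```python
import math

def exponent_growth(a, b):
    """How far the exponent of a + b lies above the larger operand's exponent
    (a and b are assumed to be finite, non-zero and of the same sign)."""
    larger = max(math.frexp(a)[1], math.frexp(b)[1])
    return math.frexp(a + b)[1] - larger

# No carry: 1.5 + 0.25 = 1.75 stays in [1, 2).
# Carry:    1.5 + 0.75 = 2.25 moves up to [2, 4).
for a, b in [(1.5, 0.25), (1.5, 0.75), (6.0, 5.0), (1e10, 3.0)]:
    assert exponent_growth(a, b) in (0, 1)
```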

With subtraction of numbers with the same sign (or addition of numbers of opposite signs), cancellation can leave the leading digit in a variety of positions, so there may be less difference between the choices. However, if the smaller number is adjusted to match the larger, only shifting in one direction is needed. If the larger is adjusted, there is an additional decision to make about which direction the shift must go.
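The varying position of the leading digit after cancellation can be observed the same way. Again just a sketch, with `exponent_drop` as an invented helper name.

```python
import math

def exponent_drop(a, b):
    """How far the exponent of a - b lies below the larger operand's exponent
    (a > b > 0 is assumed, so the difference is non-zero)."""
    larger = max(math.frexp(a)[1], math.frexp(b)[1])
    return larger - math.frexp(a - b)[1]

# The closer the operands, the more leading bits cancel.
assert exponent_drop(12.0, 1.0) == 0            # 11.0 is still in [8, 16)
assert exponent_drop(1.0, 0.5) == 1             # 0.5
assert exponent_drop(1.0, 0.75) == 2            # 0.25
assert exponent_drop(1.0, 1.0 - 2**-20) == 20   # 2**-20
```

So after a subtraction the normalising left shift can be anywhere from 0 places up to roughly the full significand width.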

Upvotes: 2
