Floating-point addition assembly algorithm

I'm trying to write a binary 8-bit floating-point addition algorithm for a PicoBlaze microcontroller (1 sign bit, 4 exponent bits, and 3 mantissa bits).

I got it to work with positive numbers, but I can't figure out how to do it when there are negative numbers too.

My main problem is setting the sign bit of the result. Can someone explain how to set it correctly?

My idea was to check the sign of both numbers. If they're both positive, set the sign to 0; if they're both negative, set the sign to 1 and use the same method as before for the addition; and if one is negative and one is positive, compare the numbers and use the sign bit of the larger one. But I'm not sure how to compare the two numbers, and the code is getting a little cluttered. Is there a better way to do it?

Upvotes: 2

Answers (3)

Alain Merigot

Reputation: 11557

You do not have to care about the signs of the operands if you convert them to two's complement.

1. Compare the exponents and align the mantissa of the number with the smaller exponent accordingly, while adding the hidden bit.
2. Turn the numbers into two's complement. This requires an extra bit to the left of the mantissa for the sign and another bit to absorb addition overflows. As a result, negative numbers will be represented by a number > 2, that is, the complement of their absolute value to 2^3. Note that the two most significant bits are always equal.
3. Perform the addition.
4. Detect overflows. If the two most significant bits of the result are not equal, there is an overflow. In that case, you must do an arithmetic right shift of the result and increment the exponent.
5. Detect underflows. If the three digits left of the point are equal, there is an underflow. In that case, shift left until either these three digits differ or all bits right of the point are zero, and adjust the exponent accordingly.
6. Rounding.
7. Turn back from two's complement to sign-absolute value representation and determine the sign of the result from its MSB.
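
The steps above can be sketched in Python (not PicoBlaze assembly) to check the bit manipulations before porting them. This is only an illustration under several assumptions that are not in the answer: the name fp_add and the constants are made up, operands are passed already unpacked as (sign, unbiased exponent, 3-bit fraction) with the hidden bit implied, there are no zero, denormal, infinity or NaN inputs, the 4-bit exponent range is not checked, rounding is plain truncation of the guard bits, and an exactly zero result is returned as None. It also adds one renormalization after the last step, because a sum such as (-1) + (-1) gives 110.000, whose magnitude is 10.000.

```python
FRAC_BITS = 3                    # stored mantissa bits
GUARD_BITS = 3                   # extra bits kept while aligning (6 bits right of the point)
POINT = FRAC_BITS + GUARD_BITS   # number of bits right of the point
WIDTH = POINT + 3                # sign bit + overflow bit + hidden bit + fraction
MASK = (1 << WIDTH) - 1


def fp_add(a, b):
    """Add two unpacked floats given as (sign, unbiased_exponent, fraction)."""
    (sa, ea, fa), (sb, eb, fb) = a, b

    # Step 1: add the hidden bit and align on the larger exponent.
    ma = ((1 << FRAC_BITS) | fa) << GUARD_BITS
    mb = ((1 << FRAC_BITS) | fb) << GUARD_BITS
    exp = max(ea, eb)
    ma >>= exp - ea
    mb >>= exp - eb

    # Step 2: two's complement on WIDTH bits.
    if sa:
        ma = -ma & MASK
    if sb:
        mb = -mb & MASK

    # Step 3: addition, discarding the carry out of the top bit.
    total = (ma + mb) & MASK

    # Step 4: overflow if the two most significant bits differ.
    if (total >> (WIDTH - 2)) in (0b01, 0b10):
        sign_bit = total >> (WIDTH - 1)
        total = (total >> 1) | (sign_bit << (WIDTH - 1))  # arithmetic right shift
        exp += 1

    # Step 5: underflow while the three digits left of the point are equal
    # and some bit right of the point is still set.
    frac_mask = (1 << POINT) - 1
    while (total >> POINT) in (0b000, 0b111) and (total & frac_mask):
        total = (total << 1) & MASK
        exp -= 1
    if total == 0:
        return None  # assumption: exact zero is reported as None

    # Step 6: rounding, here simply dropping the guard bits.
    total >>= GUARD_BITS
    width = FRAC_BITS + 3

    # Step 7: back to sign and absolute value.
    sign = total >> (width - 1)
    mag = (1 << width) - total if sign else total
    if mag >> (FRAC_BITS + 1):  # magnitude reached 10.000, renormalize
        mag >>= 1
        exp += 1
    return sign, exp, mag & ((1 << FRAC_BITS) - 1)
```

With these conventions, fp_add((0, 0, 0b100), (1, -1, 0b100)) corresponds to the first worked example and fp_add((0, 0, 0b010), (1, 0, 0b100)) to the second one.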

Example:

A=1.1 B=-1.1 2^-1

1. alignment. Numbers are extended to 6 bits right of point.

A=+1.100000
B=-0.110000

2. two's complement
A=001.100000
B=2C(000.110000)=111.010000

3 addition

A       001.100000
+B      111.010000
=       000.110000

4 overflows none

5 underflows: shift result left 1 step and decrement exponent
001.100000 2^-1

6 rounding
001.100 2^-1

7 back to sign absolute value
+ (1.)100 2^-1

Another example with a negative result

A=1.01 B=-1.1

1. alignment. Numbers are extended to 6 bits right of point.

A=+1.010000
B=-1.100000

2. two's complement
A=001.010000
B=2C(001.100000)=110.100000

3 addition

 A      001.010000
+B      110.100000
=       111.110000

4 overflows none (no overflow can happen if the signs are different)

5 underflows: shift result left 2 steps and decrement exponent by 2
111.000000 2^-2

6 rounding
111.000 2^-2 (<0)

7 back to sign absolute value
-(1.)000 2^-2

Upvotes: 3

Brendan

Reputation: 37212

In general (ignoring things like NaN), for A = B + C:

if C has larger magnitude than B, swap B and C so that you know that B must have "larger or equal" magnitude. Note: Magnitude ignores the sign bits (e.g. -6 has larger magnitude than +4 because 6 > 4).
if B and C have different signs, negate C and do subtract_internal; else do add_internal.
for subtract_internal, ignore the sign bits, subtract the magnitudes (not forgetting that B must have "larger or equal" magnitude), then set the sign of A equal to the sign of either B or C (they will have the same sign anyway).
for add_internal, ignore the sign bits, add the magnitudes, then set the sign of A equal to the sign of either B or C (they will have the same sign anyway).
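
To make the sign logic concrete, here is a small Python sketch of the addition case. It is a simplification, not a full floating-point adder: the name add_sign_magnitude is made up, each operand is a (sign, magnitude) pair whose magnitude is a plain non-negative integer standing in for an already aligned mantissa, and an exactly zero result is reported as +0 (an assumption).

```python
def add_sign_magnitude(b, c):
    """Return b + c for operands given as (sign, magnitude) pairs, sign is 0 or 1."""
    (sb, mb), (sc, mc) = b, c

    # Make sure b is the operand with the larger-or-equal magnitude.
    if mc > mb:
        (sb, mb), (sc, mc) = (sc, mc), (sb, mb)

    if sb != sc:
        mag = mb - mc   # subtract_internal: cannot go negative after the swap
    else:
        mag = mb + mc   # add_internal

    # The result takes the sign of the larger operand.
    sign = sb if mag else 0  # assumption: exact zero is reported as +0
    return sign, mag
```

For instance, (+4) + (-6) is add_sign_magnitude((0, 4), (1, 6)): the operands are swapped, the magnitudes are subtracted, and the result keeps the sign of -6.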

Also, in general (ignoring things like NaN), for A = B - C:

if C has larger magnitude than B, swap B and C and negate both of them (e.g. B - C == (-C) - (-B)) so that you know that B must have "larger or equal" magnitude.
if B and C have different signs, negate C and do add_internal; else do subtract_internal.
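
The subtraction rules can be sketched the same way in Python. As before, this is only an illustration: sub_sign_magnitude is a made-up name, operands are (sign, magnitude) pairs with integer magnitudes standing in for aligned mantissas, and an exactly zero result is reported as +0 (an assumption).

```python
def sub_sign_magnitude(b, c):
    """Return b - c for operands given as (sign, magnitude) pairs, sign is 0 or 1."""
    (sb, mb), (sc, mc) = b, c

    # B - C == (-C) - (-B): swap the operands and negate both, so that
    # the first operand has the larger-or-equal magnitude.
    if mc > mb:
        (sb, mb), (sc, mc) = (sc ^ 1, mc), (sb ^ 1, mb)

    if sb != sc:
        mag = mb + mc   # different signs: negate C and add the magnitudes
    else:
        mag = mb - mc   # same sign: subtract the magnitudes

    sign = sb if mag else 0  # assumption: exact zero is reported as +0
    return sign, mag
```

For instance, (+4) - (+6) becomes (-6) - (-4) after the swap, so the magnitudes are subtracted and the result is -2.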

Upvotes: 3

alias

Reputation: 30460

You're in luck. Assuming you're using an IEEE 754-like representation (i.e., the exponent is stored with an appropriate bias), you can simply compare the bit strings lexicographically after a bit of massaging. Note that this assumes you have already handled NaN values appropriately, since NaNs should simply propagate through your adder.

The trick is this:

You ignore the sign of -0 (i.e., if you have 10000000, then treat that as 00000000.)
If the sign bit is 1, then flip all the bits (including the sign bit)
If the sign bit is 0, then flip the sign bit (keep the others the same)

Now you can compare these two bit strings lexicographically; the one that comes earlier in dictionary order is the smaller. You might have to carefully arrange how you process -0, but I suspect that's not really a big issue for you.
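
As a rough illustration of the bit-flip trick, here is a Python sketch for the 8-bit layout from the question (sign in bit 7). The name order_key is made up, and the sketch assumes the inputs contain no NaN or infinity encodings.

```python
def order_key(bits):
    """Map an 8-bit sign/exponent/mantissa pattern to an unsigned key
    that sorts in the same order as the encoded values."""
    if bits == 0x80:           # treat -0 as +0
        bits = 0x00
    if bits & 0x80:
        return ~bits & 0xFF    # negative: flip all bits, including the sign
    return bits ^ 0x80         # non-negative: flip only the sign bit


# Sorting raw bit patterns by their key orders them from most negative to most positive.
patterns = [0x38, 0xB8, 0x00, 0x81, 0x01]
ordered = sorted(patterns, key=order_key)
```

Comparing two keys with an ordinary unsigned comparison then tells you which operand is smaller. If only the magnitudes matter, as when choosing the sign of a sum, comparing bits & 0x7F directly should be enough under the same assumptions.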

In fact, this is precisely the reason why exponents are stored with bias, so that you can compare floats by simply treating them as unsigned numbers, after doing the bit-flip trick I mentioned above.

Upvotes: 2
