Reputation: 586

C++ Floating Point Addition (from scratch): Negative results cannot be computed

I am implementing a floating point addition program from scratch, following the methodology listed out in this PDF: https://www.cs.colostate.edu/~cs270/.Fall20/resources/FloatingPointExample.pdf

The main issue I am having is that addition works when the result is positive (e.g. -10 + 12, 3 + 5.125), but it does not work when the result is negative. This is because I do not understand how to implement the following step:

Step 5: Convert result from 2’s complement to signed magnitude
If the result is negative, convert the mantissa back to signed magnitude by inverting the bits and adding 1. The result is
positive in this example, so nothing needs to be done.

How do I determine if the result is negative without using floating point addition (I am not allowed to use any float or double adds)? Of course I could check whether the current and the next floats are negative and compare their magnitudes, but that would defeat the purpose of this assignment.

If given only the following:

Sign bit, exponent, and mantissa of X
Sign bit, exponent, and mantissa of Y
Mantissa and exponent of Z

How do I determine whether Z = X + Y is negative just with the above data and not using any floating point addition?

Upvotes: 2

Answers (4)

Cem

Reputation: 1296

If you are following the PDF you posted, you should have converted the numbers to 2's complement at Step 3. After the addition in Step 4 (adding the shifted numbers), the result is still in 2's complement.

To check if the result is negative, you need to check the leftmost bit (the sign bit) in the resulting bit pattern. In 2's complement, this bit is 1 for negative numbers, and 0 for nonnegative numbers.

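sign = signBit;            // remember the sign for the final float
if (signBit) {
  result = ~result + 1;    // negative: invert the bits and add 1 to get the magnitude
}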

If you are using unsigned integers to hold the bit pattern, you could make them of a fixed size, so that you are able to find the sign bit using shifts later.

uint64_t result;
...
signBit = (result >> 63) & 1;
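
Putting the two snippets together, here is a minimal sketch of Step 5, assuming the Step 4 sum is held as a 64-bit two's complement pattern in an unsigned integer. The names `SignMagnitude` and `toSignMagnitude` are made up for illustration and do not come from the PDF.

```cpp
#include <cstdint>

// Hypothetical helper type: sign and magnitude of the mantissa sum.
struct SignMagnitude {
    bool negative;       // true when the two's complement sum was negative
    uint64_t magnitude;  // absolute value of the sum
};

// Step 5 as a sketch: `sum` is the Step 4 result, a 64-bit two's complement
// bit pattern stored in an unsigned integer (an assumption of this example).
inline SignMagnitude toSignMagnitude(uint64_t sum) {
    // The leftmost bit is 1 for negative values and 0 otherwise.
    const bool negative = ((sum >> 63) & 1u) != 0;
    // Inverting and adding 1 negates a two's complement value. Unsigned
    // arithmetic wraps around, so this is well defined in C++.
    const uint64_t magnitude = negative ? (~sum + 1u) : sum;
    return {negative, magnitude};
}
```

No floating point operation is involved: the sign comes from one shift and the magnitude from one integer negation.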

Upvotes: 2

old_timer

Reputation: 71536

The only difference between grade school math and what we do with floating point is that we have twos complement (base 2 vs base 10 is not really relevant, just makes life easier). So if you made it through grade school you know how all of this works.

In decimal in grade school you align the decimal points and then do the math. With floating point we shift the smaller number and discard its mantissa (sorry, fraction) bits to line it up with the larger number.

In grade school if doing subtraction you subtract the smaller number from the larger number once you resolve the identities

a - (-b) = a + b
-a + b = b - a

and so on so that you either have

n - m

n + m

And then you do the math. Apply the sign based on what you had to do to get a-b or a+b.

The beauty of twos complement is that a negation or negative is invert and add one, which feeds nicely into logic.

a - b = a + (-b) = a + (~b) + 1

So you do not re-arrange the operands, but you might have to negate the second one. Also, you do not have to remember the sign of the result; the result tells you its sign.
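
As a small sketch of that identity (the function names here are made up for illustration), subtraction can be done with nothing but an add, an invert and a carry-in of one, and the sign can be read off the top bit of the result:

```cpp
#include <cstdint>

// a - b using only addition: invert b and add one (the carry-in).
constexpr uint32_t subtractViaAdd(uint32_t a, uint32_t b) {
    return a + ~b + 1u;
}

// The result tells you its own sign: the most significant bit is set
// exactly when the two's complement value is negative.
constexpr bool isNegative(uint32_t x) {
    return (x >> 31) != 0;
}

// Compile-time checks of the two cases worked through in this answer.
static_assert(subtractViaAdd(4, 2) == 2, "4 - 2 is 2");
static_assert(!isNegative(subtractViaAdd(4, 2)), "4 - 2 is positive");
static_assert(isNegative(subtractViaAdd(2, 4)), "2 - 4 is negative");
```

The sketch uses unsigned integers on purpose, because unsigned wrap-around is well defined in C++.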

So align the points and put it in the form

a + b  
a + (-b)

Where a can be positive or negative, but depending on b's sign and the operation you may need to negate b.

Do the addition.

If the result is negative, negate the result into a positive

Normalize

IEEE is only involved in the desire to have the 1.fraction be positive; other floating point formats allow for a negative whole.fraction and do not negate, they simply normalize. The rest of it is just grade school math (plus twos complement).
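
The steps above (align, add, negate if negative, normalize) can be sketched with integer operations only. This is a rough illustration under stated assumptions, not a full IEEE implementation: the `Unpacked` struct and `addUnpacked` are hypothetical names, the mantissa is assumed to carry the hidden 1 at bit 23, and rounding, infinities, NaNs and denormals are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unpacked value (names are illustrative):
// value = (-1)^negative * (mantissa / 2^23) * 2^exponent,
// with the hidden 1 stored explicitly at bit 23 of mantissa.
struct Unpacked {
    bool negative;
    int exponent;
    uint64_t mantissa;
};

inline Unpacked addUnpacked(Unpacked x, Unpacked y) {
    // Keep the magnitudes small enough that the sum cannot reach bit 63.
    assert((x.mantissa >> 62) == 0 && (y.mantissa >> 62) == 0);

    // Align the points: make x the operand with the larger exponent, then
    // shift y right, discarding the bits that fall off (no rounding).
    if (x.exponent < y.exponent) { const Unpacked t = x; x = y; y = t; }
    const int shift = x.exponent - y.exponent;
    y.mantissa = (shift < 64) ? (y.mantissa >> shift) : 0;

    // Sign magnitude to two's complement: negate with invert and add one.
    const uint64_t a = x.negative ? (~x.mantissa + 1u) : x.mantissa;
    const uint64_t b = y.negative ? (~y.mantissa + 1u) : y.mantissa;

    // Do the addition; the top bit of the sum is the sign of the result.
    const uint64_t sum = a + b;
    Unpacked z;
    z.negative = (sum >> 63) != 0;
    z.mantissa = z.negative ? (~sum + 1u) : sum;  // back to a positive magnitude
    z.exponent = x.exponent;

    // Normalize so the leading 1 sits at bit 23 (a zero result is left as is).
    while (z.mantissa >= (uint64_t{1} << 24)) { z.mantissa >>= 1; ++z.exponent; }
    while (z.mantissa != 0 && z.mantissa < (uint64_t{1} << 23)) { z.mantissa <<= 1; --z.exponent; }
    return z;
}
```

For instance, adding `{false, 1, 1 << 23}` (that is, 2) and `{true, 2, 1 << 23}` (that is, -4) gives a negative result with exponent 1 and mantissa `1 << 23`, i.e. -1.0 * 2^1.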

Some examples

2 + 4

in binary the numbers are

+10
+100

which converted to a normalized form are

+1.0  * 2^1
+1.00 * 2^2

need same exponent (align the point)

+0.10 * 2^2
+1.00 * 2^2

both are positive so no change just do the addition

this is the base form, I put more sign extension out front than needed to make the sign of the result much easier to see.

      0
 000010
+000100
=======

fill it in

 000000
 000010
+000100
========
 000110

result is positive (msbit of result is zero) so normalize

+1.10 * 2^2

4+5

100
101

+1.00 * 2^2
+1.01 * 2^2

same exponent both positive

      0
 000100
+000101
=======

 001000
 000100
+000101
=======
 001001

result is positive so normalize

+1.001 * 2^3

4 - 2

100
10

+1.00 * 2^2
+1.0  * 2^1

need the same exponent

+1.00 * 2^2
+0.10 * 2^2

subtract a - b = a + (-b)

     1 <--- add one
 00100
+11101 <--- invert
=======

fill it in

 11011
 00100
+11101
=======
 00010

result is positive so normalize

+1.0 * 2^1

2 - 4

10
100

+1.0 * 2^1
+1.00 * 2^2

make same exponent

+0.10 * 2^2
+1.00 * 2^2

do the math

a - b = a + (-b)

      1
 000010
+111011
========

fill it in

 000111
 000010
+111011
========
 111110

result is negative so negate (0 - n)

 000011  <--- add one  
 000000
+000001  <--- invert
=========
 000010

normalize

-1.0 * 2^1

Upvotes: 0

MSalters

Reputation: 179907

The key insight is that many floating-point formats keep the sign and mantissa separate, so the mantissa is an unsigned integer. The sign and mantissa can be trivially combined to create a signed integer. You can then use signed integer arithmetic to add or subtract the two mantissas of your floating-point numbers.
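
A minimal sketch of that idea, assuming the two mantissas have already been shifted to the same exponent and fit in 32 bits (the names `toSigned`, `SignedSum` and `addAligned` are invented for this example):

```cpp
#include <cstdint>

// Combine a sign flag and an unsigned mantissa into one signed integer.
// int64_t is wide enough that a 32-bit mantissa can never overflow it.
constexpr int64_t toSigned(bool negative, uint32_t mantissa) {
    return negative ? -static_cast<int64_t>(mantissa)
                    : static_cast<int64_t>(mantissa);
}

// Hypothetical result type: sign and magnitude of the sum.
struct SignedSum {
    bool negative;
    uint64_t magnitude;
};

// Add two already-aligned mantissas with plain signed integer arithmetic.
inline SignedSum addAligned(bool xNeg, uint32_t xMant, bool yNeg, uint32_t yMant) {
    const int64_t sum = toSigned(xNeg, xMant) + toSigned(yNeg, yMant);
    const bool negative = sum < 0;  // the integer sum carries the sign
    return {negative, static_cast<uint64_t>(negative ? -sum : sum)};
}
```

Here the compiler's signed integers take care of the two's complement conversion, so the sign of Z is simply `sum < 0`.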

Upvotes: 3

Sneftel

Reputation: 41474

At step 5, you’ve already added the mantissas. To determine whether the result is positive or negative, just check the sign bit of that sum.

Upvotes: 0
