Reputation: 586
I am implementing a floating point addition program from scratch, following the methodology listed out in this PDF: https://www.cs.colostate.edu/~cs270/.Fall20/resources/FloatingPointExample.pdf
The main issue I am having is that addition works when the result is positive (e.x. -10 + 12, 3 + 5.125), but the addition does not work when the result is negative. This is because do not understand how to implement the following step:
Step 5: Convert result from 2’s complement to signed magnitude
If the result is negative, convert the mantissa back to signed magnitude by inverting the bits and adding 1. The result is
positive in this example, so nothing needs to be done.
How do I determine if the result is negative without using floating point addition (I am not allowed to use any floating or double adds)? Of course I can see if the current and the next floats are negative and see their cumulative quantities, but that would defeat the purposes of this assignment.
If given only the following:
How do I determine whether Z = X + Y
is negative just with the above data and not using any floating point addition?
Upvotes: 2
Views: 380
Reputation: 1296
If you are following the PDF you posted, you should have converted the numbers to 2's complement at Step 3. After the addition in Step 4, you have the result in 2's complement. (Result of adding the shifted numbers)
To check if the result is negative, you need to check the leftmost bit (the sign bit) in the resulting bit pattern. In 2's complement, this bit is 1 for negative numbers, and 0 for nonnegative numbers.
sign = signBit;
if (signBit) {
result = ~result + 1;
}
If you are using unsigned integers to hold the bit pattern, you could make them of a fixed size, so that you are able to find the sign bit using shifts later.
uint64_t result;
...
signBit = (result >> 63) & 1;
Upvotes: 2
Reputation: 71536
The only difference between grade school math and what we do with floating point is that we have twos complement (base 2 vs base 10 is not really relevant, just makes life easier). So if you made it through grade school you know how all of this works.
In decimal in grade school you align the decimal points and then do the math. With floating point we shift the smaller number and discard it's mantissa (sorry fraction) bits to line it up with the larger number.
In grade school if doing subtraction you subtract the smaller number from the larger number once you resolve the identities
a - (-b) = a + b
-a + b = b - a
and so on so that you either have
n - m
or
n + m
And then you do the math. Apply the sign based on what you had to do to get a-b or a+b.
The beauty of twos complement is that a negation or negative is invert and add one, which feeds nicely into logic.
a - b = a + (-b) = a + (~b) + 1
so you do not re-arrange the operands but you might have to negate the second one. Also you do not have to remember the sign of the result the result tells you its sign.
So align the points put it in the form
a + b
a + (-b)
Where a can be positive or negative but b's sign and the operation may need to negate b.
Do the addition.
If the result is negative, negate the result into a positive
Normalize
IEEE is only involved in the desire to have the 1.fraction be positive, other floating point formats allow for negative whole.fraction and do not negate, simply normalize. The rest of it is just grade school math (plus twos complement)
Some examples
2 + 4
in binary the numbers are
+10
+100
which converted to a normalized form are
+1.0 * 2^1
+1.00 * 2^2
need same exponent (align the point)
+0.10 * 2^2
+1.00 * 2^2
both are positive so no change just do the addition
this is the base form, I put more sign extension out front than needed to make the sign of the result much easier to see.
0
000010
+000100
=======
fill it in
000000
000010
+000100
========
000110
result is positive (msbit of result is zero) so normalize
+1.10 * 2^2
4+5
100
101
+1.00 2^2
+1.01 2^2
same exponent both positive
0
000100
+000101
=======
001000
000100
+000101
=======
001001
result is positive so normalize
+1.001 * 2^3
4 - 2
100
10
+1.00 * 2^2
+1.0 * 2^1
need the same exponent
+1.00 * 2^2
+0.10 * 2^2
subtract a - b = a + (-b)
1 <--- add one
00100
+11101 <--- invert
=======
fill it in
11011
00100
+11101
=======
00010
result is positive so normalize
+1.0 * 2^1
2 - 4
10
100
+1.0 * 2^1
+1.00 * 2^2
make same exponent
+0.10 * 2^2
+1.00 * 2^2
do the math
a - b = a + (-b)
1
000010
+111011
========
fill it in
000111
000010
+111011
========
111110
result is negative so negate (0 - n)
000011 <--- add one
000000
+000001 <--- invert
=========
000010
normalize
-1.0 * 2^1
Upvotes: 0
Reputation: 179907
The key insight is that many floating-point formats keep the sign and mantissa separate, so the mantissa is an unsigned integer. The sign and mantissa can be trivially combined to create a signed integer. You can then use signed integer arithmetic to add or subtract the two mantissa's of your floating-point number.
Upvotes: 3
Reputation: 41474
At step 5, you’ve already added the mantissas. To determine whether the result is positive or negative, just check the sign bit of that sum.
Upvotes: 0