Reputation: 11

Floating point truncation vs rounding by hand

I am trying to convert a decimal number to a 32-bit floating point value. I have to do this by hand (pencil and paper), and so far my number is

1.11010110111100110100010011(base 2) x 2^26

Now I know that the mantissa can only store 23 bits, so I need to show what it would look like with rounding and without rounding (truncation). My question is: what determines rounding? I know truncation will result in this

1.11010110111100110100010(base 2) x 2^26

Does rounding just look at the bit to the right, rounding up if it is a 1 and down if it is a 0?

What if the number was

1.11010110111100110100010111(base 2) x 2^26 where there is a one to the right?

What if the bit at 2^3 was a 1 and the bit at 2^2 (to its right) was also a 1, like in this example

1.11010110111100110100011111(base 2) x 2^26

Thanks, I am just a little unclear about rounding at this stage.

Upvotes: 1

Answers (2)

Rudy Velthuis

Reputation: 28806

Rounding is generally done to the nearest value that can be represented with the bits you keep. But if the value is exactly halfway between two such values, i.e. if the highest bit you want to get rid of is 1 and the others are 0, there are several so-called tie-breaking rules:

truncating (towards 0)
up (towards +infinity)
down (towards -infinity)
away from 0
banker's rounding (to the nearest even more significant digit).

Which rule is applied is something that must be defined. AFAIK, most FPUs use banker's rounding as a default (IEEE 754 calls it round to nearest, ties to even).

In our case, you throw away 3 binary digits. 000 is simply truncated; 001-011 always round down; 101-111 always round up; and 100 invokes the tie-breaking rule. If the result is to round up, you add 1 to the least significant bit you keep and, if that carries out of the mantissa, shift right by one and increment the exponent.
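The classification of the discarded bits can be sketched in a few lines of Python. This is only an illustration, not how any particular FPU does it: the function name `round_drop`, its parameters, and the rule names (`"zero"`, `"up"`, `"down"`, `"away"`, `"even"`) are made up for this example, and the value is handled as a non-negative magnitude with a separate sign flag.

```python
def round_drop(magnitude, drop, tie="even", negative=False):
    """Drop the lowest `drop` bits (drop >= 1) of a non-negative magnitude.

    Rounds to nearest; `tie` decides only the exact halfway case,
    i.e. when the discarded bits are 100...0.
    (Hypothetical helper, names chosen for this sketch.)
    """
    kept = magnitude >> drop                  # bits that survive
    rest = magnitude & ((1 << drop) - 1)      # bits thrown away
    half = 1 << (drop - 1)                    # pattern 100...0

    if rest > half:                           # 101..111: always round up
        return kept + 1
    if rest < half:                           # 000..011: always round down
        return kept

    # Exact tie: apply the chosen tie-breaking rule to the magnitude.
    round_up = {
        "zero": False,                        # towards 0
        "away": True,                         # away from 0
        "up": not negative,                   # towards +infinity
        "down": negative,                     # towards -infinity
        "even": bool(kept & 1),               # banker's: make the last kept bit 0
    }[tie]
    return kept + 1 if round_up else kept


# Tie with an odd kept part (1011|100) and an even kept part (1010|100):
print(format(round_drop(0b1011100, 3, "even"), "b"))
print(format(round_drop(0b1010100, 3, "even"), "b"))
print(format(round_drop(0b1010100, 3, "away"), "b"))
```

With banker's rounding the two ties land on 1100 and 1010 respectively, so the last kept bit ends up 0 either way; the other rules only differ from each other on the 100 pattern.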

In your first case, you simply truncate the bits, since they (011) are below 100. But if this is the value

1.11010110111100110100011111

and you want to remove 3 bits, it is first truncated to

1.11010110111100110100011

but because the bits you threw away were 111, you round up, so you add 1 to the last remaining bit, and it becomes

1.11010110111100110100100

IOW, the lowest bits 011 become 100.
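If you want to check the pencil-and-paper results, here is a minimal Python sketch that rounds the fraction bits of the three numbers from the question to 23 bits, using round to nearest with ties to even. The helper names `round_fraction` and `float32_fraction` are invented for this example. The second helper cross-checks against the machine by packing the value as a 32-bit float with the standard `struct` module, on the assumption that the platform converts with the usual default rounding.

```python
import struct


def round_fraction(frac, keep=23):
    """Round a string of fraction bits to `keep` bits (nearest, ties to even).

    Returns (rounded_bits, carry). `carry` is True when rounding overflowed
    the fraction, meaning the exponent has to be incremented.
    """
    frac = frac.ljust(keep, "0")              # pad short inputs with zeros
    kept = int(frac[:keep], 2)
    dropped_bits = frac[keep:]
    if dropped_bits:
        dropped = int(dropped_bits, 2)
        half = 1 << (len(dropped_bits) - 1)   # pattern 100...0
        if dropped > half or (dropped == half and (kept & 1) == 1):
            kept += 1
    carry = (kept >> keep) == 1
    kept &= (1 << keep) - 1
    return format(kept, "0{}b".format(keep)), carry


def float32_fraction(value):
    """Fraction field (23 bits) of `value` after conversion to a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", float(value)))
    return format(bits & 0x7FFFFF, "023b")


# Fraction bits of the three numbers in the question (all times 2^26).
examples = [
    "11010110111100110100010011",
    "11010110111100110100010111",
    "11010110111100110100011111",
]
for frac in examples:
    rounded, carry = round_fraction(frac)
    # With 26 fraction bits and exponent 26, the value is the integer "1" + frac.
    machine = float32_fraction(int("1" + frac, 2))
    print(rounded, carry, machine == rounded)
```

The first number keeps ...100010 (discarded bits 011), the second becomes ...100011, and the third becomes ...100100, which agrees with the hand calculation above.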

Upvotes: 1

Paul R

Reputation: 212979

Truncation and rounding of binary numbers work much like they do for decimals. In theory you would need to look at all of the discarded bits to do "correct" rounding, but in practice hardware implementations typically keep only a few extra bits to the right (commonly a guard bit, a round bit, and a "sticky" bit that records whether any lower bit was 1) to determine whether to round up.
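To see why a couple of bits can be enough, here is a small Python sketch comparing a decision that inspects every discarded bit with one that only uses two pieces of information: the first discarded bit and a flag saying whether any later discarded bit is 1. The terms "guard" and "sticky" are borrowed from common hardware practice, the function names are hypothetical, and round to nearest with ties to even is assumed.

```python
from itertools import product


def round_up_full(kept, dropped_bits):
    """Decide by comparing all discarded bits against the halfway pattern."""
    dropped = int(dropped_bits, 2)
    half = 1 << (len(dropped_bits) - 1)       # pattern 100...0
    return dropped > half or (dropped == half and (kept & 1) == 1)


def round_up_guard_sticky(kept, dropped_bits):
    """Decide from just the guard bit and the sticky flag."""
    guard = dropped_bits[0] == "1"            # first discarded bit
    sticky = "1" in dropped_bits[1:]          # OR of all the remaining ones
    return guard and (sticky or (kept & 1) == 1)


# Exhaustively compare both decisions for every 5-bit discarded pattern.
agree = all(
    round_up_full(kept, "".join(bits)) == round_up_guard_sticky(kept, "".join(bits))
    for kept in range(4)
    for bits in product("01", repeat=5)
)
print(agree)
```

The two functions agree on every pattern, because anything below the first discarded bit only matters as "zero" versus "not zero".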

Upvotes: 1
