Reputation: 9114

How to divide a fixed point number by a small float number?

I need to divide a set of 5U11 numbers by 6.02 and would prefer to do this without casting to float and back.

5U11 means a 16-bit unsigned number, with the 11 least significant bits representing fractional part.

How should I represent 6.02 and what would be the upper bond of the error of a single calculation?

Upvotes: 3

Answers (2)

al45tair

Reputation: 4433

The most straightforward approach to this problem is to calculate the inverse of 6.02 as a 16 bit quantity; i.e. compute round(2^16 / 6.02) = 0x2a86. Note that the top bit is not set, so we can choose a higher dividend and recompute for better accuracy; in this case, round(2^18 / 6.02) = 0xaa1a.

Now, take your 5U11 number and do a 16x16 to 32-bit widening multiply, then shift right by (in this case) 18 bits to get your result, as a 5U11 value.

For example:

14.3562 * (2^18 / 6.02) = 625148.122 / 2^18 = 2.384
0x72d9  * 0xaa1a        = 0x4c4fc40a >> 18  = 0x1313

You do lose a little accuracy doing things this way, and it’s possible to improve slightly on this naïve method (see the book Hacker’s Delight by Henry S. Warren for quite a bit more on this topic and other useful things).

Plainly if you have a machine that is capable of doing wider multiplications, you can increase the size of the dividend further than 2^18, which will increase your accuracy.

Updates

Rounding

If you want to round to nearest, you should add d / 2 where d is your dividend (so in the example above, the dividend is 2^18, hence the rounding value is 2^17 or 0x20000.

Error analysis

Given the small domain, it’s simplest to do an exhaustive search to establish the maximum error. With the example above and using round-to-nearest by adding 0x20000, the maximum error turns out to be at x = 0xfa19:

0xfa19 * 0xaa1a + 0x20000 = 0xa62e008a >> 18 = 0x298c

The actual answer should be

31.2622 / 6.02 = 5.193058

while the answer we have is

0x298c * 2^-11 = 5.193359

The error in this instance is 0.000302, or 0.62 of an LSB.

Improving on these results

It’s possible to choose a more specific rounding constant to minimise the error bound; essentially this lets us compensate for the fact that our multiplicative inverse (here 0xaa1a) is not precise. In this particular instance, the best value seems to be around 0x1c200, which yields an error bound of 0.56 of an LSB.

Upvotes: 2

chux

Reputation: 153457

A simple scaling by 100 will suffice.

uns16_t x_5U11;
uns32_t acc;
acc = x_5U11;
acc *= 100;
acc += 301; // for round to nearest rather than truncation.
acc /= 602;

Error bound: 1/2 LSbit in x_5U11.

If speed is most important, than performing a multiple and divide (by shifting) as suggested by @alastai is the way to go. With proper rounding, the answer should be within +/- 1 LSBit.

If accuracy is most important than this method provides +/- 1/2 LSbit (best possible answer).

[Edit] Thanks to @Ingo Leonhardt for pointing out I had an inverted solution.