Reputation: 9114
I need to divide a set of 5U11 numbers by 6.02
and would prefer to do this without casting to float and back.
5U11 means a 16-bit unsigned number, with the 11 least significant bits representing fractional part.
How should I represent 6.02
and what would be the upper bond of the error of a single calculation?
Upvotes: 3
Views: 328
Reputation: 4433
The most straightforward approach to this problem is to calculate the inverse of 6.02 as a 16 bit quantity; i.e. compute round(2^16 / 6.02) = 0x2a86. Note that the top bit is not set, so we can choose a higher dividend and recompute for better accuracy; in this case, round(2^18 / 6.02) = 0xaa1a.
Now, take your 5U11 number and do a 16x16 to 32-bit widening multiply, then shift right by (in this case) 18 bits to get your result, as a 5U11 value.
For example:
14.3562 * (2^18 / 6.02) = 625148.122 / 2^18 = 2.384
0x72d9 * 0xaa1a = 0x4c4fc40a >> 18 = 0x1313
You do lose a little accuracy doing things this way, and it’s possible to improve slightly on this naïve method (see the book Hacker’s Delight by Henry S. Warren for quite a bit more on this topic and other useful things).
Plainly if you have a machine that is capable of doing wider multiplications, you can increase the size of the dividend further than 2^18, which will increase your accuracy.
If you want to round to nearest, you should add d / 2 where d is your dividend (so in the example above, the dividend is 2^18, hence the rounding value is 2^17 or 0x20000
.
Given the small domain, it’s simplest to do an exhaustive search to establish the maximum error. With the example above and using round-to-nearest by adding 0x20000
, the maximum error turns out to be at x = 0xfa19
:
0xfa19 * 0xaa1a + 0x20000 = 0xa62e008a >> 18 = 0x298c
The actual answer should be
31.2622 / 6.02 = 5.193058
while the answer we have is
0x298c * 2^-11 = 5.193359
The error in this instance is 0.000302, or 0.62 of an LSB.
It’s possible to choose a more specific rounding constant to minimise the error bound; essentially this lets us compensate for the fact that our multiplicative inverse (here 0xaa1a
) is not precise. In this particular instance, the best value seems to be around 0x1c200
, which yields an error bound of 0.56 of an LSB.
Upvotes: 2
Reputation: 153457
A simple scaling by 100 will suffice.
uns16_t x_5U11;
uns32_t acc;
acc = x_5U11;
acc *= 100;
acc += 301; // for round to nearest rather than truncation.
acc /= 602;
Error bound: 1/2 LSbit in x_5U11
.
--
If speed is most important, than performing a multiple and divide (by shifting) as suggested by @alastai is the way to go. With proper rounding, the answer should be within +/- 1 LSBit.
If accuracy is most important than this method provides +/- 1/2 LSbit (best possible answer).
[Edit] Thanks to @Ingo Leonhardt for pointing out I had an inverted solution.
Upvotes: 2