Reputation: 469
I'm using avr-gcc, where sizeof(double) and sizeof(float) are both 4, and I'm having trouble getting the correct integer result from double arithmetic:
// x is some value between 8.0 and 9.6103
double x = 9.6103;
uint32_t r = pow(x,2) * 8813377.768984962;
The correct value of r should be 813984763 (rounded down), but the actual result is 813984768.
How can I get the correct integer result?
I've tried to split the calculation like this:
uint32_t r1 = pow(x,2) * 8813;
double d1 = pow(x,2) * .77768984962;
uint32_t r = r1 + d1;
But this still suffers from precision issues, i.e. I can't seem to get 813984763 exactly. I only need the integer part of the result to be correct. Any ideas?
Upvotes: 0
Views: 131
Reputation: 141890
You could scale it up and use 128-bit integers to do the arithmetic. 128 bits is so much headroom that you can scale every operand to an integer and simply multiply.
double x = 9.6103;
uint128_t y = x * 10000; // y = 96103, i.e. x = y / 10000
uint128_t c = 8813377768984962; // = 8813377.768984962 * 1000000000
uint32_t r = y * y * c / 10000 / 10000 / 1000000000;
// max y * y * c = 96103 * 96103 * 8813377768984962 =
//               = 81398476378849607561973858
// UINT128_MAX   = 340282366920938463463374607431768211455
// ^^ is way more, so it will not overflow.
Your platform most probably does not support the __uint128_t GCC extension, so you would need a library for that. There are plenty of 128-bit libraries in C++ on GitHub: port one to C (or find one in C) and use it.
Well, I got some free time and I always wanted to have a C uint128 library, so I took this library https://github.com/calccrypto/uint128_t, ported it to C, wrote an executable that does the same computation as presented above, compiled it for atmega128 with avr-gcc -Os, and ran avr-nm -td --sort-size over the result. These are the 5 biggest symbols in the result; the whole program has ~12KB of .text, so a bit of space is needed for this solution to work.
00000642 T how_to_split_double_multiplication
00000706 T kuint128_rshift
00000762 T kuint128_lshift
00003104 T kuint128_mul
00004594 T kuint128_divmod
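If that footprint is too large, the same scaled-integer idea may also fit in 64-bit arithmetic by splitting the constant into its integer part and its fractional part. This is a different technique from the 128-bit port above, sketched under the assumption that uint64_t is available on the target and that y is x scaled by 10000 and no larger than 96103. The function name and the constant names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the answer's library):
 * returns floor((y / 10000)^2 * 8813377.768984962) using 64-bit integers only.
 * y is x scaled by 10000, e.g. 96103 for x = 9.6103.
 * Assumes y <= 96103 so that no intermediate product exceeds 64 bits. */
uint32_t scaled_square_times_c(uint32_t y)
{
    const uint64_t C_INT  = 8813377ULL;    /* integer part of the constant */
    const uint64_t C_FRAC = 768984962ULL;  /* fractional part, scaled by 1e9 */

    uint64_t y2 = (uint64_t)y * y;                   /* x^2 scaled by 1e8, <= 9235786609 */
    uint64_t hi = y2 * C_INT;                        /* <= ~8.14e16 */
    uint64_t lo = y2 * C_FRAC / 1000000000ULL;       /* product <= ~7.11e18, below 2^64 */

    /* hi + lo equals floor(y2 * c), where c is the constant scaled by 1e9
     * and the sum is still scaled by 1e8; dividing removes that scale. */
    return (uint32_t)((hi + lo) / 100000000ULL);
}
```

Because nested truncating integer divisions give the same result as a single division by the product of the divisors, truncating lo does not change the final integer part. For y = 96103 this yields 813984763.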
Upvotes: 0
Reputation: 215577
A float cannot represent the precision you need for this value (813984763), much less for the calculation, and as you've noted avr-gcc has wrongly redefined double to be the same as float.
The closest representable values in float are:
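An IEEE-754 single-precision float has a 24-bit significand, so between 2^29 and 2^30 the representable values are spaced 64 apart. The neighbors of 813984763 can be checked with a small sketch like the one below, meant to be run on a host machine with IEEE-754 floats. Both helper names are hypothetical, and float_prev assumes a positive, finite input.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rounds v to the nearest 32-bit float and converts it
 * back to an integer. Only valid while the rounded value fits in uint32_t. */
uint32_t round_to_float(uint32_t v)
{
    return (uint32_t)(float)v;
}

/* Hypothetical helper: next representable float below f.
 * For positive finite IEEE-754 floats, decrementing the bit pattern
 * by one steps to the previous representable value. */
float float_prev(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits -= 1;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Here round_to_float(813984763) gives 813984768, the value reported in the question, and float_prev of that value converts to 813984704, so the exact result 813984763 falls between two representable floats.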
Upvotes: 1