Reputation: 30765
I have a co-processor which does not have floating-point support. I tried to use 32-bit fixed point, but it is unable to handle very small numbers. My numbers range from 1 down to 1e-18. One option is floating-point emulation, but it is too slow. Can it be made faster in this case, where we know the numbers won't be greater than 1 or smaller than 1e-18? Or is there a way to make fixed point work on very small numbers?
Upvotes: 1
Views: 1028
Reputation: 222660
It is not possible for a 32-bit fixed-point encoding to represent numbers from 10^-18 to 1. This is immediately obvious from the fact that the span from 10^-18 to 1 is a ratio of 10^18, but the non-zero encodings of a 32-bit integer span a ratio of less than 2^32, which is much less than 10^18. Therefore, no choice of scale for the fixed-point encoding will provide the desired span.
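For concreteness: with a scale fine enough to make 10^-18 a non-zero encoding (one unit = 10^-18), the largest value a 32-bit word can reach is about 2^32 * 10^-18 ≈ 4.3e-9; with a scale coarse enough to reach 1, everything below about 2^-32 ≈ 2.3e-10 collapses to zero.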
So a 32-bit fixed-point encoding will not work, and you must use some other technique.
In some applications, it may be suitable to use multiple fixed-point encodings. That is, various input values would be encoded with a fixed-point encoding but each with a scale suitable to it, and intermediate values and the outputs would also have customized scales. Obviously, this is possible only if suitable scales can be determined at design time. Otherwise, you should abandon 32-bit fixed-point encodings and consider alternatives.
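For illustration, here is a minimal sketch of that idea, with made-up names and scales chosen at design time (a value known to stay below 1e-15 simply gets a much finer scale than one that may reach 1):

#include <stdint.h>

/* Sketch only: x is known at design time to stay in [0, 1),
   y is known to stay in [0, 1e-15). Each gets its own scale. */
typedef uint32_t x_fixed;   /* real value = raw * 2^-32 */
typedef uint32_t y_fixed;   /* real value = raw * 2^-80; 1e-15 * 2^80 still fits in 32 bits */

/* The product's scale is also fixed at design time:
   2^-32 * 2^-80 = 2^-112; keeping the high 32 bits of the
   64-bit product leaves the result with scale 2^-80. */
static y_fixed mul_xy(x_fixed a, y_fixed b)
{
    return (y_fixed)(((uint64_t)a * b) >> 32);
}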
Upvotes: 2
Reputation: 41180
Use 64-bit fixed point and be done with it.
Compared with 32-bit fixed point it will be four times slower for multiplication, but it will still be far more efficient than float emulation.
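For instance, a Q2.62 format (real value = raw * 2^-62) covers the stated range: 1.0 encodes as 1 << 62, and 1e-18 still maps to a small non-zero integer, though with only a couple of significant bits left. A possible multiplication, assuming the compiler offers a 128-bit integer type for the intermediate product:

#include <stdint.h>

/* Sketch of a Q2.62 fixed-point type: real value = raw * 2^-62. */
typedef int64_t q62_t;

#define Q62_ONE ((q62_t)1 << 62)

/* Multiply two Q2.62 values; the 128-bit intermediate relies on the
   GCC/Clang __int128 extension (otherwise split the operands and do
   the wide multiplication by hand). */
static q62_t q62_mul(q62_t a, q62_t b)
{
    return (q62_t)(((__int128)a * b) >> 62);
}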
Upvotes: 1
Reputation: 62048
Will simplified 24-bit floating point be fast enough and accurate enough?
#include <stdio.h>
#include <limits.h>

/* Pick an unsigned type that is at least 32 bits wide. */
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned myfloat;
#else
typedef unsigned long myfloat;
#endif

#define MF_EXP_BIAS 0x80

/* Layout: bits 16 and up hold the biased exponent, bits 0..15 hold the mantissa. */

myfloat mfadd(myfloat a, myfloat b)
{
    unsigned ea = a >> 16, eb = b >> 16;

    if (ea > eb)
    {
        /* Align b's mantissa to a's exponent, then add. */
        a &= 0xFFFF;
        b = (b & 0xFFFF) >> (ea - eb);
        if ((a += b) > 0xFFFF)
            a >>= 1, ++ea;          /* renormalize on mantissa overflow */
        return a | ((myfloat)ea << 16);
    }
    else if (eb > ea)
    {
        /* Same as above with the operands swapped. */
        b &= 0xFFFF;
        a = (a & 0xFFFF) >> (eb - ea);
        if ((b += a) > 0xFFFF)
            b >>= 1, ++eb;
        return b | ((myfloat)eb << 16);
    }
    else
    {
        /* Equal exponents: add the mantissas, shift right once and bump the exponent. */
        return (((a & 0xFFFF) + (b & 0xFFFF)) >> 1) | ((myfloat)++ea << 16);
    }
}

myfloat mfmul(myfloat a, myfloat b)
{
    /* Add the exponents (removing one bias) and take the high 16 bits of the product. */
    unsigned ea = a >> 16, eb = b >> 16, e = ea + eb - MF_EXP_BIAS;
    myfloat p = ((a & 0xFFFF) * (b & 0xFFFF)) >> 16;
    return p | ((myfloat)e << 16);
}

myfloat double2mf(double x)
{
    myfloat f;
    unsigned e = MF_EXP_BIAS + 16;

    if (x <= 0)
        return 0;

    /* Scale the value into [0x8000, 0x10000) while tracking the exponent. */
    while (x < 0x8000)
        x *= 2, --e;
    while (x >= 0x10000)
        x /= 2, ++e;

    f = x;
    return f | ((myfloat)e << 16);
}

double mf2double(myfloat f)
{
    double x;
    unsigned e = (f >> 16) - 16;

    if ((f & 0xFFFF) == 0)
        return 0;

    /* Undo the bias by repeated doubling/halving. */
    x = f & 0xFFFF;
    while (e > MF_EXP_BIAS)
        x *= 2, --e;
    while (e < MF_EXP_BIAS)
        x /= 2, ++e;

    return x;
}

int main(void)
{
    double testConvData[] = { 1e-18, .25, 0.3333333, .5, 1, 2, 3.141593, 1e18 };
    unsigned i;

    /* Round-trip conversion test, then a few arithmetic checks. */
    for (i = 0; i < sizeof(testConvData) / sizeof(testConvData[0]); i++)
        printf("%e -> 0x%06lX -> %e\n",
               testConvData[i],
               (unsigned long)double2mf(testConvData[i]),
               mf2double(double2mf(testConvData[i])));

    printf("300 * 5 = %e\n", mf2double(mfmul(double2mf(300),double2mf(5))));
    printf("500 + 3 = %e\n", mf2double(mfadd(double2mf(500),double2mf(3))));
    printf("1e18 * 1e-18 = %e\n", mf2double(mfmul(double2mf(1e18),double2mf(1e-18))));
    printf("1e-18 + 2e-18 = %e\n", mf2double(mfadd(double2mf(1e-18),double2mf(2e-18))));
    printf("1e-16 + 1e-18 = %e\n", mf2double(mfadd(double2mf(1e-16),double2mf(1e-18))));
    return 0;
}
Output (ideone):
1.000000e-18 -> 0x459392 -> 9.999753e-19
2.500000e-01 -> 0x7F8000 -> 2.500000e-01
3.333333e-01 -> 0x7FAAAA -> 3.333282e-01
5.000000e-01 -> 0x808000 -> 5.000000e-01
1.000000e+00 -> 0x818000 -> 1.000000e+00
2.000000e+00 -> 0x828000 -> 2.000000e+00
3.141593e+00 -> 0x82C90F -> 3.141541e+00
1.000000e+18 -> 0xBCDE0B -> 9.999926e+17
300 * 5 = 1.500000e+03
500 + 3 = 5.030000e+02
1e18 * 1e-18 = 9.999390e-01
1e-18 + 2e-18 = 2.999926e-18
1e-16 + 1e-18 = 1.009985e-16
Subtraction is left as an exercise. Ditto for better conversion routines.
Upvotes: 1
Reputation: 20027
In embedded systems I'd suggest using a 16+32, 16+16, 8+16 or 8+24-bit redundant floating-point representation, where each number is simply M * 2^exp.
In this case you can choose to represent zero with both M=0 and exp=0. There are 16-32 representations for each power of 2, which mainly makes comparison a bit harder than usual. Also, normalization can be postponed, e.g. until after a subtraction.
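A minimal sketch of what such a representation might look like, assuming a 32-bit mantissa with a 16-bit exponent (the field widths and function names are illustrative):

#include <stdint.h>

/* value = m * 2^e; the mantissa is not required to be normalized,
   so the same value has many encodings ("redundant"). */
typedef struct { int32_t m; int16_t e; } rfloat;

/* Multiply: keep the high 32 bits of the 64-bit product and fold the
   discarded 32 bits into the exponent. Precision is best when the
   operands are roughly normalized. */
static rfloat rf_mul(rfloat a, rfloat b)
{
    int64_t p = (int64_t)a.m * b.m;
    rfloat r = { (int32_t)(p >> 32), (int16_t)(a.e + b.e + 32) };
    return r;
}

/* Normalization can be postponed; call it only when needed,
   e.g. before a comparison or after a subtraction. */
static rfloat rf_normalize(rfloat x)
{
    if (x.m == 0) { x.e = 0; return x; }
    while (x.m > -0x40000000 && x.m < 0x40000000) {
        x.m *= 2;
        --x.e;
    }
    return x;
}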
Upvotes: 0