MetallicPriest

Reputation: 30765

Floating point emulation or Fixed Point for numbers in a given range

I have a co-processor which does not have floating point support. I tried to use 32-bit fixed point, but it cannot handle very small numbers. My numbers range from 1 down to 1e-18. One option is floating-point emulation, but it is too slow. Can we make it faster in this case, where we know the numbers won't be greater than 1 or smaller than 1e-18? Or is there a way to make fixed point work on very small numbers?

Upvotes: 1

Views: 1028

Answers (4)

Eric Postpischil

Reputation: 222660

It is not possible for a 32-bit fixed-point encoding to represent numbers from 10^-18 to 1. This is immediately obvious from the fact that the span from 10^-18 to 1 is a ratio of 10^18, but the non-zero encodings of a 32-bit integer span a ratio of less than 2^32, which is much less than 10^18. Therefore, no choice of scale for the fixed-point encoding will provide the desired span.

So a 32-bit fixed-point encoding will not work, and you must use some other technique.

In some applications, it may be suitable to use multiple fixed-point encodings. That is, various input values would be encoded with a fixed-point encoding but each with a scale suitable to it, and intermediate values and the outputs would also have customized scales. Obviously, this is possible only if suitable scales can be determined at design time. Otherwise, you should abandon 32-bit fixed-point encodings and consider alternatives.
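
For illustration only, here is a minimal sketch of what per-variable scales might look like; the variable names, ranges and scales below are hypothetical, not something taken from the question:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical example: one input known at design time to lie in
   [1e-18, 1e-15], and a dimensionless gain in [0, 1). */
#define READING_SCALE 1e-20          /* reading_raw = reading / 1e-20 */
/* gain_raw is stored in Q0.32, i.e. gain * 2^32 */

int main(void)
{
    uint32_t reading_raw = 50000u;      /* represents 5e-16 */
    uint32_t gain_raw    = 0x40000000u; /* represents 0.25  */

    /* Multiplying by a Q0.32 value and shifting right by 32 keeps the
       reading's scale, so the result is still in units of 1e-20. */
    uint32_t scaled_raw = (uint32_t)(((uint64_t)reading_raw * gain_raw) >> 32);

    /* Convert to double only to print a human-readable check. */
    printf("%e\n", scaled_raw * READING_SCALE);
    return 0;
}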

Upvotes: 2

Doug Currie

Reputation: 41180

Use 64-bit fixed point and be done with it.

Compared with 32-bit fixed point it will be four times slower for multiplication (a 64x64 product decomposes into four 32x32 multiplies), but it will still be far more efficient than float emulation.
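
For example, here is a rough sketch (not from the answer) of such a multiply in a hypothetical Q2.62 layout, built from four 32x32 multiplies since the target has no wider multiplier; overflow handling is omitted because the question's values stay at or below 1:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical Q2.62 format: value = raw * 2^-62. The smallest non-zero
   value is about 2.2e-19, so 1e-18..1 fits, though numbers near 1e-18
   keep only a couple of significant bits. */
typedef uint64_t q2_62;
#define Q2_62_ONE ((uint64_t)1 << 62)

/* 64x64 -> 128-bit product from four 32x32 -> 64 multiplies, then shifted
   right by 62; the four partial products are where the roughly 4x cost
   over a 32-bit fixed-point multiply comes from. */
static q2_62 q2_62_mul(q2_62 a, q2_62 b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t hi_hi = a_hi * b_hi;

    /* Gather the middle partial products and propagate the carry. */
    uint64_t mid  = (lo_lo >> 32) + (uint32_t)lo_hi + (uint32_t)hi_lo;
    uint64_t high = hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (mid >> 32);
    uint64_t low  = (mid << 32) | (uint32_t)lo_lo;

    return (low >> 62) | (high << 2);   /* (high:low) >> 62 */
}

/* Host-side helpers for the demo only; the target would work on raw values. */
static q2_62 q2_62_from_double(double x) { return (q2_62)(x * (double)Q2_62_ONE); }
static double q2_62_to_double(q2_62 x)   { return (double)x / (double)Q2_62_ONE; }

int main(void)
{
    q2_62 a = q2_62_from_double(0.5);
    q2_62 b = q2_62_from_double(0.25);
    printf("0.5 * 0.25 = %e\n", q2_62_to_double(q2_62_mul(a, b)));
    return 0;
}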

Upvotes: 1

Alexey Frunze

Reputation: 62048

Will simplified 24-bit floating point be fast enough and accurate enough?

#include <stdio.h>
#include <limits.h>

#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned myfloat;
#else
typedef unsigned long myfloat;
#endif

#define MF_EXP_BIAS 0x80
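/* Layout: bits 0..15 hold a mantissa, kept in [0x8000, 0xFFFF] by the
   conversion and addition routines; bits 16..23 hold a biased exponent.
   The value is mantissa * 2^(exponent - MF_EXP_BIAS - 16). Zero is
   encoded as 0. */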

myfloat mfadd(myfloat a, myfloat b)
{
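  /* Align the mantissa of the smaller operand to the larger exponent,
     add, and renormalize if the 16-bit mantissa overflows. Assumes the
     exponent difference is small enough for the shifts to be defined. */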
  unsigned ea = a >> 16, eb = b >> 16;
  if (ea > eb)
  {
    a &= 0xFFFF;
    b = (b & 0xFFFF) >> (ea - eb);
    if ((a += b) > 0xFFFF)
      a >>= 1, ++ea;
    return a | ((myfloat)ea << 16);
  }
  else if (eb > ea)
  {
    b &= 0xFFFF;
    a = (a & 0xFFFF) >> (eb - ea);
    if ((b += a) > 0xFFFF)
      b >>= 1, ++eb;
    return b | ((myfloat)eb << 16);
  }
  else
  {
    return (((a & 0xFFFF) + (b & 0xFFFF)) >> 1) | ((myfloat)++ea << 16);
  }
}

myfloat mfmul(myfloat a, myfloat b)
{
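  /* Add the exponents (removing one bias), multiply the mantissas and
     keep the top 16 bits; the result is not renormalized if the product
     drops below 0x8000, which costs a little accuracy. */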
  unsigned ea = a >> 16, eb = b >> 16, e = ea + eb - MF_EXP_BIAS;
  myfloat p = ((a & 0xFFFF) * (b & 0xFFFF)) >> 16;
  return p | ((myfloat)e << 16);
}

myfloat double2mf(double x)
{
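  /* Scale x into [0x8000, 0x10000) by repeated doubling/halving while
     adjusting the exponent; anything <= 0 maps to the zero encoding. */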
  myfloat f;
  unsigned e = MF_EXP_BIAS + 16;
  if (x <= 0)
    return 0;
  while (x < 0x8000)
    x *= 2, --e;
  while (x >= 0x10000)
    x /= 2, ++e;
  f = x;
  return f | ((myfloat)e << 16);
}

double mf2double(myfloat f)
{
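  /* Inverse of double2mf: undo the 16-bit mantissa scaling and the bias. */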
  double x;
  unsigned e = (f >> 16) - 16;
  if ((f & 0xFFFF) == 0)
    return 0;
  x = f & 0xFFFF;
  while (e > MF_EXP_BIAS)
    x *= 2, --e;
  while (e < MF_EXP_BIAS)
    x /= 2, ++e;
  return x;
}

int main(void)
{
  double testConvData[] = { 1e-18, .25, 0.3333333, .5, 1, 2, 3.141593, 1e18 };
  unsigned i;
  for (i = 0; i < sizeof(testConvData) / sizeof(testConvData[0]); i++)
    printf("%e -> 0x%06lX -> %e\n",
           testConvData[i],
           (unsigned long)double2mf(testConvData[i]),
           mf2double(double2mf(testConvData[i])));

  printf("300 * 5 = %e\n", mf2double(mfmul(double2mf(300),double2mf(5))));
  printf("500 + 3 = %e\n", mf2double(mfadd(double2mf(500),double2mf(3))));
  printf("1e18 * 1e-18 = %e\n", mf2double(mfmul(double2mf(1e18),double2mf(1e-18))));
  printf("1e-18 + 2e-18 = %e\n", mf2double(mfadd(double2mf(1e-18),double2mf(2e-18))));
  printf("1e-16 + 1e-18 = %e\n", mf2double(mfadd(double2mf(1e-16),double2mf(1e-18))));

  return 0;
}

Output (ideone):

1.000000e-18 -> 0x459392 -> 9.999753e-19
2.500000e-01 -> 0x7F8000 -> 2.500000e-01
3.333333e-01 -> 0x7FAAAA -> 3.333282e-01
5.000000e-01 -> 0x808000 -> 5.000000e-01
1.000000e+00 -> 0x818000 -> 1.000000e+00
2.000000e+00 -> 0x828000 -> 2.000000e+00
3.141593e+00 -> 0x82C90F -> 3.141541e+00
1.000000e+18 -> 0xBCDE0B -> 9.999926e+17
300 * 5 = 1.500000e+03
500 + 3 = 5.030000e+02
1e18 * 1e-18 = 9.999390e-01
1e-18 + 2e-18 = 2.999926e-18
1e-16 + 1e-18 = 1.009985e-16

Subtraction is left as an exercise. Ditto for better conversion routines.

Upvotes: 1

Aki Suihkonen

Reputation: 20027

In embedded systems I'd suggest using a 16+32, 16+16, 8+16 or 8+24 bit redundant floating-point representation, where each number is simply M * 2^exp.

In this case you can choose to represent zero with both M=0 and exp=0. There are 16-32 representations for each power of two (one per possible mantissa position) -- which mainly makes comparison a bit harder than usual. One can also postpone normalization, e.g. until after a subtraction.
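
A minimal sketch of the idea with a 16+16 layout; the struct layout, the unsigned mantissa and the helper names are my own choices for illustration, not something from the answer:

#include <stdint.h>
#include <stdio.h>

/* Redundant 16+16 representation: value = m * 2^exp.
   Zero is any encoding with m == 0. */
typedef struct { int16_t exp; uint16_t m; } rfloat;

/* Multiply the mantissas, add the exponents, and shift the 32-bit product
   back into 16 bits. No normalization is required here, but unnormalized
   (small) mantissas lose precision, which is why normalization is still
   wanted eventually -- it can just be postponed. */
static rfloat rf_mul(rfloat a, rfloat b)
{
    rfloat r;
    uint32_t p = (uint32_t)a.m * b.m;
    r.m = (uint16_t)(p >> 16);
    r.exp = (int16_t)(a.exp + b.exp + 16);
    return r;
}

static double rf_to_double(rfloat a)  /* host-side check only */
{
    double x = a.m;
    int e = a.exp;
    while (e > 0) { x *= 2; --e; }
    while (e < 0) { x /= 2; ++e; }
    return x;
}

int main(void)
{
    rfloat half  = { -16, 0x8000 };  /* 0x8000 * 2^-16 = 0.5     */
    rfloat third = { -16, 0x5555 };  /* 0x5555 * 2^-16 ~= 0.3333 */
    printf("0.5 * (1/3) ~= %e\n", rf_to_double(rf_mul(half, third)));
    return 0;
}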

Upvotes: 0
