user1410966

Reputation: 19

Bit shifting for fixed point arithmetic on float numbers in C

I wrote the following test code to check fixed point arithmetic and bit shifting.

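#include <stdio.h>

int main(void){
    float x = 2;
    float y = 3;
    float z = 1;
    /* view the floats' storage as unsigned integers */
    unsigned int * px = (unsigned int *) (& x);
    unsigned int * py = (unsigned int *) (& y);
    unsigned int * pz = (unsigned int *) (& z);
    /* shift the raw bits left by one */
    *px <<= 1;
    *py <<= 1;
    *pz <<= 1;
    /* integer addition of the shifted bit patterns */
    *pz =*px + *py;
    /* shift back */
    *px >>= 1;
    *py >>= 1;
    *pz >>= 1;
    printf("%f %f %f\n",x,y,z);
    return 0;
}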

The result is 2.000000 3.000000 0.000000

Why is the last number 0? I was expecting to see 5.000000. I want to use some kind of fixed point arithmetic to avoid floating point numbers in an image processing application. What is the best/easiest/most efficient way to turn my floating point arrays into integers? Is the above "tricking the compiler" a robust workaround? Any suggestions?

Upvotes: 1

Views: 7409

Answers (4)

JeremyP

Reputation: 86651

It's probable that your compiler uses the IEEE 754 format for floats, which, in bit terms, looks like this:

SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF
^ bit 31                       ^ bit 0

S is the sign bit; S = 1 implies the number is negative.

E bits are the exponent. There are 8 exponent bits giving a range of 0 - 255 but the exponent is biased - you need to subtract 127 to get the true exponent.

F bits are the fraction part, however, you need to imagine an invisible 1 on the front so the fraction is always 1.something and all you see are the binary fraction digits.

The number 2 is 1 x 2^1 = 1 x 2^(128 - 127), so it is encoded as

01000000000000000000000000000000

So if you use a bit shift to shift it left you get

10000000000000000000000000000000

which by convention is -0 in IEEE 754, so rather than multiplying your number by 2, your shift has made it zero.

The number 3 is [1 + 0.5] x 2^(128 - 127)

which is represented as

01000000010000000000000000000000

Shifting that left gives you

10000000100000000000000000000000

which is -1 x 2^-126, a very small number.

You can do the same for z, but you probably get the idea that shifting just screws up floating point numbers.

Upvotes: 2

glglgl

Reputation: 91017

What you are doing are cruelties to the numbers.

First, you assign values to float variables. How they are stored is system dependent, but normally the IEEE 754 format is used. So your variables internally look like

x = 2.0 = 1 * 2^1   : sign = 0, mantissa = 1,   exponent = 1 -> 0 10000000 00000000000000000000000 = 0x40000000
y = 3.0 = 1.5 * 2^1 : sign = 0, mantissa = 1.5, exponent = 1 -> 0 10000000 10000000000000000000000 = 0x40400000
z = 1.0 = 1 * 2^0   : sign = 0, mantissa = 1,   exponent = 0 -> 0 01111111 00000000000000000000000 = 0x3F800000

If you do bit shifting operations on these numbers, you mix up the borders between sign, exponent and mantissa, and so anything can, may and will happen.

In your case:

  • your 2.0 becomes 0x80000000, resulting in -0.0,
  • your 3.0 becomes 0x80800000, resulting in -1.1754943508222875e-38,
  • your 1.0 becomes 0x7F000000, resulting in 1.7014118346046923e+38.

The latter you lose, because z is then overwritten with *px + *py. That addition is done on the unsigned integers, not on the floats: with a 32-bit unsigned int, 0x80000000 + 0x80800000 wraps around to 0x00800000. After >>ing it by 1 you get 0x00400000, a denormal of about 5.9e-39, which printf("%f") shows as 0.000000.

The bottom line is that you cannot do bit shifting on floats and expect a reliable result.

I would consider converting them to integer or other fixed-point on the ARM and sending them over the line as they are.

Upvotes: 2

Tobias Schlegel

Reputation: 3970

Fixed point doesn't work that way. What you want to do is something like this:

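#include <stdio.h>

int main(void){
    // initializing fixed point numbers with 8 fractional bits
    unsigned int x = 2 << 8;
    unsigned int y = 3 << 8;
    unsigned int z = 1 << 8;

    // adding two numbers
    unsigned int a = x + y;

    // multiplying two numbers with fixed point adjustment
    unsigned int b = (x * y) >> 8;

    // use numbers: shift right by 8 to get the integer parts (prints "5 6")
    printf("%u %u\n", a >> 8, b >> 8);
    (void) z; // z is not used further in this example
    return 0;
}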

Upvotes: 1

osgx

Reputation: 94185

If you want to use fixed point, don't use the types 'float' or 'double', because they have an internal structure. Floats and doubles have a specific bit for the sign, some bits for the exponent and some for the mantissa (take a look at the color image here), so they are inherently floating point.

You should either program fixed point by hand storing data in integer type, or use some fixed-point library (or language extension).

There is a description of the fixed-point extensions implemented in GCC: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html

There is some MACRO-based manual implementation of fixed-point for C: http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C

Upvotes: 3
