FreeAntiVirus

Reputation: 159

Multiply float by a number using bitwise operators

I have this function that takes in the bits of a float (f) as a uint32_t. It should use bit operations and + to calculate f * 2048 and should return the bits of this value as a uint32_t.

If the result is too large to be represented as a float, +inf or -inf should be returned; and if f is +0, -0, +inf, -inf, or NaN, it should be returned unchanged.

uint32_t float_2048(uint32_t f) {
    uint32_t a = (f << 1) ;

    int result = a << 10;

    return result;
}

This is what I have so far but if I give it the value '1' it returns 0 instead of 2048. How do I fix this?

Some example inputs and outputs:

./float_2048 1
2048
./float_2048 3.14159265
6433.98193
./float_2048 -2.718281828e-20
-5.56704133e-17
./float_2048 1e38
inf

Upvotes: 1

Views: 1062

Answers (2)

chux

Reputation: 153517

Handling every float value takes more code.

Run a few checks first so the code can assume the expected float size, matching endianness, and IEEE encoding. C does not require float to be 32 bits wide, to share its endianness with the integer types, or to use the binary32 encoding, even though all of these are common.
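One way such checks could look, as a minimal sketch. The helper names float_to_bits, bits_to_float and float_layout_ok are mine, not part of the answer, and memcpy is used here instead of a union. The run-time check relies on 1.0f being encoded as 0x3F800000 and -2.0f as 0xC0000000 in binary32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Compile-time check: a float must occupy exactly as many bytes as uint32_t.
static_assert(sizeof(float) == sizeof(uint32_t), "Unexpected float size");

// Hypothetical helper: copy the bytes of a float into a uint32_t.
static uint32_t float_to_bits(float x) {
  uint32_t u;
  memcpy(&u, &x, sizeof u);
  return u;
}

// Hypothetical helper: copy the bytes of a uint32_t into a float.
static float bits_to_float(uint32_t u) {
  float x;
  memcpy(&x, &u, sizeof x);
  return x;
}

// Run-time check: returns 1 when two known values have their binary32
// bit patterns, which also implies float and integer byte order agree.
static int float_layout_ok(void) {
  return float_to_bits(1.0f) == 0x3F800000u &&
         float_to_bits(-2.0f) == 0xC0000000u;
}
```

Calling float_layout_ok() once at start-up (for example inside an assert) would catch a platform where the bit manipulation below is not valid.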

Extract the biased exponent and look for its min and max value.

The maximum value signifies NaN or infinity.

The minimum value signifies a subnormal or zero, which needs special handling: the significand is shifted, and if the result is now a normal float, the exponent is re-encoded.

Biased exponents in between simply need an increment (by 11) and a test for exceeding FLT_MAX's exponent.

Tested successfully for all float values.

#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint32_t) == sizeof(float), "Unexpected float size");

#define IEEE_MASK_BIASED_EXPO     0x7F800000u
#define IEEE_MASK_BIASED_EXPO_LSB 0x00800000u
#define IEEE_MASK_SIGNIFICAND     0x007FFFFFu
#define IEEE_SIGNIFICAND_MAX      0x00FFFFFFu
#define IEEE_INFINITY             0x7F800000u

// Scale value by 2048
uint32_t float_2048(uint32_t f) {
  uint32_t expo = f & IEEE_MASK_BIASED_EXPO;
  // Test for infinity or NAN
  if (expo == IEEE_MASK_BIASED_EXPO) {
    return f;
  }
  // Sub-normal and zero test
  if (expo == 0) {
    uint64_t sig = f & IEEE_MASK_SIGNIFICAND;
    sig <<= 11; // *= 2048;
    // If value now a normal one
    if (sig > IEEE_MASK_SIGNIFICAND) {
      expo += IEEE_MASK_BIASED_EXPO_LSB;
      while (sig > IEEE_SIGNIFICAND_MAX) {
        sig >>= 1;
        expo += IEEE_MASK_BIASED_EXPO_LSB;
      }
      f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
    }
    f = (f & ~IEEE_MASK_SIGNIFICAND) | (sig & IEEE_MASK_SIGNIFICAND);
  } else {
    expo += 11 * IEEE_MASK_BIASED_EXPO_LSB; // *= 2048;
    if (expo >= IEEE_MASK_BIASED_EXPO) {
      f &= ~(IEEE_MASK_BIASED_EXPO | IEEE_MASK_SIGNIFICAND);
      f |= IEEE_INFINITY;
    } else {
      f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
    }
  }
  return f;
}
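To see the exponent tests in isolation, here is a small sketch that only classifies a bit pattern by its biased exponent, assuming binary32. The enum and the name classify_bits are illustrative and not part of the function above.

```c
#include <assert.h>
#include <stdint.h>

// Illustrative names, assuming the IEEE binary32 layout.
typedef enum {
  BITS_ZERO, BITS_SUBNORMAL, BITS_NORMAL, BITS_INF, BITS_NAN
} bits_class;

static bits_class classify_bits(uint32_t f) {
  uint32_t expo = (f >> 23) & 0xFFu;  // biased exponent, 0..255
  uint32_t sig = f & 0x007FFFFFu;     // 23 stored significand bits
  if (expo == 0xFFu) {                // maximum exponent: infinity or NaN
    return sig ? BITS_NAN : BITS_INF;
  }
  if (expo == 0) {                    // minimum exponent: zero or subnormal
    return sig ? BITS_SUBNORMAL : BITS_ZERO;
  }
  return BITS_NORMAL;                 // everything in between
}

// A few patterns and the class each one falls into.
static void classify_examples(void) {
  assert(classify_bits(0x80000000u) == BITS_ZERO);       // -0
  assert(classify_bits(0x00000001u) == BITS_SUBNORMAL);  // smallest subnormal
  assert(classify_bits(0x00800000u) == BITS_NORMAL);     // smallest normal
  assert(classify_bits(0x7F800000u) == BITS_INF);        // +inf
  assert(classify_bits(0x7FC00000u) == BITS_NAN);        // a quiet NaN
}
```

The sign bit is ignored on purpose, since every case in float_2048 is decided by the exponent and significand alone.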

Test code.

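#include <float.h>   // FLT_TRUE_MIN, FLT_MIN, FLT_MAX
#include <math.h>    // isnan
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // memcmp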

typedef union {
  uint32_t u32;
  float f;
} fu32;

int main(void) {
  // Lightweight test to see if endian matches and IEEE encoding
  assert((fu32) {.u32 = 0x87654321}.f == -1.72477726182e-34f);
  float f[] = {0, FLT_TRUE_MIN, FLT_MIN, 1, FLT_MAX};
  size_t n = sizeof f/sizeof f[0];
  for (size_t i = 0; i<n; i++) {
    fu32 x = { .f = f[i] };
    float y0 = x.f * 2048.0f;
    fu32 y1 = { .u32 = float_2048(x.u32) };
    if (memcmp(&y0, &y1.f, sizeof y0)) {
      printf("%.9g %.9g\n", y0, y1.f);
    }
  }
  fu32 x = { .u32 = 0 };
  do {
    fu32 y0 = { .f = isnan(x.f) ? x.f : x.f * 2048.0f };
    fu32 y1 = { .u32 = float_2048(x.u32) };
    if (memcmp(&y0.f, &y1.f, sizeof y0)) {
      printf("%.9g %.9g\n", y0.f, y1.f);
      printf("%08lx %08lx %08lx\n", (unsigned long) x.u32,
          (unsigned long) y0.u32, (unsigned long) y1.u32);
      break;
    }
    x.u32++;
  } while (x.u32 != 0);
  puts("Done");
}

Upvotes: 1

Adrian Mole

Reputation: 51845

As mentioned in the comments, to multiply a floating-point number by a power of 2 (assuming, as is likely, that it is represented in IEEE-754 format), we can just add that power to the (binary) exponent part of the representation.
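As a quick illustration of that idea (assuming binary32; the helper names sign_of, exponent_of and fraction_of are made up for this sketch), the bit patterns of 1.0f and 2048.0f differ only in the exponent field, and by exactly 11:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helpers that split a binary32 bit pattern into its fields.
static uint32_t sign_of(uint32_t f)     { return f >> 31; }
static uint32_t exponent_of(uint32_t f) { return (f >> 23) & 0xFFu; }
static uint32_t fraction_of(uint32_t f) { return f & 0x007FFFFFu; }

// 1.0f    is 0x3F800000: sign 0, biased exponent 127, fraction 0
// 2048.0f is 0x45000000: sign 0, biased exponent 138, fraction 0
static void field_demo(void) {
  const uint32_t one = 0x3F800000u;
  const uint32_t big = 0x45000000u;
  assert(sign_of(one) == sign_of(big));
  assert(fraction_of(one) == fraction_of(big));
  assert(exponent_of(big) - exponent_of(one) == 11);
}
```

This is also why shifting the whole 32-bit pattern left does not work: the shift pushes the sign and exponent bits out of the word instead of changing the exponent's value.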

For a single-precision (32-bit) float value, that exponent is stored in bits 30-23 and the following code shows how to extract those, add the required value (11, because 2048 = 2^11), then replace the exponent bits with that modified value.

uint32_t fmul2048(uint32_t f)
{
    #define EXPONENT 0x7F800000u
    #define SIGN_BIT 0x80000000u
    uint32_t expon = (f & EXPONENT) >> 23; // Get exponent value
    f &= ~EXPONENT;  // Remove old exponent
    expon += 11;     // Adding 11 to exponent multiplies by 2^11 (= 2048);
    if (expon > 254) return EXPONENT | (f & SIGN_BIT); // Too big: return +/- Inf
    f |= (expon << 23); // Insert modified exponent
    return f;
}

There will, no doubt, be some "bit trickery" that can be applied to make the code smaller and/or more efficient, but I have avoided doing so in order to keep the code clear. I have also included one error check (for a too-large exponent): if that test fails, the code returns the standard representation for +/- Infinity (all exponent bits set to 1, keeping the original sign). (I leave other error-checking as an "exercise for the reader".)
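A few spot checks against the examples from the question, as a sketch. fmul2048 is copied from above so the snippet compiles on its own; the wrapper mul2048 and the function spot_checks are my own names, and the float/uint32_t conversion assumes binary32 with matching byte order.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "Unexpected float size");

// Copied from the answer above so that this snippet is self-contained.
static uint32_t fmul2048(uint32_t f)
{
    #define EXPONENT 0x7F800000u
    #define SIGN_BIT 0x80000000u
    uint32_t expon = (f & EXPONENT) >> 23; // Get exponent value
    f &= ~EXPONENT;  // Remove old exponent
    expon += 11;     // Adding 11 to exponent multiplies by 2^11 (= 2048);
    if (expon > 254) return EXPONENT | (f & SIGN_BIT); // Too big: return +/- Inf
    f |= (expon << 23); // Insert modified exponent
    return f;
}

// Hypothetical wrapper: apply fmul2048 to a float value through its bits.
static float mul2048(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u = fmul2048(u);
    memcpy(&x, &u, sizeof x);
    return x;
}

static void spot_checks(void)
{
    assert(mul2048(1.0f) == 2048.0f);
    assert(mul2048(3.14159265f) == 3.14159265f * 2048.0f);
    assert(mul2048(1e38f) == INFINITY);   // overflow is reported as +inf
    // One of the cases left to the reader: +0 is not returned unchanged.
    // Its exponent field becomes 11, giving the pattern 0x05800000.
    assert(fmul2048(0x00000000u) == 0x05800000u);
}
```

The last check shows why the remaining special cases matter for the original task: zeros, subnormals and NaNs all need their own branch before the exponent is adjusted.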

Upvotes: 2
