Reputation: 159
I have this function that takes in the bits of a float (f) as a uint32_t. It should use bit operations and + to calculate f * 2048 and should return the bits of this value as a uint32_t.
If the result is too large to be represented as a float, +inf or -inf should be returned; and if f is +0, -0, +inf, -inf, or NaN, it should be returned unchanged.
uint32_t float_2048(uint32_t f) {
    uint32_t a = (f << 1) ;
    int result = a << 10;
    return result;
}
This is what I have so far but if I give it the value '1' it returns 0 instead of 2048. How do I fix this?
Some example inputs and outputs:
./float_2048 1
2048
./float_2048 3.14159265
6433.98193
./float_2048 -2.718281828e-20
-5.56704133e-17
./float_2048 1e38
inf
Upvotes: 1
Views: 1062
Reputation: 153517
Handling every float value takes more code.
Do some tests so the code can assume the expected float size, matching endianness, and IEEE encoding. C does not require float to be 32 bits wide, to share the byte order of an integer type, or to use the binary32 encoding, even though all three are common.
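One way to sketch those checks, under the assumption of a C11 compiler: the <float.h> macros describe the shape of float at compile time, and a memcpy probe compares two known values against their binary32 bit patterns at run time. The function name float_layout_matches_binary32 is hypothetical and used only for illustration.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Compile-time checks: float has the size and shape of binary32. */
static_assert(sizeof(float) == sizeof(uint32_t), "float is not 32 bits wide");
static_assert(FLT_RADIX == 2, "float is not a binary format");
static_assert(FLT_MANT_DIG == 24, "unexpected significand width");
static_assert(FLT_MAX_EXP == 128 && FLT_MIN_EXP == -125,
              "unexpected exponent range");

/* Hypothetical helper: returns 1 if known floats have the binary32 bit
   patterns when viewed as uint32_t (encoding and byte order agree),
   0 otherwise. */
int float_layout_matches_binary32(void)
{
    const float probes[] = { 1.0f, -2.0f };
    const uint32_t expected[] = { 0x3F800000u, 0xC0000000u };

    for (size_t i = 0; i < sizeof probes / sizeof probes[0]; i++) {
        uint32_t bits;
        memcpy(&bits, &probes[i], sizeof bits); /* copy the raw bytes */
        if (bits != expected[i]) {
            return 0;
        }
    }
    return 1;
}
```

A caller could refuse to proceed when float_layout_matches_binary32() returns 0. Two probe values are only a light check, not proof of full IEEE conformance.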
Extract the biased exponent and look for its minimum and maximum values.
The maximum value signifies NaN or infinity.
The minimum value signifies sub-normals and zero, which need special handling: the significand needs to be shifted, and if the result is now a normal float, it must be re-encoded.
Biased exponents in between simply need to be increased by 11 (2048 = 2^11) and tested for exceeding FLT_MAX's exponent.
Tested successfully for all float values.
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint32_t) == sizeof(float), "Unexpected float size");

#define IEEE_MASK_BIASED_EXPO 0x7F800000u
#define IEEE_MASK_BIASED_EXPO_LSB 0x00800000u
#define IEEE_MASK_SIGNIFICAND 0x007FFFFFu
#define IEEE_SIGNIFICAND_MAX 0x00FFFFFFu
#define IEEE_INFINITY 0x7F800000u

// Scale value by 2048
uint32_t float_2048(uint32_t f) {
    uint32_t expo = f & IEEE_MASK_BIASED_EXPO;
    // Test for infinity or NAN
    if (expo == IEEE_MASK_BIASED_EXPO) {
        return f;
    }
    // Sub-normal and zero test
    if (expo == 0) {
        uint64_t sig = f & IEEE_MASK_SIGNIFICAND;
        sig <<= 11; // *= 2048;
        // If value now a normal one
        if (sig > IEEE_MASK_SIGNIFICAND) {
            expo += IEEE_MASK_BIASED_EXPO_LSB;
            while (sig > IEEE_SIGNIFICAND_MAX) {
                sig >>= 1;
                expo += IEEE_MASK_BIASED_EXPO_LSB;
            }
            f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
        }
        f = (f & ~IEEE_MASK_SIGNIFICAND) | (sig & IEEE_MASK_SIGNIFICAND);
    } else {
        expo += 11 * IEEE_MASK_BIASED_EXPO_LSB; // *= 2048;
        if (expo >= IEEE_MASK_BIASED_EXPO) {
            f &= ~(IEEE_MASK_BIASED_EXPO | IEEE_MASK_SIGNIFICAND);
            f |= IEEE_INFINITY;
        } else {
            f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
        }
    }
    return f;
}
Test code.
#include <float.h>   // FLT_TRUE_MIN, FLT_MIN, FLT_MAX
#include <math.h>    // isnan
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // memcmp

typedef union {
    uint32_t u32;
    float f;
} fu32;

int main(void) {
    // Lightweight test to see if endian matches and IEEE encoding
    assert((fu32) {.u32 = 0x87654321}.f == -1.72477726182e-34f);

    // Spot-check a few interesting values
    float f[] = {0, FLT_TRUE_MIN, FLT_MIN, 1, FLT_MAX};
    size_t n = sizeof f/sizeof f[0];
    for (size_t i = 0; i<n; i++) {
        fu32 x = { .f = f[i] };
        float y0 = x.f * 2048.0f;
        fu32 y1 = { .u32 = float_2048(x.u32) };
        if (memcmp(&y0, &y1.f, sizeof y0)) {
            printf("%.9g %.9g\n", y0, y1.f);
        }
    }

    // Exhaustive check of every 32-bit pattern; NaN is expected back unchanged
    fu32 x = { .u32 = 0 };
    do {
        fu32 y0 = { .f = isnan(x.f) ? x.f : x.f * 2048.0f };
        fu32 y1 = { .u32 = float_2048(x.u32) };
        if (memcmp(&y0.f, &y1.f, sizeof y0)) {
            printf("%.9g %.9g\n", y0.f, y1.f);
            printf("%08lx %08lx %08lx\n", (unsigned long) x.u32,
                (unsigned long) y0.u32, (unsigned long) y1.u32);
            break;
        }
        x.u32++;
    } while (x.u32 != 0);
    puts("Done");
}
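The union in the test code is one way to view a float's bits. A minimal alternative sketch, assuming float and uint32_t have the same size, copies the bytes with memcpy. The helper names float_to_bits, bits_to_float and shift_attempt are hypothetical; shift_attempt reproduces the question's two shifts so the reported symptom can be checked.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: view the bit pattern of a float as a uint32_t. */
uint32_t float_to_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits); /* copy the object representation */
    return bits;
}

/* Hypothetical helper: build a float from a 32-bit pattern. */
float bits_to_float(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* The question's attempt, reproduced for comparison: it shifts the whole
   pattern (sign, exponent and significand together) left by 11 bits. */
uint32_t shift_attempt(uint32_t f)
{
    return (f << 1) << 10;
}
```

On a binary32 platform, 1.0f has the pattern 0x3F800000, whose set bits occupy positions 23 to 29. An 11-bit left shift pushes all of them past bit 31, so shift_attempt(float_to_bits(1.0f)) is 0, which matches the output reported in the question.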
Upvotes: 1
Reputation: 51845
As mentioned in the comments, to multiply a floating-point number by a power of 2 (assuming, as is likely, that it is represented in IEEE-754 format), we can just add that power to the (binary) exponent part of the representation.
For a single-precision (32-bit) float value, that exponent is stored in bits 30-23, and the following code shows how to extract those, add the required value (11, because 2048 = 2^11), then replace the exponent bits with that modified value.
#include <stdint.h>

uint32_t fmul2048(uint32_t f)
{
    #define EXPONENT 0x7F800000u
    #define SIGN_BIT 0x80000000u
    uint32_t expon = (f & EXPONENT) >> 23; // Get exponent value
    f &= ~EXPONENT; // Remove old exponent
    expon += 11; // Adding 11 to exponent multiplies by 2^11 (= 2048);
    if (expon > 254) return EXPONENT | (f & SIGN_BIT); // Too big: return +/- Inf
    f |= (expon << 23); // Insert modified exponent
    return f;
}
There will, no doubt, be some "bit trickery" that can be applied to make the code smaller and/or more efficient; but I have avoided doing so in order to keep the code clear. I have also included one error check (for a too large exponent) and the code returns the standard representation for +/- Infinity (all exponent bits set to 1, and keeping the original sign) if that test fails. (I leave other error-checking as an "exercise for the reader".)
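As one possible take on that exercise, here is a sketch that adds the remaining cases from the question: zero, infinity and NaN are returned unchanged, and sub-normals are scaled by doubling the fraction. The name fmul2048_checked and the FRACTION mask are hypothetical additions, and the sub-normal handling is one technique among several, assuming IEEE-754 binary32.

```c
#include <stdint.h>

#define EXPONENT 0x7F800000u
#define SIGN_BIT 0x80000000u
#define FRACTION 0x007FFFFFu /* hypothetical mask for the 23 fraction bits */

/* Hypothetical variant of fmul2048 with the extra checks filled in. */
uint32_t fmul2048_checked(uint32_t f)
{
    uint32_t sign  = f & SIGN_BIT;
    uint32_t expon = (f & EXPONENT) >> 23;
    uint32_t frac  = f & FRACTION;

    if (expon == 255) return f;        /* Inf or NaN: unchanged */
    if (expon == 0) {
        if (frac == 0) return f;       /* +0 or -0: unchanged */
        /* Sub-normal: double the fraction up to 11 times, stopping early
           once bit 23 (the implicit leading 1 of a normal float) appears. */
        uint32_t doublings = 11;
        while (doublings > 0 && frac < 0x00800000u) {
            frac <<= 1;
            doublings--;
        }
        /* With bit 23 set, frac already reads as biased exponent 1; any
           doublings still owed are added to the exponent field. */
        return sign | (frac + (doublings << 23));
    }
    expon += 11;                       /* multiply by 2^11 */
    if (expon > 254) return sign | EXPONENT; /* too big: +/- Inf */
    return sign | (expon << 23) | frac;
}
```

For example, the pattern 0x00400000 (the sub-normal 2^-127) needs one doubling to reach bit 23, leaving ten for the exponent field, which gives 0x05800000 (2^-116).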
Upvotes: 2