MetallicPriest

Reputation: 30841

How to get the sign, mantissa and exponent of a floating point number

I have a program, which is running on two processors, one of which does not have floating point support. So, I need to perform floating point calculations using fixed point in that processor. For that purpose, I will be using a floating point emulation library.

I first need to extract the signs, mantissas and exponents of floating point numbers on the processor which does support floating point. So, my question is: how can I get the sign, mantissa and exponent of a single precision floating point number?

Following the format from this figure,

That is what I've done so far, but apart from the sign, neither the mantissa nor the exponent is correct. I think I'm missing something.

void getSME( int *s, int *m, int *e, float number )
{
    unsigned int* ptr = (unsigned int*)&number;

    *s = *ptr >> 31;
    *e = *ptr & 0x7f800000;
    *e >>= 23;
    *m = *ptr & 0x007fffff;
}

Upvotes: 75

Views: 151505

Answers (9)

Steve Summit

Reputation: 48133

How to get the sign, mantissa and exponent of a floating point number?

You're partway there. There are actually three different parts to this problem:

  1. Get at the underlying bits of the floating-point number
  2. Extract the raw sign, exponent, and fraction bits
  3. Convert those raw bits to their scaled representations

You've got a good enough handle on #2. Your approach to #1 might or might not work. And then #3 is probably what you were missing.

(This will be a somewhat long answer. Properly extracting — and interpreting — the components of a floating-point number is a bit trickier than you might think. I'm going to try to cover all the gory details, without leaving anything out.)

There are three good ways of getting at the underlying bits: (a) unions, (b) char pointers, or (c) memcpy. But the approach you've used, (d) pointers to something larger than char (in this case, unsigned int) is not guaranteed to work.

The union approach might look like this:

union fltunion {
    float f;
    uint32_t i;
} u;

u.f = number;
uint32_t i = u.i;
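For comparison, option (c) could be sketched as below. This is only an illustration: the function name float_bits is made up for this example, and it assumes float is a 32-bit IEEE 754 binary32 value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the object representation of a float
   into a 32-bit unsigned integer. Assumes sizeof(float) == 4. */
uint32_t float_bits(float number)
{
    uint32_t i;
    memcpy(&i, &number, sizeof i);  /* byte-wise copy, no pointer aliasing involved */
    return i;
}
```

Calling float_bits(123.125f) on such a platform should give the same 0x42f64000 that the union produces.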

I'll have more to say about subproblem (1) later. For subproblem (2), your code is fine. But it's important to realize that what we're extracting here are the raw bit fields — they are not yet useful, scaled values. So I'm going to store them in separately-named "raw" variables, emphasizing that fact. Also, because it can be clearer, I'm going to shift right, then mask off the bits I want:

unsigned int rawsign = (i >> 31) & 0x01;
unsigned int rawexp  = (i >> 23) & 0xff;
unsigned int rawfrac =  i        & 0x7fffff;

You can also make this clearer still by using an auxiliary function or macro to automatically construct a mask where N gives you the number of '1' bits:

#define MASK(N) ((1u << (N)) - 1)

unsigned int rawsign = (i >> 31) & MASK(1);
unsigned int rawexp  = (i >> 23) & MASK(8);
unsigned int rawfrac =  i        & MASK(23);

But now comes the important, easy-to-overlook, and slightly tricky part: All three of those raw values, rawsign, rawexp, and rawfrac, need to be explicitly converted to useful, scaled values.

The sign is easy. If it's 0 we have a positive number, or if it's 1 we have a negative number:

int sign;
if(rawsign == 0) sign = 1;
else             sign = -1;

Or of course you could be a C programmer and write

int sign = (rawsign == 0) ? 1 : -1;

The exponent is a bit trickier. It's an 8-bit number, so it can range from 0 to 255. But there are three separate cases. If it's in the slightly narrower range from 1 to 254, then we're dealing with a normal floating-point value. But if the raw exponent field is at its minimum or maximum value, 0 or 255, we're dealing with something special.
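As a rough sketch of those three cases, the raw fields could be classified like this. The enum and the function name classify_raw are invented for this illustration; the meaning of the special cases (zero, subnormal, infinity, NaN) follows the IEEE 754 binary32 layout.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum float_class { FC_ZERO, FC_SUBNORMAL, FC_NORMAL, FC_INFINITE, FC_NAN };

/* Classify a binary32 value from its raw 8-bit exponent field
   and raw 23-bit fraction field. */
enum float_class classify_raw(unsigned int rawexp, uint32_t rawfrac)
{
    if (rawexp == 0)      /* minimum exponent: zero or subnormal */
        return rawfrac == 0 ? FC_ZERO : FC_SUBNORMAL;
    if (rawexp == 255)    /* maximum exponent: infinity or NaN */
        return rawfrac == 0 ? FC_INFINITE : FC_NAN;
    return FC_NORMAL;     /* 1..254: ordinary values */
}
```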

Assuming we're dealing with a normal exponent, the scaled exponent is the raw exponent, minus 127:

int exponent = rawexp - 127;

And then we come to the truly magical part. For a normal floating-point number, that bit-field rawfrac we extracted contains the binary fraction bits to the right of the decimal point (actually we should more properly call it the "binary point"). And then to the left of the binary point there's automatically a 1. So if the 23 binary bits in rawfrac are fffffffffffffffffffffff, then our actual mantissa — actually we can more properly call it the "significand" — is:

1.fffffffffffffffffffffff

Study this carefully to make sure you understand it. That "1." part on the left sort of appeared out of thin air. It's commonly called the "hidden 1 bit" or the "implicit 1 bit". The significand of a normal single-precision floating-point number actually has 24 bits of precision, even though only 23 of them are stored as such, because the leading bit is always 1.

And then the other tricky part is that there's no good, obvious way to represent this 'significand' number in C code. Strictly speaking, it's a binary fraction. It is not an integer. It's not a floating-point number, either (if anything it's fixed-point) and even if it could be floating, our task here is to deconstruct a floating-point number, so it would be kind of weird to use a floating-point variable to contain one of our allegedly-deconstructed components. So for the moment we'll store our significand as another unsigned integer, but we'll have to remember that there's actually a binary point in it, between its first and second bits.

Here's how the final code might look. Again, this is only valid for normal values, which I'll make explicit with an if statement on rawexp's value:

int exponent;
unsigned int significand;

if(rawexp > 0 && rawexp < 255) {
    exponent = rawexp - 127;
    significand = (1u << 23) | rawfrac;
}

So now we've got our scaled sign, exponent, and significand. Theoretically we can recover the floating-point number we originally started with by computing sign × significand × 2^exponent, or in C:

sign * significand * pow(2, exponent)

So let's try this. Let's start with a number of 123.125. After running it through the union, we get 0x42f64000, which doesn't look like anything, but that's commonplace with floating-point values. After extracting the raw values, we get

rawsign = 0
rawexp  = 0x85
rawfrac = 0x764000

which don't really look like anything, either.

After converting from raw to scaled values, we get

  sign = 1
  exponent = 6
  significand = 0xf64000 = 16138240

These values are still rather inscrutable, although if you compare the raw fraction bits and the converted significand, 0x764000 versus 0xf64000, you can at least see that one more '1' bit got turned on.

So now let's run these three numbers through that expression I suggested:

sign * significand * pow(2, exponent)

But if you try it, you'll get 1.03285e+09, which is nothing like our original value 123.125. What happened?

I know exactly what "mistake" I made, so I can pull a rabbit out of a hat and point out that 1.03285e+09 ÷ 123.125 is 8388608, and that 8388608 is exactly 2^23, and that number 23 in there has got to mean something, right?

It does. Remember a few paragraphs back when I said that we'd have to remember that there's actually a binary point in that significand value? Well, here we've basically forgotten that. By computing significand * pow(2, exponent), we're treating significand as an integer, not as a binary fraction with a binary point between the first and second bits.

So how can we fix this? Do we have to convert significand to a fraction somehow? No, there's a trick. In binary, sliding the binary point left or right is equivalent to multiplying or dividing by powers of 2. (This is of course just like the way we can slide the decimal point left or right in decimal numbers, and get the same effect as multiplying or dividing by powers of 10.) So let's look at that significand value.

In our example, we got a value of 0xf64000, or 16138240 in decimal, or 111101100100000000000000 in binary, which we were supposed to interpret as the 24-bit binary fraction

1.11101100100000000000000

with the binary point between the first and second bits, as shown, which works out to 1.923828125.
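To check that interpretation, one possible sketch divides the integer by 2^23, which is the same as moving the binary point 23 places to the left. The function name significand_as_fraction is hypothetical, and a double is used here only to display the value.

```c
#include <stdint.h>

/* Hypothetical helper: read a 24-bit integer significand as the binary
   fraction 1.fffff..., i.e. divide by 2^23 = 8388608. */
double significand_as_fraction(uint32_t significand)
{
    return significand / 8388608.0;
}
```

For the example value 0xf64000 this gives 1.923828125.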

But we stored it in an integer, and by treating it as one, we acted as if it were the number

111101100100000000000000.0

that is, with the binary point all the way at the right. So by, in effect, shifting the binary point to the right by 23 bits, it was as if we multiplied the significand by 2^23, and that's precisely why we ended up with a final result that was too large by a factor of 2^23.

But now that we know what's going on, the fix is easy: just multiply by a little less (or, really, by rather a lot less). We're calling pow to compute a power of two to multiply by, based on the exponent, so all we have to do is subtract 23 there, meaning we multiply by 2^23 less, and this will exactly cancel out the "mistake" we made by treating significand as an integer. Our final result, then, is

sign * significand * pow(2, exponent - 23)

and if you try it, you'll find that, at last, out pops a value of exactly 123.125, and this confirms that we've extracted the sign, exponent, and significand correctly (as long as we remember to interpret the significand correctly!).

To be very clear, the multiplication we wanted to do was

1 * 1.923828125 * pow(2, 6) = 123.125

while the multiplication we actually did was

1 * 16138240 * pow(2, -17) = 123.125

but in both cases, we get the same answer.

If you're serious about floating point, there's another function in <math.h> called ldexp which is specifically designed for multiplying by powers of 2 when constructing or deconstructing floating-point values. We could use it like this:

sign * ldexpf(significand, exponent - 23)

ldexpf is a variant of ldexp that's specifically for use with float as opposed to double values.
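Putting the pieces above together, a round trip for normal values might be sketched as follows. The function name rebuild_normal is invented for this example, memcpy stands in for the union, and returning NAN for the special cases is just a placeholder choice, since those cases are not handled here.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decompose a normal binary32 float and rebuild it as
   sign * significand * 2^(exponent - 23). Zero, subnormal, infinite and
   NaN inputs are not handled; the sketch returns NAN for them. */
float rebuild_normal(float number)
{
    uint32_t i;
    memcpy(&i, &number, sizeof i);

    unsigned int rawsign = (i >> 31) & 0x01;
    unsigned int rawexp  = (i >> 23) & 0xff;
    uint32_t     rawfrac =  i        & 0x7fffff;

    if (rawexp == 0 || rawexp == 255)
        return NAN;                       /* special cases, out of scope here */

    int sign = (rawsign == 0) ? 1 : -1;
    int exponent = (int)rawexp - 127;     /* remove the bias */
    uint32_t significand = (UINT32_C(1) << 23) | rawfrac;  /* add the implicit 1 */

    /* significand fits in 24 bits, so the conversion to float is exact */
    return sign * ldexpf((float)significand, exponent - 23);
}
```

With 123.125f as input, the intermediate values should match the ones worked out above (sign 1, exponent 6, significand 0xf64000), and the result compares equal to the input.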

Stay tuned for our next episode, in which we'll cover the "special" values that result when the raw exponent field is at its minimum or maximum value.

Upvotes: 0

ikegami

Reputation: 386706

A solution that doesn't use bit fields, since their order is implementation-defined:

float f = ...;

// "Cast" a `float` to a `uint32_t`.
uint32_t i = ( union { float f; uint32_t i; } ){ .f = f }.i;

uint32_t sign     = ( i >> 31 ) & 0x1;
uint32_t exponent = ( i >> 23 ) & 0xFF;
uint32_t fraction =   i         & 0x7FFFFF;

Demo:

#include <assert.h>
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

// `float` is assumed to be a IEEE single-precision float
// in the same byte order as a uint32_t.

void test( float f ) {
   static_assert( sizeof( float ) == sizeof( uint32_t ) );

   // "Cast" a `float` to a `uint32_t`.
   uint32_t i = ( union { float f; uint32_t i; } ){ .f = f }.i;

   uint32_t sign     = ( i >> 31 ) & 0x1;
   uint32_t exponent = ( i >> 23 ) & 0xFF;
   uint32_t fraction =   i         & 0x7FFFFF;

   if ( exponent == 0xFF ) {
      if ( fraction == 0 ) {
         printf( "%sInf\n", sign ? "-" : "+" );
      } else {
         printf( "A NaN\n" );
      }
   }
   else if ( exponent == 0 ) {
      if ( fraction == 0 ) {
         printf( "%s0\n", sign ? "-" : "+" );
      } else {
         printf( "%s0x0.%06"PRIx32"p-126\n", sign ? "-" : "+", fraction << 1 );
      }
   }
   else {
      int unbiased_exponent = ((int)exponent) - 127;
      printf( "%s0x1.%06"PRIx32"p%+d\n", sign ? "-" : "+", fraction << 1, unbiased_exponent );
   }
}

int main( void ) {
   test( 0.0f );
   test( -0.0f );
   test( -1.0f );
   test( 50.0f );
   test( +0x1.000002p-126f );  // Second-smallest normal positive number.
   test( -0x1.000002p-126f );  // Second-smallest normal negative number (by magnitude).
   test( +0x0.000002p-126f );  // Smallest positive number.
   test( -0x0.000002p-126f );  // Smallest negative number.
   test( +INFINITY );
   test( -INFINITY );
   test( NAN );
}

Output:

+0
-0
-0x1.000000p+0
+0x1.900000p+5
+0x1.000002p-126
-0x1.000002p-126
+0x0.000002p-126
-0x0.000002p-126
+Inf
-Inf
A NaN

Try it on Compiler Explorer.

Upvotes: 1

AymenTM

Reputation: 569

See this IEEE_754_types.h header for union types that extract the fields of float, double and long double (endianness handled). Here is an extract:

/*
** - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
**  Single Precision (float)  --  Standard IEEE 754 Floating-point Specification
*/

# define IEEE_754_FLOAT_MANTISSA_BITS (23)
# define IEEE_754_FLOAT_EXPONENT_BITS (8)
# define IEEE_754_FLOAT_SIGN_BITS     (1)

.
.
.

# if (IS_BIG_ENDIAN == 1)
    typedef union {
        float value;
        struct {
            __int8_t   sign     : IEEE_754_FLOAT_SIGN_BITS;
            __int8_t   exponent : IEEE_754_FLOAT_EXPONENT_BITS;
            __uint32_t mantissa : IEEE_754_FLOAT_MANTISSA_BITS;
        };
    } IEEE_754_float;
# else
    typedef union {
        float value;
        struct {
            __uint32_t mantissa : IEEE_754_FLOAT_MANTISSA_BITS;
            __int8_t   exponent : IEEE_754_FLOAT_EXPONENT_BITS;
            __int8_t   sign     : IEEE_754_FLOAT_SIGN_BITS;
        };
    } IEEE_754_float;
# endif

And see dtoa_base.c for a demonstration of how to convert a double value to string form.

Furthermore, check out section 1.2.1.1.4.2 - Floating-Point Type Memory Layout of the C/CPP Reference Book. It explains, in simple terms, the memory representation/layout of all the floating-point types and how to decode them (with illustrations), following the actual IEEE 754 Floating-Point specification.

It also has links to really good resources that go even deeper.

Upvotes: 2

user246672

Reputation:

  1. Don't make functions that do multiple things.
  2. Don't mask then shift; shift then mask.
  3. Don't mutate values unnecessarily because it's slow, cache-destroying and error-prone.
  4. Don't use magic numbers.
/* NaNs, infinities, denormals unhandled */
/* assumes sizeof(float) == 4 and uses ieee754 binary32 format */
/* assumes two's-complement machine */
/* C99 */
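#include <math.h>
#include <stdint.h>

/* signbit() also reports 1 for -0.0; a comparison such as (f) <= -0.0 would wrongly flag +0.0 */
#define SIGN(f) (signbit(f) ? 1 : 0)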

#define AS_U32(f) (*(const uint32_t*)&(f))
#define FLOAT_EXPONENT_WIDTH 8
#define FLOAT_MANTISSA_WIDTH 23
#define FLOAT_BIAS ((1<<(FLOAT_EXPONENT_WIDTH-1))-1) /* 2^(e-1)-1 */
#define MASK(width)  ((1<<(width))-1) /* 2^w - 1 */
#define FLOAT_IMPLICIT_MANTISSA_BIT (1<<FLOAT_MANTISSA_WIDTH)

/* correct exponent with bias removed */
int float_exponent(float f) {
  return (int)((AS_U32(f) >> FLOAT_MANTISSA_WIDTH) & MASK(FLOAT_EXPONENT_WIDTH)) - FLOAT_BIAS;
}

/* of non-zero, normal floats only */
int float_mantissa(float f) {
  return (int)(AS_U32(f) & MASK(FLOAT_MANTISSA_WIDTH)) | FLOAT_IMPLICIT_MANTISSA_BIT;
}

/* Hacker's Delight book is your friend. */

Upvotes: 2

eran

Reputation: 6931

I think it is better to use unions to do the casts; it is clearer.

#include <stdio.h>

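/* Bit-field order is implementation-defined; this layout assumes a
   little-endian target that allocates bit-fields from the low bits up. */
typedef union {
  float f;
  struct {
    unsigned int mantissa : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
  } parts;
} float_cast;

int main(void) {
  float_cast d1 = { .f = 0.15625f };
  printf("sign = %x\n", (unsigned int)d1.parts.sign);
  printf("exponent = %x\n", (unsigned int)d1.parts.exponent);
  printf("mantissa = %x\n", (unsigned int)d1.parts.mantissa);
}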

Example based on http://en.wikipedia.org/wiki/Single_precision

Upvotes: 43

Maxim Egorushkin

Reputation: 136525

On Linux, the glibc-headers package provides the header <ieee754.h> with floating point type definitions, e.g.:

union ieee754_double
  {
    double d;

    /* This is the IEEE 754 double-precision format.  */
    struct
      {
#if __BYTE_ORDER == __BIG_ENDIAN
    unsigned int negative:1;
    unsigned int exponent:11;
    /* Together these comprise the mantissa.  */
    unsigned int mantissa0:20;
    unsigned int mantissa1:32;
#endif              /* Big endian.  */
#if __BYTE_ORDER == __LITTLE_ENDIAN
# if    __FLOAT_WORD_ORDER == __BIG_ENDIAN
    unsigned int mantissa0:20;
    unsigned int exponent:11;
    unsigned int negative:1;
    unsigned int mantissa1:32;
# else
    /* Together these comprise the mantissa.  */
    unsigned int mantissa1:32;
    unsigned int mantissa0:20;
    unsigned int exponent:11;
    unsigned int negative:1;
# endif
#endif              /* Little endian.  */
      } ieee;

    /* This format makes it easier to see if a NaN is a signalling NaN.  */
    struct
      {
#if __BYTE_ORDER == __BIG_ENDIAN
    unsigned int negative:1;
    unsigned int exponent:11;
    unsigned int quiet_nan:1;
    /* Together these comprise the mantissa.  */
    unsigned int mantissa0:19;
    unsigned int mantissa1:32;
#else
# if    __FLOAT_WORD_ORDER == __BIG_ENDIAN
    unsigned int mantissa0:19;
    unsigned int quiet_nan:1;
    unsigned int exponent:11;
    unsigned int negative:1;
    unsigned int mantissa1:32;
# else
    /* Together these comprise the mantissa.  */
    unsigned int mantissa1:32;
    unsigned int mantissa0:19;
    unsigned int quiet_nan:1;
    unsigned int exponent:11;
    unsigned int negative:1;
# endif
#endif
      } ieee_nan;
  };

#define IEEE754_DOUBLE_BIAS 0x3ff /* Added to exponent.  */
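For the single-precision case asked about, the same header also declares union ieee754_float and IEEE754_FLOAT_BIAS. A minimal sketch of using them might look like this; the three function names are made up for the example, and the code only builds where glibc's <ieee754.h> is available.

```c
#include <ieee754.h>  /* glibc-specific header, not part of standard C */

/* Hypothetical helpers built on glibc's union ieee754_float. */

/* 1 if the sign bit is set, 0 otherwise */
int glibc_float_negative(float x)
{
    union ieee754_float u;
    u.f = x;
    return (int)u.ieee.negative;
}

/* exponent with the bias removed (meaningful for normal values only) */
int glibc_float_exponent(float x)
{
    union ieee754_float u;
    u.f = x;
    return (int)u.ieee.exponent - IEEE754_FLOAT_BIAS;
}

/* the 23 stored fraction bits, without the implicit leading 1 */
unsigned int glibc_float_mantissa(float x)
{
    union ieee754_float u;
    u.f = x;
    return u.ieee.mantissa;
}
```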

Upvotes: 15

Pietro Braione

Reputation: 1153

My advice is to stick to rule 0 and not redo what the standard libraries already do, if that is enough. Look at math.h (cmath in standard C++) and the functions frexp, frexpf, and frexpl, which break a floating point value (double, float, or long double) into its significand and exponent parts. To extract the sign from the significand you can use signbit, also in math.h / cmath, or copysign (only C++11). Some alternatives, with slightly different semantics, are modf and ilogb/scalbn, available in C++11; http://en.cppreference.com/w/cpp/numeric/math/logb compares them, but I didn't find in the documentation how all these functions behave with +/-inf and NaNs. Finally, if you really want to use bitmasks (e.g., you desperately need to know the exact bits, and your program may have different NaNs with different representations, and you don't trust the above functions), at least make everything platform-independent by using the macros in float.h/cfloat.
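A minimal C sketch of the library route, assuming C99 <math.h>: the function name split_float is hypothetical. Note that frexpf uses a different normalization from the IEEE 754 bit fields: the fraction lies in [0.5, 1), so the exponent is one larger than the unbiased exponent of the stored format.

```c
#include <math.h>

/* Hypothetical helper: split x into a sign flag, a fraction in [0.5, 1)
   and an exponent such that |x| == fraction * 2^exponent.
   For x == 0, frexpf yields a zero fraction and a zero exponent. */
void split_float(float x, int *negative, float *fraction, int *exponent)
{
    *negative = signbit(x) ? 1 : 0;        /* works for -0.0 as well */
    *fraction = frexpf(fabsf(x), exponent);
}
```

For -123.125f this should produce negative = 1, fraction = 0.9619140625 and exponent = 7, since 0.9619140625 × 2^7 = 123.125.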

Upvotes: 37

Xymostech

Reputation: 9850

You're &ing the wrong bits. I think you want:

*s = *ptr >> 31;
*e = *ptr & 0x7f800000;
*e >>= 23;
*m = *ptr & 0x007fffff;

Remember, when you &, you are zeroing out bits that you don't set. So in this case, you want to zero out the sign bit when you get the exponent, and you want to zero out the sign bit and the exponent when you get the mantissa.

Note that the masks come directly from your picture. So, the exponent mask will look like:

0 11111111 00000000000000000000000

and the mantissa mask will look like:

0 00000000 11111111111111111111111

Upvotes: 12

Alexey Frunze

Reputation: 62106

Find out the format of the floating point numbers used on the CPU that directly supports floating point and break it down into those parts. The most common format is IEEE-754.

Alternatively, you could obtain those parts using a few special functions (double frexp(double value, int *exp); and double ldexp(double x, int exp);) as shown in this answer.

Another option is to use %a with printf().
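A small sketch of the %a route, assuming a C99 library: both function names are made up for this example. The float argument is promoted to double when passed to snprintf, and the exact digits printed can differ between C libraries, so the round trip through strtof is the more reliable check.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write x in hexadecimal floating-point notation
   (sign, hex significand, binary exponent) into buf.
   Returns whatever snprintf returns. */
int format_hex_float(char *buf, size_t size, float x)
{
    return snprintf(buf, size, "%a", x);
}

/* Hypothetical helper: parse such a string back; strtof accepts
   hexadecimal floating-point input since C99. */
float parse_hex_float(const char *s)
{
    return strtof(s, NULL);
}
```

For 123.125f the text is typically something like 0x1.ec8p+6, which shows the significand and the exponent 6 directly.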

Upvotes: 26
