Marek Basovník

Reputation: 147

How to operate (fast) on the mantissa and exponent parts of a double or float in C++?

I use C++ for computing various types of special functions (e.g. the Lambert function, iterative methods for evaluating inverses, etc.). In many cases there is an obviously better approach: working with the mantissa and exponent directly.

I found many answers on how to extract the mantissa and exponent parts; however, all of them were just "academic cases with poor computational speed" that are of little use to me (my motivation for operating on the mantissa and exponent is improved computational speed). Sometimes I need to call a specific function about a billion times (very expensive computation), so every bit of saved work counts. And using frexp, which returns the mantissa as a double, is not a good fit.

My questions are (for a C++ compiler with IEEE 754 floating point):

1) How do I read a specific bit of the mantissa of a float/double?

2) How do I read the whole mantissa of a float/double into an integer/byte array?

3) The same questions as 1) and 2) for the exponent.

4) The same questions as 1), 2), and 3) for writing.

To be clear, my motivation is faster computation through working with the mantissa or exponent directly. I suppose there must be a very simple solution.

Upvotes: 2

Views: 1985

Answers (3)

user113670

Reputation: 29

In C or C++, if x is an IEEE double and L is a 64-bit unsigned integer (uint64_t), the statement

memcpy(&L, &x, sizeof L);  /* well-defined, unlike the type-punned *((long *) &x), and compiles to a single move */

will allow accessing the bits directly. If s is an integer holding the sign (0 = '+', 1 = '-'), e is an integer holding the unbiased exponent, and f is a 64-bit unsigned integer holding the fraction bits, then

s = (int)(L >> 63);

e = ((int)(L >> 52) & 0x7FF) - 0x3FF;

f = L & 0x000FFFFFFFFFFFFF;

(If x is a normal number, i.e., not zero, denormal, infinity, or NaN, then the last expression should have 0x0010000000000000 added to it to account for the implicit high-order 1 bit in the IEEE double format.)

Repacking the sign, exponent and fraction back into a double is similar:

L = ((uint64_t)s << 63) | ((uint64_t)(e + 0x3FF) << 52) | (f & 0x000FFFFFFFFFFFFF);

memcpy(&x, &L, sizeof x);  /* again avoids the undefined *((double *) &L) */

The above code generates only a few machine instructions with no subroutine calls on 64-bit machines compiled with 64-bit code. With 32-bit code there is sometimes a call to do 64-bit arithmetic, but a good compiler will usually generate in-line code. In either case this approach is very fast.

A similar approach works in C# using L = BitConverter.DoubleToInt64Bits(x); and x = BitConverter.Int64BitsToDouble(L); or exactly as above if unsafe code is allowed.

Upvotes: -1

Marcus Müller

Reputation: 36412

In many cases there is an obviously better approach: working with the mantissa and exponent directly.

I know that feeling all too well from my signal processing work, but the truth is that exponents and mantissas aren't simply usable as separate numbers; IEEE 754 specifies quite a few special cases, biased offsets, etc.

I suppose there must be a very simple solution.

Engineering experience tells me: sentences ending in "a simple solution" usually aren't true.

"academic cases"

however, is definitely not true (I'll mention an example at the end).

There's very solid real-world usage of optimizations on IEEE 754 floats. However, given modern x86 processors' SIMD (single instruction, multiple data) capabilities and the overall fact that floating point is as fast as most bit-shifting operations, I generally suspect you're ill-advised to do this at the bit level yourself.

Generally, as IEEE 754 is a standard, you'll find documentation of how it's stored on your specific architecture everywhere. If you've looked, you should at least have found the Wikipedia article explaining how to do 1) and 2) (it's not as obscure as you seem to think).

What's more important: don't try to be smarter than your compiler. You probably won't be, unless you explicitly know how to vectorize multiple identical operations.

Experiment with your specific compiler's math optimizations. As mentioned, such bit-level tricks usually don't gain much nowadays; CPUs aren't necessarily slower doing float calculations than integer ones.

I'd rather look at your algorithms and search for optimization potential there.

Also, while I'm at it, let me pitch VOLK (Vector Optimized Library of Kernels), which is mainly a math library for signal processing. http://libvolk.org has an overview. Look into the kernels that start with 32f, for example 32f_expfast. You will notice there are different implementations: a generic one and CPU-optimized ones, one for each SIMD instruction set.

Upvotes: 6

Pete Becker

Reputation: 76438

You can copy the address of the fp value into an unsigned char* and treat the resulting pointer as the address of an array that overlays the fp value.

Upvotes: 1
