Pascal Cuoq
Pascal Cuoq

Reputation: 80276

Converting double to float without relying on the FPU rounding mode

Does anyone have handy the snippets of code to convert an IEEE 754 double to the immediately inferior (resp. superior) float, without changing or assuming anything about the FPU's current rounding mode?

Note: this constraint probably implies not using the FPU at all. I expect the simplest way to do it in these conditions is to read the bits of the double in a 64-bit long and to work with that.

You can assume the endianness of your choice for simplicity, and that the double in question is available through the d field of the union below:

union double_bits
{
  long i;
  double d;
};

I would try to do it myself but I am certain I would introduce hard-to-notice bugs for denormalized or negative numbers.

Upvotes: 5

Views: 1468

Answers (3)

Mark Lakata
Mark Lakata

Reputation: 20838

I posted code to do this here: https://stackoverflow.com/q/19644895/364818 and copied it below for your convenience.

    // d is IEEE double, but double is not natively supported.
    static float ConvertDoubleToFloat(void* d)
    {
        unsigned long long x;
        float f; // assumed to be IEEE float
        unsigned long long sign ;
        unsigned long long exponent;
        unsigned long long mantissa;

        memcpy(&x,d,8);

        // IEEE binary64 format (unsupported)
        sign     = (x >> 63) & 1; // 1
        exponent = ((x >> 52) & 0x7FF); // 11
        mantissa = (x >> 0) & 0x000FFFFFFFFFFFFFULL; // 52
        exponent -= 1023;

        // IEEE binary32 format (supported)
        exponent += 127; // rebase
        exponent &= 0xFF;
        mantissa >>= (52-23); // left justify

        x = mantissa | (exponent << 23) | (sign << 31);
        memcpy(&f,&x,4);

        return f;
    }

Upvotes: 2

Alok Singhal
Alok Singhal

Reputation: 96141

I think the following works, but I will state my assumptions first:

  • floating-point numbers are stored in IEEE-754 format on your implementation,
  • No overflow,
  • You have nextafterf() available (it's specified in C99).

Also, most likely, this method is not very efficient.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    /* Change to non-zero for superior, otherwise inferior */
    int superior = 0;

    /* double value to convert */
    double d = 0.1;

    float f;
    double tmp = d;

    if (argc > 1)
        d = strtod(argv[1], NULL);

    /* First, get an approximation of the double value */
    f = d;

    /* Now, convert that back to double */
    tmp = f;

    /* Print the numbers. %a is C99 */
    printf("Double: %.20f (%a)\n", d, d);
    printf("Float: %.20f (%a)\n", f, f);
    printf("tmp: %.20f (%a)\n", tmp, tmp);

    if (superior) {
        /* If we wanted superior, and got a smaller value,
           get the next value */
        if (tmp < d)
            f = nextafterf(f, INFINITY);
    } else {
        if (tmp > d)
            f = nextafterf(f, -INFINITY);
    }
    printf("converted: %.20f (%a)\n", f, f);

    return 0;
}

On my machine, it prints:

Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)

The idea is that I am converting the double value to a float value—this could be less than or greater than the double value depending upon the rounding mode. When converted back to double, we can check if it is smaller or greater than the original value. Then, if the value of the float is not in the right direction, we look at the next float number from the converted number in the original number's direction.

Upvotes: 3

stacker
stacker

Reputation: 68962

To do this job more accurately than just re-combine mantissa and exponent bit's check this out:

http://www.mathworks.com/matlabcentral/fileexchange/23173

regards

Upvotes: 3

Related Questions