tay10r

Reputation: 4337

Integer Conversion in Floating Point Arithmetic

I currently face the following dilemma:

1.0f * INT32_MAX != INT32_MAX

Evaluating 1.0f * INT32_MAX actually gives me INT32_MIN

I'm not completely surprised by this; I know that floating-point-to-integer conversions aren't always exact.

What is the best way to fix this problem?

The code I'm writing scales an array of floating-point values from the range -1.0f <= x <= 1.0f to the range INT32_MIN <= x <= INT32_MAX.

Here's what the code looks like:

void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        dst[i] = src[i] * INT32_MAX;
    }
}

Here's what I ended up with:

void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        double tmp = src[i];
        if (src[i] > 0.0f){
            tmp *= INT32_MAX;
        } else {
            tmp *= INT32_MIN;
            tmp *= -1.0;
        }
        dst[i] = tmp;
    }
}

Upvotes: 5

Views: 406

Answers (1)

Mr Lister

Reputation: 46539

In IEEE 754, 2147483647 is not representable as a single-precision float, which has only a 24-bit significand. A quick test shows that the result of 1.0f * INT32_MAX is rounded to 2147483648.0f, which is one more than the largest value a 32-bit int can hold.

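In other words, it is actually the conversion to int that causes the problem, not the float calculation, which happens to be only 1 off! Converting an out-of-range floating-point value to an integer type is undefined behavior in C, so the INT32_MIN you see is just what your platform happens to produce.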
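To see the rounding without triggering the out-of-range conversion, here is a small sketch that inspects the product as a double instead of casting it to int. The helper names fits_in_float and scaled_as_float are made up for this illustration and are not part of the question's code.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if v survives a round trip through
 * float unchanged, 0 if converting it to float rounds it. */
int fits_in_float(int32_t v)
{
    float f = (float)v;
    /* Compare in double, which holds every int32_t and every float exactly. */
    return (double)f == (double)v;
}

/* Hypothetical helper: computes x * INT32_MAX the way the original loop
 * does (in float), then widens the result to double for inspection.
 * No conversion to an integer type takes place here. */
double scaled_as_float(float x)
{
    float product = x * INT32_MAX;  /* stored as float, so the result is rounded to float */
    return (double)product;
}
```

With these, fits_in_float(INT32_MAX) should be 0 while fits_in_float(INT32_MIN) should be 1, since -2147483648 is a power of two, and scaled_as_float(1.0f) should come out as 2147483648.0.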

Anyway, the solution is to use double for the intermediate calculation. A double has a 53-bit significand, so 2147483647.0 is exactly representable as a double-precision number.
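A minimal sketch of that approach, assuming the input is supposed to lie in [-1.0, 1.0]. The function name convert_via_double and the clamping step are my own additions, not something from the question. NaN inputs are not handled here and would still hit the undefined conversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of convert() that does the scaling in double. */
void convert_via_double(int32_t *dst, const float *src, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        double x = src[i];              /* float to double is always exact */

        /* Assumption: clamp stray values so the product stays in int32_t range. */
        if (x > 1.0)  x = 1.0;
        if (x < -1.0) x = -1.0;

        /* The product is at most 2147483647.0 in magnitude, which a double
         * holds exactly; the cast truncates toward zero. */
        dst[i] = (int32_t)(x * INT32_MAX);
    }
}
```

Note that this maps -1.0f to -INT32_MAX rather than INT32_MIN. If you need the full negative range, scale negative inputs by -(double)INT32_MIN instead, as your second version does.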

Upvotes: 6
