Reputation: 4337
I currently face the following dilemma:
1.0f * INT32_MAX != INT32_MAX
Evaluating 1.0f * INT32_MAX actually gives me INT32_MIN.
I'm not completely surprised by this; I know floating-point to integer conversions aren't always exact.
What is the best way to fix this problem?
The code I'm writing scales an array of floating-point values from -1.0f <= x <= 1.0f to INT32_MIN <= x <= INT32_MAX.
Here's what the code looks like:
void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        dst[i] = src[i] * INT32_MAX;
    }
}
Here's what I ended up with:
void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        double tmp = src[i];
        if (src[i] > 0.0f){
            /* positive values scale by 2^31 - 1 */
            tmp *= INT32_MAX;
        } else {
            /* negative values scale by -INT32_MIN = 2^31; the
               second multiply restores the original sign */
            tmp *= INT32_MIN;
            tmp *= -1.0;
        }
        dst[i] = tmp;
    }
}
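For completeness, here is a minimal program that reproduces the symptom; since the out-of-range cast is undefined behavior, the exact result may differ on other platforms:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    float x = 1.0f;
    /* the float product is 2147483648.0f; casting it to int32_t is out of range */
    int32_t y = (int32_t)(x * INT32_MAX);
    printf("%d\n", (int)y); /* prints -2147483648 on my machine */
    return 0;
}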
Upvotes: 5
Views: 406
Reputation: 46539
In IEEE754, 2147483647 is not representable in a single precision float. A quick test shows that INT32_MAX is rounded up to 2147483648.0f when converted to float, so 1.0f * INT32_MAX evaluates to 2147483648.0f, which can't be represented in an int (in C, converting an out-of-range floating-point value to an integer is undefined behavior, which is why you happen to see INT32_MIN). In other words, it is actually the conversion to int that causes the problem, not the float calculation, which happens to be only 1 off!
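A quick test along those lines (output assumes standard IEEE754 single precision):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* INT32_MAX rounds up to 2^31 when converted to single precision */
    printf("%.1f\n", 1.0f * INT32_MAX);  /* prints 2147483648.0 */
    printf("%.1f\n", (float)INT32_MAX);  /* prints 2147483648.0 */
    return 0;
}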
Anyway, the solution is to use double for the intermediate calculation: 2147483647.0 is exactly representable as a double precision number.
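A minimal sketch of that approach, assuming inputs stay within [-1.0f, 1.0f] as described (anything outside that range would need a clamp before the cast):

#include <stdint.h>
#include <stddef.h>

void convert(int32_t * dst, const float * src, size_t count){
    for (size_t i = 0; i < count; i++){
        /* promote to double: 2147483647.0 is exact there, so the
           product stays within int32_t range for inputs in [-1, 1] */
        double tmp = (double)src[i] * INT32_MAX;
        dst[i] = (int32_t)tmp;
    }
}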
Upvotes: 6