nvhausid

Reputation: 149

c++, float to int casting

I just want to understand the cases below:

#define MAP_CELL_SIZE_MIN 0.1f

float mMapHeight = 256;
float mScrHeight = 320;

int mNumRowMax;

case 1:

mNumRowMax = mMapHeight/( MAP_CELL_SIZE_MIN * mScrHeight );

mNumRowMax is now 7, but it should be 8 (256/32). If I change the definition of MAP_CELL_SIZE_MIN to plain 0.1, the result is correct and mNumRowMax is 8. So what is wrong with the 'f' suffix?

case 2:

float tmp = mMapHeight/( MAP_CELL_SIZE_MIN * mScrHeight );//tmp = 8.0
mNumRowMax = tmp;

mNumRowMax is now 8. Can anybody help me understand what goes wrong in the first case, where mNumRowMax is 7?

Upvotes: 1

Views: 5139

Answers (3)

Daniel Fischer

Reputation: 183873

What happens is described by this passage of the standard:

5 [expr]

10 The values of floating operands and of the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.⁵⁵⁾

55) The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17.

(C++03; practically identical to 6.3.1.8(2) in C99 and in the n1570 draft of C11; I'm confident that the gist is the same in C++11.)

In the following, I assume an IEEE-754 like binary floating point representation.

In a fractional hexadecimal notation,

1/10 = 1/2 * 3/15
     = 1/2 * 0.33333333333...
     = 2^(-4) * 1.999999999...

so when that is rounded to b bits of precision, you get

2^(-4) * 1.99...9a   // if b ≡ 0 (mod 4) or b ≡ 1 (mod 4)
2^(-4) * 1.99...98   // if b ≡ 2 (mod 4) or b ≡ 3 (mod 4)

where the last hex digit of the fractional part is cut off after its 3, 4, 1, 2 most significant bits for b ≡ 0, 1, 2, 3 (mod 4) respectively.
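To check the two rounded values of 0.1 on a given machine, something like the following sketch may help. It assumes IEEE-754 binary32/binary64 for float/double; the function names are made up for illustration. The integer significands are just the hex digits above written out: 0xCCCCCD for 24 bits and 0x1999999999999A for 53 bits.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical helpers (not from the question): rebuild the rounded values
// of 0.1 from their integer significands, assuming IEEE-754 formats.

// 24-bit significand 0xCCCCCD (= 13421773), i.e. 1.99999a (hex) * 2^-4
float tenth_as_float() {
    return std::ldexp(13421773.0f, -27);
}

// 53-bit significand 0x1999999999999A (= 7205759403792794),
// i.e. 1.999999999999a (hex) * 2^-4
double tenth_as_double() {
    return std::ldexp(7205759403792794.0, -56);
}

// Prints both literals in hexadecimal floating-point notation.
// The exact text produced by %a is implementation-defined.
void print_tenths() {
    std::printf("0.1f = %a\n", static_cast<double>(0.1f));
    std::printf("0.1  = %a\n", 0.1);
}
```

Calling print_tenths() from main shows the hex digits directly. In both formats the last digit is rounded up, so both stored values are slightly larger than 1/10.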

Now 320 = 2^6*(2^2 + 1), so the result of r * 320 where r is 0.1 rounded to b bits, is, in full precision (ignoring the power of 2),

   6.66...68
 + 1.99...9a
 -----------
   8.00...02

with b+3 bits for b ≡ 0 (mod 4) or b ≡ 1 (mod 4) and

   6.66...60
 + 1.99...98
 -----------
   7.ff...f8

with b+2 bits for b ≡ 2 (mod 4) or b ≡ 3 (mod 4).

In each case, rounding the result to b bits of precision yields exactly 32 and then you get 256/32 = 8 as a final result. But if the intermediate result with greater precision is used, the calculated result of

256/(0.1 * 320)

is slightly smaller or larger than 8.

With the typical 32-bit float with 24 (23+1) bits of precision, if the intermediate results are represented with a precision of at least 53 bits:

0.1f = 1.99999ap-4
0.1f * 320 = 32*(1 + 2^(-26))
256/(0.1f * 320) = 8/(1 + 2^(-26)) = 8 * (1 - 2^(-26) + 2^(-52) - ...)

In case 1, the result is directly converted¹ to int from the intermediate result. Since the intermediate result is slightly smaller than 8, it gets truncated to 7.

In case 2, the intermediate result is stored in a float before converting to int, hence it is rounded to 24 bits of precision first, resulting in exactly 8.
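Whether the compiler really keeps excess precision depends on the platform and flags, so the effect is not always reproducible with the original code. A rough way to see both cases anyway is to emulate the wider intermediate explicitly with double. This is only a sketch with made-up function names, assuming IEEE-754 float and double:

```cpp
// Emulates "excess precision" portably: the float literal 0.1f is widened
// to double and the arithmetic is carried out in double.

// The wider intermediate result, slightly smaller than 8.
double wide_quotient() {
    return 256.0 / (static_cast<double>(0.1f) * 320.0);
}

// Like case 1: truncate the wide intermediate directly to int.
int rows_from_wide_intermediate() {
    return static_cast<int>(wide_quotient());
}

// Like case 2: round to float first, then convert to int.
int rows_via_float() {
    float narrowed = static_cast<float>(wide_quotient());
    return static_cast<int>(narrowed);
}
```

Under these assumptions the first function gives 7 and the second gives 8, matching the two cases in the question.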

Now if you leave off the f suffix, 0.1 is a double (presumably with 53 bits of precision), the two floats are promoted to double for the calculation, and

0.1 = 1.999999999999ap-4
0.1 * 320 = 32*(1 + 2^(-54))
256/(0.1 * 320) = 8 * (1 - 2^(-54) + 2^(-108) - ...)

If the calculation is performed at double precision, 1 + 2^(-54) == 1 and already 0.1 * 320 == 32.
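The double-precision case can be checked the same way. The sketch below assumes the arithmetic is really evaluated in IEEE-754 double (FLT_EVAL_METHOD == 0, as is typical with SSE2 on x86-64); the function names are hypothetical. The exact product exceeds 32 by much less than half a unit in the last place, so it rounds to exactly 32.

```cpp
// Assumes double arithmetic without excess precision.

// True if the rounded product 0.1 * 320 is exactly 32.
bool product_is_exactly_32() {
    double tenth = 0.1;
    double product = tenth * 320.0;
    return product == 32.0;
}

// The question's expression with every operand promoted to double.
int rows_in_double() {
    double tenth = 0.1;
    return static_cast<int>(256.0 / (tenth * 320.0));
}
```

With those assumptions the division is 256/32 and the truncation to int gives 8.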

If the calculation is performed at extended precision with 64 bits of precision (think x87) or more, it is likely that the literal 0.1 isn't converted to double precision at all and directly used with the extended precision, which again leads to the multiplication 0.1 * 320 resulting in exactly 32.

If the literal 0.1 is used at double precision but the calculation is performed at higher precision, it would again yield 7 if the intermediate result is directly truncated to int from the representation with greater precision and 8 if the excess precision is removed before the conversion to int.

(Aside: gcc/g++ 4.5.1 yields 8 for all cases, regardless of optimisation level, on my 64-bit box. I haven't tried it on a 32-bit box.)

¹ I'm not entirely sure, but I think that's a violation of the standard: the excess precision should be removed first. Any language lawyers?

Upvotes: 2

Jonathan Wood

Reputation: 67175

It appears you are running into rounding errors.

A simple fix might be to use double instead of float.

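If that's not an option, then you might need to round to the nearest integer. For example, if you have a non-negative floating point value f, do the equivalent of int x = (int)(f + 0.5); (for negative values this trick rounds the wrong way, because the cast truncates toward zero).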
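A small sketch of the rounding idea, with hypothetical function names. It also shows std::lround from <cmath> as an alternative that handles negative values; that alternative is my addition, not part of the suggestion above.

```cpp
#include <cmath>

// The add-0.5-and-truncate trick; only correct for non-negative input.
int round_by_adding_half(float f) {
    return static_cast<int>(f + 0.5f);
}

// std::lround rounds to nearest, with halfway cases away from zero,
// and works for negative input as well.
long round_with_lround(float f) {
    return std::lround(f);
}
```

For example, both functions turn 7.6f into 8, but for -2.7f the first returns -2 while std::lround returns -3.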

Upvotes: 0

Some programmer dude

Reputation: 409136

When a floating point number is cast to an integer, the value is truncated toward zero, not rounded, i.e. the fractional part is simply "chopped off".
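A minimal illustration of that truncation (the function name is made up for this example):

```cpp
// Converting a floating point value to int discards the fractional part,
// i.e. it truncates toward zero.
int chop_to_int(double value) {
    return static_cast<int>(value);
}
```

So chop_to_int(7.999999) is 7 and chop_to_int(-7.9) is -7, which is why a result that is only slightly below 8 ends up as 7.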

Upvotes: 0
