Reputation: 149
I just want to be clear about the cases below:
#define MAP_CELL_SIZE_MIN 0.1f
float mMapHeight = 256;
float mScrHeight = 320;
int mNumRowMax;
case 1:
mNumRowMax = mMapHeight/( MAP_CELL_SIZE_MIN * mScrHeight );
mNumRowMax is now 7, but it should be 8 (256/32). If I change the definition of MAP_CELL_SIZE_MIN to just 0.1, the result is correct: mNumRowMax is 8. So what is wrong with the 'f' suffix?
case 2:
float tmp = mMapHeight/( MAP_CELL_SIZE_MIN * mScrHeight );//tmp = 8.0
mNumRowMax = tmp;
mNumRowMax is now 8. Can anybody help me understand what goes wrong in the first case, where mNumRowMax is 7?
Upvotes: 1
Views: 5139
Reputation: 183873
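What happens is that the standard allows intermediate floating-point results to be kept at a higher precision than their type requires: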
10 The values of floating operands and of the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.55)
55) The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17.
(C++03; practically identical 6.3.1.8(2) in C99 and the n1570 draft of C11; I'm confident that the gist is identical in C++11.)
In the following, I assume an IEEE-754 like binary floating point representation.
In a fractional hexadecimal notation,
1/10 = 1/2 * 3/15
= 1/2 * 0.33333333333...
= 2^(-4) * 1.999999999...
so when that is rounded to b bits of precision, you get
2^(-4) * 1.99...9a // if b ≡ 0 (mod 4) or b ≡ 1 (mod 4)
2^(-4) * 1.99...98 // if b ≡ 2 (mod 4) or b ≡ 3 (mod 4)
where the last hex digit of the fractional part is cut off after its 3, 4, 1 or 2 most significant bits for b ≡ 0, 1, 2, 3 (mod 4) respectively, and then rounded.
Now 320 = 2^6*(2^2 + 1), so the result of r * 320, where r is 0.1 rounded to b bits, is, in full precision (ignoring the power of 2),
  6.66...68
+ 1.99...9a
-----------
  8.00...02
with b+3 bits for b ≡ 0 (mod 4) or b ≡ 1 (mod 4), and
  6.66...60
+ 1.99...98
-----------
  7.ff...f8
with b+2 bits for b ≡ 2 (mod 4) or b ≡ 3 (mod 4).
In each case, rounding the result to b bits of precision yields exactly 32, and then you get 256/32 = 8 as the final result. But if the intermediate result with greater precision is used, the calculated result of 256/(0.1 * 320) is slightly smaller or larger than 8.
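The claim for general b can be checked with exact integer arithmetic. The following is only a sketch under my own framing, not part of the original argument: the helper names (tenth_mantissa, round_to_bits, product_error, product_rounds_to_32) are made up for illustration, and it assumes round-to-nearest with ties to even and 2 <= b <= 60 so that everything fits in 64 bits.

```cpp
#include <cstdint>

// Mantissa of 0.1 rounded to b bits, as an integer in [2^(b-1), 2^b).
// 0.1 = 1.6 * 2^-4 and 1.6 * 2^(b-1) = 2^(b+2) / 5; a tie cannot occur
// because 2^(b+2) is never a multiple of 5.
std::uint64_t tenth_mantissa(int b) {
    const std::uint64_t x = std::uint64_t{1} << (b + 2);
    return (x + 2) / 5;  // nearest integer to x / 5
}

// Round a positive integer to at most b significant bits, ties to even.
std::uint64_t round_to_bits(std::uint64_t v, int b) {
    int width = 0;
    for (std::uint64_t t = v; t != 0; t >>= 1) ++width;
    if (width <= b) return v;
    const int drop = width - b;
    const std::uint64_t half = std::uint64_t{1} << (drop - 1);
    const std::uint64_t rest = v & ((std::uint64_t{1} << drop) - 1);
    std::uint64_t kept = v >> drop;
    if (rest > half || (rest == half && (kept & 1))) ++kept;
    return kept << drop;
}

// r * 320 = 5 * M * 2^(3-b), and exactly 32 corresponds to 5 * M == 2^(b+2).
// This returns the difference 5 * M - 2^(b+2) of the unrounded product.
std::int64_t product_error(int b) {
    const std::uint64_t exact = std::uint64_t{1} << (b + 2);
    return static_cast<std::int64_t>(5 * tenth_mantissa(b)) -
           static_cast<std::int64_t>(exact);
}

// True if the product, rounded back to b bits, is exactly 32.
bool product_rounds_to_32(int b) {
    const std::uint64_t exact = std::uint64_t{1} << (b + 2);
    return round_to_bits(5 * tenth_mantissa(b), b) == exact;
}
```

For b = 24, product_error is positive, which matches the "slightly larger than 32" product that leads to a quotient slightly below 8.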
With the typical 32-bit float with 24 (23+1) bits of precision, if the intermediate results are represented with a precision of at least 53 bits:
0.1f = 1.99999ap-4
0.1f * 320 = 32*(1 + 2^(-26))
256/(0.1f * 320) = 8/(1 + 2^(-26)) = 8 * (1 - 2^(-26) + 2^(-52) - ...)
In case 1, the result is directly converted¹ to int from the intermediate result. Since the intermediate result is slightly smaller than 8, it gets truncated to 7.
In case 2, the intermediate result is stored in a float before converting to int, hence it is rounded to 24 bits of precision first, resulting in exactly 8.
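The two cases can be imitated on any machine by doing the arithmetic in double explicitly. This is a sketch, not the asker's code: using double as a stand-in for "excess precision" is my assumption (the asker's compiler presumably kept the value in wider hardware registers), and the function names are hypothetical.

```cpp
// Evaluate 256 / (0.1f * 320) with more precision than float.
// double stands in for whatever wider format the compiler used.
double quotient_in_excess_precision() {
    const float map_height = 256.0f;
    const float scr_height = 320.0f;
    const float cell_min = 0.1f;  // the float literal from the question
    return static_cast<double>(map_height) /
           (static_cast<double>(cell_min) * static_cast<double>(scr_height));
}

// Case 1: the wide intermediate result is converted straight to int.
int rows_case1() {
    return static_cast<int>(quotient_in_excess_precision());
}

// Case 2: the result is rounded to float first, then converted to int.
int rows_case2() {
    const float tmp = static_cast<float>(quotient_in_excess_precision());
    return static_cast<int>(tmp);
}
```

Here rows_case1() sees a value just below 8 and truncates it, while rows_case2() rounds to the nearest float (exactly 8) before the conversion.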
Now if you leave off the f suffix, 0.1 is a double (presumably with 53 bits of precision), the two floats are promoted to double for the calculation, and
0.1 = 1.999999999999ap-4
0.1 * 320 = 32*(1 + 2^(-54))
256/(0.1 * 320) = 8 * (1 - 2^(-54) + 2^(-108) - ...)
If the calculation is performed at double precision, 1 + 2^(-54) == 1 and already 0.1 * 320 == 32.
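A quick check of the double-literal case, as a sketch only. The volatile qualifier is my addition: it forces the product to be stored as a real 64-bit double, so the check also holds on builds that would otherwise keep it in a wider register. The function names are hypothetical.

```cpp
// Does 0.1 * 320, rounded to double precision, come out as exactly 32?
bool product_is_exactly_32() {
    volatile double product = 0.1 * 320.0;  // stored at double precision
    return product == 32.0;
}

// The question's computation with the double literal 0.1.
int rows_with_double_literal() {
    const float map_height = 256.0f;
    const float scr_height = 320.0f;
    volatile double product = 0.1 * scr_height;  // float promoted to double
    return static_cast<int>(map_height / product);
}
```

With the product already exactly 32, the division gives exactly 8 and the conversion to int has nothing to truncate.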
If the calculation is performed at extended precision with 64 bits of precision (think x87) or more, it is likely that the literal 0.1 isn't converted to double precision at all and is used directly with the extended precision, which again leads to the multiplication 0.1 * 320 resulting in exactly 32.
If the literal 0.1 is used at double precision but the calculation is performed at higher precision, it would again yield 7 if the intermediate result is directly truncated to int from the representation with greater precision, and 8 if the excess precision is removed before the conversion to int.
(Aside: gcc/g++ 4.5.1 yields 8 for all cases, regardless of optimisation level, on my 64-bit box; I haven't tried it on a 32-bit box.)
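One way to see how a particular build evaluates floating-point expressions is the standard FLT_EVAL_METHOD macro from <cfloat> (available since C++11). Connecting it to the result in the aside is my assumption, not something tested above, and describe_eval_method is a hypothetical helper name.

```cpp
#include <cfloat>
#include <string>

// Report how this build evaluates floating-point expressions,
// based on the standard FLT_EVAL_METHOD macro.
std::string describe_eval_method() {
    switch (FLT_EVAL_METHOD) {
        case 0:
            return "each operation uses the precision of its own type";
        case 1:
            return "float and double operations are evaluated as double";
        case 2:
            return "all operations are evaluated as long double";
        default:
            return "indeterminate or implementation-defined";
    }
}
```

A value of 0 means no excess precision, in which case both cases from the question should give 8; a value of 2 is the x87-style situation described above.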
¹ I'm not entirely sure, but I think that's a violation of the standard, it should first remove the excess precision. Any language lawyers?
Upvotes: 2
Reputation: 67175
It appears you are running into rounding errors.
A simple fix might be to use double instead of float.
If that's not an option, then you might need to round to the nearest integer instead of truncating. For example, if you have a non-negative floating point value f, do the equivalent of int x = (int)(f + 0.5);
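A small sketch of the rounding idea, with hypothetical function names. It also shows std::lround from <cmath>, which I am swapping in as an alternative; it rounds halfway cases away from zero and handles negative values, which the add-0.5 trick does not.

```cpp
#include <cmath>

// The add-0.5-then-truncate trick; only correct for non-negative input.
int round_by_adding_half(float f) {
    return static_cast<int>(f + 0.5);
}

// Library rounding to the nearest integer, halfway cases away from zero.
long round_with_lround(float f) {
    return std::lround(f);
}
```

For a value just below 8 both functions return 8. For a negative value such as -7.6 the first one returns -7, while std::lround returns -8.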
Upvotes: 0
Reputation: 409136
When a floating point number is cast to an integer, the value is truncated toward zero rather than rounded, i.e. the fractional part is simply "chopped off".
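A minimal illustration of that truncation (truncate_to_int is a hypothetical name used only for this sketch):

```cpp
// Conversion to int discards the fractional part, i.e. truncates toward zero.
int truncate_to_int(double value) {
    return static_cast<int>(value);
}
```

So a value like 7.9999999 becomes 7, and -7.9999999 becomes -7; neither is rounded to the nearest integer.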
Upvotes: 0