Reputation: 1163

Casting a float to an int gives an unexpected result

I wrote the code below:

 int vat = (int)(invoice.total * 0.08f);

Assume invoice.total = 36000. Then vat should be 2880, but it is 2879! I changed my code to:

float v = invoice.total * 0.08f;
int vat = (int)v;

Now vat has correct value (2880).

I wonder whether the cast operator () has higher precedence here. Also, the float is exactly 2880.0, not slightly less, so no rounding should happen!

Upvotes: 2

Answers (3)

Jeppe Stig Nielsen

Reputation: 62002

~~A float holds some "hidden" precision that is not shown. Try watching invoice.total.ToString("R"), and you will probably see that it is not exactly 36000.~~

Alternatively, this can be a result of your runtime choosing a "broader" storage location, like a 64-bit or 80-bit CPU register or similar, for the intermediate result invoice.total * 0.08f.

EDIT: You can throw away the effects arising from the runtime choosing a too wide storage location, by changing

(int)(invoice.total * 0.08f)

into

(int)(float)(invoice.total * 0.08f)

The extra cast, from float to float (sic!), looks like a no-op, but it does force the runtime to round and throw away that unwanted precision. This is poorly documented. [Will provide reference.] A related thread you might want to read: Are floating-point numbers consistent in C#? Can they be?
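As a minimal sketch of that extra cast, the two variants can be put side by side. The class and method names are hypothetical, not taken from the question. Only the narrowed variant has a result that does not depend on how the runtime stores the intermediate product:

```csharp
using System;

static class NarrowingDemo
{
    // May yield 2879 or 2880: the runtime is allowed to keep the product
    // in a wider format before it is truncated to int.
    public static int Truncate(float total, float rate)
    {
        return (int)(total * rate);
    }

    // The explicit (float) cast forces the product to be rounded to
    // single precision before the truncation to int.
    public static int NarrowThenTruncate(float total, float rate)
    {
        return (int)(float)(total * rate);
    }

    static void Main()
    {
        Console.WriteLine(Truncate(36000f, 0.08f));           // platform-dependent
        Console.WriteLine(NarrowThenTruncate(36000f, 0.08f)); // 2880
    }
}
```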

Your example is actually archetypical, so I have decided to go a bit more into detail. This stuff is well described in the section Differences Among IEEE 754 Implementations which is written as an addendum (by an anonymous author) to David Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic. So suppose we have this code:

static int SO_24548957_I()
{
  float t = 36000f; // exactly representable
  float r = 0.08f;  // this is not representable, rounded

  float temporary = t * r;
  int v = (int)temporary;

  return v; // always(?) 2880
}

Everything seems fine, but we decide to refactor the temporary variable away, so we write:

static int SO_24548957_II()
{
  float t = 36000f; // exactly representable
  float r = 0.08f;  // this is not representable, rounded

  int v = (int)(t * r);

  return v; // could be 2880 or 2879 depending on strange things
}

and Bang! the behavior of our program changes. You can see the change on most systems (at least on mine!) if you compile for platform x86 (or Any CPU with Prefer 32-bit selected). Optimizations or not (Release or Debug mode) could be relevant in theory, and the hardware architecture is certainly important too.

It is a complete surprise to many that both 2880 and 2879 can be correct answers on IEEE-754-compliant systems, but read the link I gave.

To elaborate on what is meant by "not representable", let us see what the C# compiler must do when it encounters the literal 0.08f. Because of the way float (32-bit binary floating point) works, it has to choose between:

10737418 / 2**27  ==  0.079 999 998 2...

and

10737419 / 2**27  ==  0.080 000 005 6...

where ** means exponentiation (i.e. "to the power of"). Since the first one is nearer to the desired mathematical value, we must choose that one. So the actual value is a bit smaller than the desired one. Now when we do the multiplication and want to store in a Single again we must also, as a part of the multiplication algorithm, round again to yield the product representation which is closest to the exact "mathematical" product of the (actual) factors 36000 and 0.0799999982.... In this case you are lucky that the nearest Single is actually 2880 exactly, so the multiplication process in our case involves a round-up to this value.
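The choice of 10737418 / 2**27 can be checked by decoding the bits of 0.08f. This is only a sketch: the helper names are made up for illustration, and it assumes a normal (not subnormal, infinite or NaN) float.

```csharp
using System;

static class FloatBits
{
    // Raw 32-bit pattern of a float.
    static int Bits(float x)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(x), 0);
    }

    // 24-bit significand, including the implicit leading 1 (normal floats only).
    public static int Significand(float x)
    {
        return (Bits(x) & 0x7FFFFF) | 0x800000;
    }

    // Power of two that the integer significand is multiplied by.
    public static int Exponent(float x)
    {
        return ((Bits(x) >> 23) & 0xFF) - 127 - 23;
    }

    static void Main()
    {
        // 0.08f == 10737418 * 2^-27
        Console.WriteLine(Significand(0.08f)); // 10737418
        Console.WriteLine(Exponent(0.08f));    // -27
    }
}
```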

Therefore the first code example above gives 2880.

However, in the second code example above, the multiplication might be done (at the choice of the run-time, we cannot really help that) in some CPU hardware that handles many bits (64 or 80, typically). In that case, the product of any two 32-bit floats, like ours, can be calculated without need for rounding the end result, because 64 bits or 80 bits are more than enough to hold the full product of two 32-bit floats. So clearly this product is smaller than 2880 since 0.0799999982... is less than 0.08.

Therefore the second method example above could return 2879.
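Both outcomes can be reproduced with plain integer arithmetic, which avoids any dependence on the floating-point hardware. The sketch below (hypothetical names) works in units of 2**-27, using the fact that 0.08f equals 10737418 / 2**27 and that neighbouring floats around 2880 are 2**-12 apart:

```csharp
using System;

static class ExactProduct
{
    const long Significand = 10737418L; // 0.08f == Significand / 2^27
    const int Shift = 27;

    // Exact value of total * 0.08f, expressed in units of 2^-27.
    public static long Scaled(long total)
    {
        return total * Significand;
    }

    // Integer part of the exact product, i.e. what truncation of the
    // unrounded product gives.
    public static long Truncated(long total)
    {
        return Scaled(total) >> Shift;
    }

    static void Main()
    {
        long shortfall = (2880L << Shift) - Scaled(36000); // distance below 2880
        long halfUlp = 1L << (Shift - 12 - 1);             // half the float spacing near 2880

        Console.WriteLine(Truncated(36000));     // 2879
        Console.WriteLine(shortfall);            // 8640
        Console.WriteLine(halfUlp);              // 16384
        Console.WriteLine(shortfall < halfUlp);  // True: rounding to float lands on 2880
    }
}
```

The exact product is 2880 minus 8640 / 2**27, roughly 2879.99994. Truncating it directly gives 2879, while rounding it to the nearest float first gives exactly 2880.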

For comparison, this code:

static int SO_24548957_III()
{
  float t = 36000f; // exactly representable
  float r = 0.08f;  // this is not representable, rounded

  double temporary = t * (double)r;
  int v = (int)temporary;

  return v; // always(?) 2879
}

always gives 2879, because we explicitly tell the compiler to convert the Single to a Double, which means appending a bunch of binary zeroes, so we get the 2879 case with certainty.

Lessons learned: (1) With binary floating point, factoring out a sub-expression to a temp variable might change the result. (2) With binary floating point, C# compiler settings like x86 vs. x64 might change the result.

Of course, as everybody says everywhere, do not use float or double for monetary applications; use decimal there.

Upvotes: 2

user2819245

Reputation:

Just an addendum to Jeppe's and David's answers regarding the compiler choosing a different precision for an intermediate value.

Your first expression, written in a function like:

static int Calc1(int value)
{
    float v = value * 0.08f;
    return (int) v;
}

will result in the following IL code:

.method private hidebysig static int32  Calc1(int32 'value') cil managed
{
    // Code size       12 (0xc)
    .maxstack  2
    .locals init ([0] float32 v)
    IL_0000:  ldarg.0
    IL_0001:  conv.r4
    IL_0002:  ldc.r4     7.9999998e-002
    IL_0007:  mul
    IL_0008:  stloc.0
    IL_0009:  ldloc.0
    IL_000a:  conv.i4
    IL_000b:  ret
} // end of method Program::Calc1

Note that the instructions stloc.0 and ldloc.0 convert the multiplication result to a float before the final conversion to an int (conv.i4) takes place.

Now let's look at your second expression:

static int Calc2(int value)
{
    return (int)(value * 0.08f);
}

and the corresponding IL code:

.method private hidebysig static int32  Calc2(int32 'value') cil managed
{
    // Code size       10 (0xa)
    .maxstack  8
    IL_0000:  ldarg.0
    IL_0001:  conv.r4
    IL_0002:  ldc.r4     7.9999998e-002
    IL_0007:  mul
    IL_0008:  conv.i4
    IL_0009:  ret
} // end of method Program::Calc2

Note that the result of the multiplication is directly converted to an int.

The multiplication result has the precision as provided by the floating point CPU instructions chosen by the JIT compiler, which most likely will exceed the precision of the float format. Thus, the first code incurs an additional loss of precision due to the float conversion of the multiplication result. The second code does not suffer from this additional precision loss, as it avoids the intermediate float conversion.

(Actually, for the first code example the JIT compiler might be smart enough to instruct the CPU to do floating point arithmetic with single precision only, thus already doing the multiplication with the low single precision.)

You might want to argue that the stloc.0 ldloc.0 combo in the IL code of the first example is pointless and should be optimized away if the compiler were just smart enough. Alas, this is not the case. Look again at the C# code of the first example. There, the source code explicitly demands that the multiplication result be converted into a float value (via the variable v). The stloc.0 ldloc.0 combo is merely the way the compiler chose to honor this demanded float conversion.

Upvotes: 0

David Heffernan

Reputation: 613501

0.08f is not exactly representable. The closest single precision value is

0.07999999821186065673828125

So you actually calculate

36000 * 0.07999999821186065673828125

which is just a little less than 2880. You then truncate the value, and hence receive the value 2879.

This might be the first time you have encountered an issue like this, but I bet you were not expecting that the actual value of 0.08f would be 0.07999999821186065673828125.

Consider this variant:

float f = 36000 * 0.08f;
Console.WriteLine((int)f);
double d1 = 36000 * 0.08f;
Console.WriteLine((int)d1);
double d2 = 36000 * 0.08d;
Console.WriteLine((int)d2);

which outputs

2880
2879
2880

Why do your two variants behave differently? Because the compiler is choosing to store an intermediate value for invoice.total * 0.08f to a precision other than single.

Clearly you are playing with fire here. This behaviour is all down to a fundamental property of floating point arithmetic. Your choice of binary floating point inevitably leads to issues like this. One way to get around this is to round the values to the nearest integer.

float f = 36000 * 0.08f;
Console.WriteLine((int)Math.Round(f));
double d1 = 36000 * 0.08f;
Console.WriteLine((int)Math.Round(d1));
double d2 = 36000 * 0.08d;
Console.WriteLine((int)Math.Round(d2));

which results in

2880
2880
2880

You might also consider using Decimal for calculations like this. That way you operate on decimal rather than binary representations and so will be able to represent all these values exactly.

int vat = (int)(36000 * 0.08m);
Console.WriteLine(vat);

which outputs

2880

Exactly how to solve the problem depends very much on the details of the calculation and your business logic. But the fundamental issue is that binary floating point cannot represent your calculations exactly.
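As one possible shape for such a calculation, here is a sketch that combines decimal with an explicit rounding rule. The 8% rate comes from the question; the class name, method name and the choice of MidpointRounding.AwayFromZero are assumptions made for illustration, and the right rounding rule is a business decision:

```csharp
using System;

static class VatCalculator
{
    const decimal Rate = 0.08m; // exactly representable in decimal

    // Computes the VAT as a whole number, rounding halves away from zero
    // (an assumed policy, not something the question specifies).
    public static int ComputeVat(decimal total)
    {
        decimal exact = total * Rate;
        return (int)Math.Round(exact, 0, MidpointRounding.AwayFromZero);
    }

    static void Main()
    {
        Console.WriteLine(ComputeVat(36000m));    // 2880
        Console.WriteLine(ComputeVat(36006.25m)); // 2881 (2880.5 rounded away from zero)
    }
}
```

Passing the MidpointRounding argument explicitly matters because Math.Round defaults to rounding halves to the nearest even number, which would turn 2880.5 into 2880.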

Upvotes: 1
