Spencer D

Reputation: 3486

Accuracy of Adding Floats vs. Multiplying Float by Integer

In my computer science course, we are doing a study of floating point numbers and how they are represented in memory. I already understand how they are represented in memory (the mantissa/significand, the exponent and its bias, and the sign bit), and I understand how floats are added and subtracted from each other (denormalization and all of that fun stuff). However, while looking over some study questions, I noticed something that I cannot explain.

When a float that cannot be precisely represented is added to itself several times, the answer is lower than we would mathematically expect, but when that same float is multiplied by an integer, the answer comes out precisely to the correct number.

Here is an example from our study questions (the example is written in Java, and I have edited it down for simplicity):

float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;
float p = min + (width * count);

In this example, we are told that the result comes out to exactly 10.0. However, if we look at this problem as a sum of floats, we get a slightly different result:

float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;

for(float p=min; p <= max; p += width){
    System.out.printf("%f%n", p);
}

We are told that the final value of p in this test is ~9.999999 with a difference of -9.536743E-7 between the last value of p and the value of max. From a logical standpoint (knowing how floats work), this value makes sense.

The thing that I do not understand, though, is why we get exactly 10.0 for the first example. Mathematically, it makes sense that we would get 10.0, but knowing how floats are stored in memory, it does not make sense to me. Could anyone explain why we get a precise and exact value by multiplying an imprecise float with an int?

EDIT: To clarify, in the original study questions, some of the values are passed to the function and others are declared outside of the function. My code examples are shortened and simplified versions of the study question examples. Because some of the values are passed into the function rather than being explicitly defined as constants, I believe simplification/optimization at compile time can be ruled out.

Upvotes: 3

Views: 2603

Answers (2)

tmyklebu

Reputation: 14205

First, some nitpicking:

When a float that cannot be precisely represented

There is no "float that cannot be precisely represented." All floats can be precisely represented as floats.

is added to itself several times, the answer is lower than we would mathematically expect,

When you add a number to itself several times, you can actually get something higher than you might expect. I will use C99 hexfloat notation. Consider f = 0x1.000006p+0f. Then f+f = 0x1.000006p+1f, f+f+f = 0x1.800008p+1f, f+f+f+f = 0x1.000006p+2f, f+f+f+f+f = 0x1.400008p+2f, f+f+f+f+f+f = 0x1.80000ap+2f, and f+f+f+f+f+f+f = 0x1.c0000cp+2f. However, 7.0*f = 0x1.c0000a8p+2, which rounds to 0x1.c0000ap+2f, less than f+f+f+f+f+f+f.
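The hexfloat values above can be checked in Java, which also accepts hexadecimal floating-point literals. This is only a sketch: the class name and the addRepeatedly helper are made up for illustration, and it assumes the default round-to-nearest float arithmetic that Java uses.

```java
class RepeatedAddVsMultiply {
    // Hypothetical helper: adds f to itself n times in float arithmetic,
    // so the running sum is rounded to float after every single addition.
    static float addRepeatedly(float f, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        float f = 0x1.000006p+0f;

        float summed = addRepeatedly(f, 7); // seven roundings in total
        float multiplied = 7 * f;           // one rounding of the exact product

        System.out.println(Float.toHexString(summed));
        System.out.println(Float.toHexString(multiplied));
        System.out.println(summed > multiplied); // true: the sum ends up higher
    }
}
```

The sum drifts above the single-step product because each intermediate addition rounds on its own, and here several of those roundings go upward.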

but when that same float is multiplied by an integer, the answer comes out precisely to the correct number.

7 * 0x1.000006p+0f cannot be represented as an IEEE float. It therefore gets rounded. With the default rounding mode of round-to-nearest-with-ties-going-to-even, you get the closest float to your exact result when you do a single arithmetic operation like this.

The thing that I do not understand, though, is why we get exactly 10.0 for the first example. Mathematically, it makes sense that we would get 10.0, but knowing how floats are stored in memory, it does not make sense to me. Could anyone explain why we get a precise and exact value by multiplying an imprecise float with an int?

To answer your question, you get different results because you did different operations. It's a bit of a fluke that you got the "right" answer here.
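To see where the fluke happens, the two computations from the question can be traced side by side. The sketch below reuses the question's values; the class name and the accumulate helper are hypothetical. The exact product of width and count lies just under 9, and the nearest float to it happens to be 9.0f itself, so the single rounding cancels the error in width. The loop instead rounds after each of its ten additions.

```java
class FlukeTrace {
    // Hypothetical helper: adds width to start the given number of times,
    // rounding to float after every addition, as the loop in the question does.
    static float accumulate(float start, float width, int steps) {
        float p = start;
        for (int i = 0; i < steps; i++) {
            p += width;
        }
        return p;
    }

    public static void main(String[] args) {
        float max = 10.0f;
        float min = 1.0f;
        int count = 10;

        float width = (max - min) / count; // nearest float to 0.9, slightly below it
        float product = width * count;     // one rounding, lands exactly on 9.0f
        float viaMultiply = min + product; // 1 + 9 is exact
        float viaLoop = accumulate(min, width, count);

        System.out.println(Float.toHexString(width));
        System.out.println(product == 9.0f);                // true
        System.out.println(viaMultiply == max);             // true
        System.out.println(viaLoop == Math.nextDown(max));  // true: one ulp below 10
        System.out.println(viaLoop - max); // the question reports -9.536743E-7
    }
}
```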

Let's switch the numbers around. If I compute 0x1.800002p+0f / 3, I get 0x1.00000155555...p-1, which rounds to 0x1.000002p-1f. When I triple that, I get 0x1.800003p+0f, which rounds (since we break ties to even) to 0x1.800004p+0f. This is the same result as I'd get if I compute f+f+f in float arithmetic where f = 0x1.000002p-1f.
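The same counterexample can be reproduced in Java as a quick check. Again this is a sketch with made-up names (DivideThenTriple, triple, addThree), assuming ordinary round-to-nearest float arithmetic.

```java
class DivideThenTriple {
    // Hypothetical helpers so the two ways of tripling can be compared.
    static float triple(float f) {
        return 3 * f;       // one multiplication, one rounding
    }

    static float addThree(float f) {
        return f + f + f;   // f + f is exact here, the second addition rounds
    }

    public static void main(String[] args) {
        float x = 0x1.800002p+0f;
        float f = x / 3;    // rounds to 0x1.000002p-1f

        System.out.println(Float.toHexString(f));
        System.out.println(Float.toHexString(triple(f)));
        System.out.println(Float.toHexString(addThree(f)));
        System.out.println(triple(f) == x);           // false: dividing then tripling does not give x back
        System.out.println(triple(f) == addThree(f)); // true: both routes agree here
    }
}
```

So with these inputs, multiplying by the integer is no more "exact" than adding: both land on 0x1.800004p+0f rather than on the starting value.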

Upvotes: 4

Rob11311

Reputation: 1416

Because 1.0 + ((10.0 - 1.0) / 10.0) * 10.0 does only 1 calculation with inexact values, thus 1 rounding error, it is more accurate than doing 10 additions of the float representation of 0.9. I think that is the principle this example is intended to teach.

The key issue is that 0.1 cannot be represented exactly in floating point. So 0.9 has errors in it, which add up in the function loop.
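One way to see the representation error directly is to print the exact decimal value that a float actually stores. The sketch below uses java.math.BigDecimal for that; the class name and the exactDecimal helper are hypothetical.

```java
import java.math.BigDecimal;

class ExactFloatValues {
    // Hypothetical helper: the float is widened to double without loss, and
    // new BigDecimal(double) gives the exact decimal expansion of that value.
    static String exactDecimal(float f) {
        return new BigDecimal(f).toString();
    }

    public static void main(String[] args) {
        System.out.println(exactDecimal(0.1f)); // 0.100000001490116119384765625
        System.out.println(exactDecimal(0.9f)); // 0.89999997615814208984375

        // The width computed in the question is the same float as 0.9f.
        float width = (10.0f - 1.0f) / 10;
        System.out.println(width == 0.9f);      // true
    }
}
```

The stored 0.9f is slightly below 0.9, so every addition in the loop starts from an operand that is already a little too small, before any rounding of the sums is taken into account.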

The "exact" number is probably shown that way because of a clever output formatting routine. When I first used computers, they loved to print such numbers in an absurd fixed-digit scientific format, which was not human friendly.

To understand what's going on, I suggest finding Koenig's Dr Dobbs blog posts on this topic. They are an enlightening read, and the series culminates by showing how languages like Perl, Python and probably Java make calculations look exact if they're precise enough.

Koenig's Dr Dobbs article on floating point

Even Simple Floating-Point Output Is Complicated

Don't be too surprised if fixed-point arithmetic gets added to CPUs 5-10 years out; financial people like sums to be exact.

Upvotes: 2
