Reputation: 11

Floating point operation in C

We know that in C, the single-precision floating-point range runs from about 1.xxxx * 10^-38 to 3.xxxx * 10^38.

On my lecture slides there is this operation:

(10^10 + 10^30) + (-10^30) ?= 10^10 + (10^30 + -10^30)
10^30 - 10^30 ?= 10^10 + 0

I'm wondering why 10^10 + 10^30 = 10^30 in this case.
My thinking was that, since the range of FP goes down to 10^-38 and up to 10^38, there shouldn't be an overflow, so 10^10 + 10^30 shouldn't end up being 10^30.

Upvotes: 0

Answers (2)

nodakai

Reputation: 8033

The essence is the notion of significant digits. An IEEE 754 float has roughly 7 significant decimal digits. If we use a hypothetical decimal floating-point format with 7 significant digits, the calculation goes like this:

10^10 + 10^30 == 1.000 000 * 10^10 + 1.000 000 * 10^30
              == (0.000 000 000 000 000 000 01 + 1.000 000) * 10^30 (match the exponent part)
              => (0.000 000 + 1.000 000) * 10^30 (round the left operand)
              ==  1.000 000 * 10^30
              == 10^30

Note, however, that the exponent matching and the rounding are done as a single step, i.e. the machine never actually holds 0.000 000 000 000 000 000 01 * 10^30, which has too many significant digits.

By the way, if you experiment with floating-point arithmetic in C, you may find the %a format specifier useful (introduced in C99). But note that printf always implicitly converts float arguments to double.

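#include <stdio.h>

int main(void) {
    /* 1e10f is 10^10 and 1e30f is 10^30, both as float constants. */
    float x = 1e10f, y = 1e30f;
    /* Store the sum in a float so it is rounded to float precision. */
    float sum = x + y;
    /* printf promotes each float argument to double before printing. */
    printf("(%a + %a) == %a == %a\n", x, y, sum, y);
    return 0;
}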

http://ideone.com/WeXe22

Upvotes: 0

da_steve101

Reputation: 283

As said in the comment on your question, the part of the number that stores the digits is finite. It is referred to as the significand.

Consider the following simple 14 bit format:

[sign bit] [ 5 bit exponent] [ 8 bit significand]

Let the bias be 16, i.e. a stored exponent of 16 represents an actual exponent of 0 (so we get a good range of positive and negative powers), and assume there are no implied bits.

Now take two numbers whose magnitudes differ by more than a factor of 2^8, such as 2048 and 0.5.

In our format:

2048 = 2^11 = [0][11011][1000 0000]

0.5 = 2^-1 = [0][01111][1000 0000]

When we add these numbers, we first rewrite the smaller one with the larger one's exponent so that the digits line up. A decimal analogy is:

5 x 10 ^ 3 + 5 x 10 ^ -2 => 5 x 10^3 + 0.00005 x 10 ^ 3

In binary, 0.5 is 2^-12 x 2^11, so the sum needs 12 places after the binary point, which the 8-bit significand can't hold:

2 ^ 11 + 0.000000000001 x 2 ^ 11 = 1.000000000001 x 2 ^ 11

So the result rounds back to 2 ^ 11.

Upvotes: 2
