Sunrise

Reputation: 1321

How does floating-point arithmetic work when one is added to a big number?

If we run this code:

 #include <iostream>
 int main ()
 { 
     using namespace std;
     float a = 2.34E+22f;
     float b = a+1.0f;  
     cout<<"a="<<a<<endl;
     cout<<"b-a="<<b-a<<endl;
     return 0;
 }

Then the result will be 0, because a float has only about 6 significant decimal digits, while the 1.0 would have to land on the 23rd digit of the number. So how does the program determine that there is no room for the 1? What is the algorithm?

Upvotes: 2

Views: 365

Answers (3)

MSalters

Reputation: 180145

When adding two FP numbers, they're first converted to the same exponent. In decimal: 2.34000E+22 + 1.00000E+00 = 2.34000E+22 + 0.00000E+22. In this step, the 1.0 is lost to rounding.

Binary floating point works pretty much the same, except that E+22 is replaced by 2^74 (2.34E+22 lies between 2^74 and 2^75).

Upvotes: 0

mustitz

Reputation: 41

Step by step:

IEEE-754 32-bit binary floating-point format:

    sign         1 bit
    significand 23 bits
    exponent     8 bits

I) float a = 23400000000.f;

Convert 23400000000.f to float:

23,400,000,000 = 101 0111 0010 1011 1111 1010 1010 0000 0000₂
               = 1.0101110010101111111010101000000000₂ • 2^34.

But the significand can store only 23 bits after the point. So we must round:

  1.01011100101011111110101 01000000000₂ • 2^34
≈ 1.01011100101011111110101₂ • 2^34

So, after:

float a = 23400000000.f;

a is equal to 23,399,999,488.

II) float b = a + 1;

a = 10101110010101111111010100000000000₂.
b = 10101110010101111111010100000000001₂
  = 1.0101110010101111111010100000000001₂ • 2^34.

But, again, the significand can store only 23 binary digits after the point. So we must round:

  1.01011100101011111110101 00000000001₂ • 2^34
≈ 1.01011100101011111110101₂ • 2^34

So, after:

float b = a + 1;

b is equal to 23,399,999,488.

III) float c = b - a;

10101110010101111111010100000000000₂ - 10101110010101111111010100000000000₂ = 0

This value can be stored in a float without rounding.

So, after:

float c = b - a;

c is equal to 0.

Upvotes: 1

Mats Petersson

Reputation: 129494

The basic principle is that the two numbers are aligned so that the decimal point is in the same place. I'm using an 11-digit number to make it a little easier to read:

 a = 1.234E+10f;
 b = a+1.0f;

When calculating a + 1.0f, the decimal points need to be lined up:

 1.234E+10f becomes 12340000000.0
 1.0f       becomes           1.0
            + 
            =       12340000001.0

But since it's a float, the 1 on the right is outside the valid range, so the number stored will be 1.234000E+10. Any digits beyond that are lost, because there are just not enough digits.

[Note that with an optimizing compiler this example may still show 1.0 as the difference, because the floating-point unit may use a 64- or 80-bit internal representation. If the calculation is done without storing the intermediate result in a float variable (and a decent compiler can certainly achieve that here), the 1 is not rounded away. With 2.34E+22f, however, a + 1 is guaranteed not to fit in a 64-bit double, and it does not fit in the 80-bit extended format either.]

Upvotes: 1
