Reputation: 1321
If we run this code:
#include <iostream>

int main()
{
    using namespace std;
    float a = 2.34E+22f;
    float b = a + 1.0f;              // 1.0f is far below the precision of a
    cout << "a=" << a << endl;
    cout << "b-a=" << b - a << endl; // prints b-a=0
    return 0;
}
Then b-a will be 0, because a float has only about 6 significant decimal digits, while the 1.0 would have to be added at the 23rd digit of the number. So how does the program realize that there is no room for the 1? What is the algorithm?
Upvotes: 2
Views: 365
Reputation: 180145
When adding two FP numbers, they're first converted to the same exponent. In decimal:
2.34000E+22 + 1.00000E+00 = 2.34000E+22 + 0.00000E+22
In this step, the 1.0 is lost to rounding.
Binary floating point works pretty much the same, except that the shared exponent is a power of two: for this value, E+22 is replaced by 2^74.
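To make the "same exponent, then round" idea concrete, here is a rough sketch that performs the addition by hand on integer significands. It is only an illustration under simplifying assumptions (both inputs positive and normal, round-to-nearest-even, no overflow handling), not how any particular FPU is implemented. The name add_by_alignment and the 32 guard bits are choices made for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper (not from the answer): adds two positive, normal floats
// by hand, using round-to-nearest-even. Overflow and subnormals are ignored.
float add_by_alignment(float x, float y) {
    assert(x > 0.0f && y > 0.0f);
    if (x < y) std::swap(x, y);              // make x the larger operand

    // frexp returns m in [0.5, 1) with value == m * 2^e.
    // Scaling m by 2^24 yields the 24-bit integer significand.
    int ex = 0, ey = 0;
    std::uint64_t mx = static_cast<std::uint64_t>(std::ldexp(std::frexp(x, &ex), 24));
    std::uint64_t my = static_cast<std::uint64_t>(std::ldexp(std::frexp(y, &ey), 24));

    const int guard = 32;                    // extra low bits kept for rounding
    mx <<= guard;
    my <<= guard;

    // Step 1: align. Shift the smaller significand right by the exponent gap.
    const int shift = ex - ey;
    bool sticky = false;                     // true if nonzero bits fell off the end
    if (shift >= 64) {
        sticky = true;                       // the whole operand is shifted out
        my = 0;
    } else if (shift > 0) {
        sticky = (my & ((std::uint64_t{1} << shift) - 1)) != 0;
        my >>= shift;
    }

    // Step 2: add the aligned significands.
    const std::uint64_t sum = mx + my;

    // Step 3: round back to 24 bits (one more bit is dropped after a carry).
    const int drop = (sum >> (24 + guard)) ? guard + 1 : guard;
    const std::uint64_t half = std::uint64_t{1} << (drop - 1);
    const std::uint64_t rest = sum & ((std::uint64_t{1} << drop) - 1);
    std::uint64_t q = sum >> drop;
    if (rest > half || (rest == half && (sticky || (q & 1)))) ++q;

    return static_cast<float>(std::ldexp(static_cast<double>(q), ex - 24 - guard + drop));
}
```

Calling add_by_alignment(2.34E+22f, 1.0f) takes the shift >= 64 branch: the exponent gap is 74 bits, so the 1.0 is shifted out entirely and the result is the larger operand unchanged.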
Upvotes: 0
Reputation: 41
Step by step:
IEEE-754 32-bit binary floating-point format:
sign: 1 bit, exponent: 8 bits, significand: 23 bits (plus an implicit leading 1)
I) float a = 23400000000.f;
Convert 23400000000.f to float:
23,400,000,000 = 101 0111 0010 1011 1111 1010 1010 0000 0000₂ = 1.0101110010101111111010101000000000₂ • 2^34.
But the significand can store only 23 bits after the point. So we must round:
1.01011100101011111110101 01000000000₂ • 2^34 ≈ 1.01011100101011111110101₂ • 2^34
So, after:
float a = 23400000000.f;
a is equal to 23,399,999,488 (the 512 that was rounded away is subtracted from 23,400,000,000).
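As a cross-check of step I, the stored fields can be read back from the float itself. This is a minimal sketch that assumes float is the 32-bit IEEE-754 format described above; FloatFields, decode and value_of are names invented for the illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical container for the three bit fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits after the binary point
};

// Copies the raw bits out of the float and splits them into fields.
FloatFields decode(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Rebuilds the exact value of a normal number from its fields.
double value_of(const FloatFields& p) {
    assert(p.exponent != 0 && p.exponent != 255);    // normal numbers only
    const double significand = static_cast<double>(0x800000u | p.fraction); // implicit 1
    const double magnitude = std::ldexp(significand, static_cast<int>(p.exponent) - 127 - 23);
    return p.sign ? -magnitude : magnitude;
}
```

For decode(23400000000.f) the exponent field is 161 (34 + 127), the fraction is 0x2E57F5, which is the 23-bit pattern 01011100101011111110101, and value_of returns 23399999488.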
II) float b = a + 1;
a = 10101110010101111111010100000000000₂. The exact sum is a + 1 = 10101110010101111111010100000000001₂ = 1.0101110010101111111010100000000001₂ • 2^34.
But, again, the significand can store only 23 binary digits after the point. So we must round:
1.01011100101011111110101 00000000001₂ • 2^34 ≈ 1.01011100101011111110101₂ • 2^34
So, after:
float b = a + 1;
b is equal to 23,399,999,488, the same value as a.
III) float c = b - a;
10101110010101111111010100000000000₂ - 10101110010101111111010100000000000₂ = 0
This value can be stored in a float without rounding.
So, after:
float c = b - a;
c is equal to 0.
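Steps II and III can also be tried with other increments. The sketch below is illustrative only: difference_after_adding is a made-up name, and volatile is used as one way to force each intermediate result to be rounded to a real 32-bit float. Near 2.34E+10 adjacent floats are 2048 apart, so an increment below half of that disappears, while one above half is rounded up to a full step.

```cpp
#include <cassert>

// Hypothetical helper: computes (a + delta) - a with every intermediate
// result stored in a float, as in steps II and III.
float difference_after_adding(float a, float delta) {
    assert(delta >= 0.0f);            // this sketch only looks at non-negative increments
    volatile float b = a + delta;     // step II: the sum is rounded to float
    volatile float c = b - a;         // step III: the difference is exact here
    return c;
}
```

With a = 23400000000.f, an increment of 1.0f or 1023.0f gives 0, while 1025.0f or 2048.0f gives 2048.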
Upvotes: 1
Reputation: 129494
The basic principle is that the two numbers are aligned so that the decimal point is in the same place. I'm using a smaller number (E+10 rather than E+22) to make it a little easier to read:
a = 1.234E+10f;
b = a+1.0f;
When calculating a + 1.0f, the decimal points need to be lined up:
1.234E+10f becomes   12340000000.0
1.0f becomes       +           1.0
                   = 12340000001.0
But since it's a float, the 1 on the right falls beyond the digits a float can hold, so the number stored will be 1.234000E+10; any digits beyond that are lost, because there are just not enough digits.
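How many digits "fit" can be measured directly as the distance from a float to the next representable float. A small sketch using std::nextafter from the standard library follows; gap_above is a name made up for this example, and it assumes a positive, finite input.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: distance from a positive finite float to the next larger float.
float gap_above(float x) {
    assert(std::isfinite(x) && x > 0.0f);
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

For 1.234E+10f the gap is 1024, and for 2.34E+22f it is 2^51 (about 2.25E+15), so adding 1.0f cannot move either value to a different float.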
[Note that with an optimizing compiler this example may still show 1.0 as the difference, because the floating-point unit may use a 64- or 80-bit internal representation and the calculation may be done without storing the intermediate results in a float variable (a decent compiler can certainly achieve that here). With 2.34E+22f, however, the sum is guaranteed not to fit in a 64-bit float, nor in an 80-bit one, whose 64-bit significand is still shorter than the 75 bits the exact sum needs.]
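The effect of a wider intermediate format can be imitated by doing the arithmetic in double on purpose. This is only a sketch of the idea in the note, not a reproduction of what a given compiler does; diff_in_double is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: computes (a + 1) - a with the intermediate sum held
// in a 64-bit double instead of a 32-bit float.
double diff_in_double(float a) {
    assert(std::isfinite(a));
    const double wide_a = static_cast<double>(a);   // float to double is exact
    volatile double sum = wide_a + 1.0;             // rounded to double, not to float
    return sum - wide_a;
}
```

diff_in_double(1.234E+10f) returns 1.0, because a double's 53-bit significand has room for the extra 1, while diff_in_double(2.34E+22f) still returns 0.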
Upvotes: 1