Oybek

Reputation: 7243

How is a double-precision floating-point number stored and calculated?

I'm really curious about how a double-precision floating-point number is stored.

These are the things I have figured out so far.

  1. They require 64 bits in memory
  2. They consist of three parts
    • Sign bit (1 bit long)
    • Exponent (11 bits long)
    • Fraction (53 bits; the leading bit is assumed to always be 1, so only 52 are stored, except when the exponent field is all zeros, in which case the leading bit is assumed to be 0)

However, I do not understand what the exponent and the exponent bias are, or what all those formulas on the Wikipedia page mean.

Can anyone explain what all those things are, how they work, and how they are eventually turned into the real number, step by step?

Upvotes: 3

Views: 3698

Answers (3)

s.zen

Reputation: 11

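    #include <cmath>
    #include <iostream>

    int main()
    {
        double num = 5643.0662;
        int sign = 0;                // 0 = positive
        int exponent = 1035;         // biased exponent: 12 + 1023, since 2^12 <= num < 2^13
        int exponent_bias = 1023;
        // Fractional part of the significand: num / 2^12 = 1.3777..., so about 0.3777
        double mantissa = num / 4096.0 - 1.0;

        // value = (-1)^sign * 2^(exponent - bias) * (1 + mantissa)
        double x = std::pow(-1.0, sign) * std::pow(2.0, exponent - exponent_bias) * (1 + mantissa);
        double y = num - x;          // reconstruction error

        std::cout << "\nValue of x is : " << x << std::endl;
        std::cout << "\nValue of y is : " << y << std::endl;

        return 0;
    }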

Upvotes: 0

zambotn

Reputation: 785

  • Sign: 1 if the number is negative, 0 if it is positive.
  • Fraction: the significant digits of the number, written in binary in normalized scientific notation (1.xxx...).
  • Exponent: the exponent e such that fraction * 2^e is equal to the number that I want to represent.
  • Bias: a constant that is added to the exponent before it is stored, so it must be subtracted from the stored exponent to get the true one. It is 1023 in double precision and 127 in single precision.

An example (in single precision, because it is more comfortable for me to write =)): to represent -0.75, the binary representation is -0.11 = -11 * 2^-2 = -1.1 * 2^-1, so:

  • sign = 1
  • fraction = 1 + .1000....
  • biased exponent: -1 + 127 = 126 -> 01111110

So we have -0.75 = 1 01111110 10000000000000000000000 (sign, biased exponent, stored fraction bits).

For addition, you have to align the exponents, and then you can add the fractional parts.

For multiplication you have to

  • add the exponents and subtract the bias
  • multiply the fractional parts (with the implicit leading 1 included)
  • round the result
  • work out the sign (if both operands have the same sign, sign = 0; otherwise sign = 1)

Upvotes: 1

Richard Pennington

Reputation: 19965

Check out the formula a little further down the page:

Except for the above exceptions, the entire double-precision number is described by:

(-1)^sign * 2^(exponent - bias) * 1.mantissa

The formula means that for non-NaN, non-INF, non-zero and non-denormal numbers (which I'll ignore) you take the bits in the mantissa and add an implicit 1 bit at the top. This makes the mantissa 53 bits, in the range 1.0 ... 1.111111...11 (binary). To get the actual value, you multiply the mantissa by 2 to the power of the exponent minus the bias (1023), and either negate the result or not depending on the sign bit.

The number 1.0 has an unbiased exponent of zero (i.e. 1.0 = 1.0 * 2^0), so its biased exponent is 1023 (the bias is just added to the exponent). So 1.0 would be sign = 0, exponent = 1023, mantissa = 0 (remember the hidden mantissa bit).

Putting it all together in hexadecimal, the value would be 0x3FF0000000000000 == 1.0.

Upvotes: 2
