double precision in MATLAB

Question

So double precision takes 64 bits in MATLAB. I know that 0 or 1 will take one bit.

But when I type realmax('double') I get a really big number 1.7977e+308. How can this number be saved in only 64 bits?

Would appreciate any clarification. Thanks.

If_You_Say_So · Accepted Answer

This is not a MATLAB question. The 64-bit IEEE 754 double-precision binary floating-point format is laid out like this:

bit layout:
|   0   |   1   |   2   |   ...   |   11   |   12   |   13   |   14   |   ...   |   63   |
|  sign |     exponent(E) (11 bit)         |  fraction         (52 bit)                  |
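To make the layout concrete, here is a minimal sketch in Python that splits a double into the three fields. It assumes the platform's Python float is an IEEE 754 double (true on mainstream platforms); the helper name split_double is made up for this illustration.

```python
import struct

def split_double(x):
    """Return (sign, E, fraction) for a float, following the layout above."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # bit 0 in the layout
    exponent = (bits >> 52) & 0x7FF      # bits 1..11 (11 bits)
    fraction = bits & ((1 << 52) - 1)    # bits 12..63 (52 bits)
    return sign, exponent, fraction

print(split_double(1.0))    # (0, 1023, 0)
print(split_double(-2.0))   # (1, 1024, 0)
```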

The first bit is the sign:

0 => +
1 => -

The next 11 bits are used for the representation of the exponent. Read as an unsigned integer, they hold a value E between 0 and 2^11-1 = 2047. Wait... that does not sound good, because a non-negative exponent could only scale numbers up! To represent both very large and very small numbers, the so-called biased form is used, in which the scale factor is:

2^(E-1023)

where E is the unsigned integer stored in the exponent bits. Say the exponent bits look like these examples:

Bit representation of the exponent:
Bit no:     |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |  9  |  10  |  11  |

Example 1:  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  0   |  1   |
Example 2:  |  0  |  0  |  0  |  0  |  0  |  1  |  0  |  0  |  0  |  0   |  0   |
Example 3:  |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  1   |  1   |
Example 4:  |  1  |  1  |  1  |  1  |  1  |  1  |  1  |  1  |  1  |  1   |  0   |
Example 5:  |  1  |  1  |  1  |  1  |  1  |  1  |  1  |  1  |  1  |  1   |  1   |

Base 10 representation:
Example 1 => E1: 1
Example 2 => E2: 32
Example 3 => E3: 515
Example 4 => E4: 2046
Example 5 => E5: 2047 => Infinity or NaN (**** Special case ****)

Biased form:
Example 1 => 2^(E1-1023) = 2^-1022 <= The smallest possible exponent 
Example 2 => 2^(E2-1023) = 2^-991
Example 3 => 2^(E3-1023) = 2^-508
Example 4 => 2^(E4-1023) = 2^+1023 <= The largest possible exponent 
Example 5 => 2^(E5-1023) = Infinity or NaN
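The first four examples can be checked with a few lines of Python. This is only a sketch: the bit strings are copied from the table above, and the helper name biased_exponent is invented for the illustration.

```python
# Exponent bit patterns from Examples 1-4, written left to right (bit 1 first).
examples = {
    "Example 1": "00000000001",
    "Example 2": "00000100000",
    "Example 3": "01000000011",
    "Example 4": "11111111110",
}

def biased_exponent(bit_string):
    """Return (E, E - 1023) for an 11-character string of exponent bits."""
    E = int(bit_string, 2)   # read the bits as an unsigned integer
    return E, E - 1023       # remove the bias of 1023

for name, bits in examples.items():
    E, power = biased_exponent(bits)
    print(f"{name}: E = {E}, scale = 2^{power}")
```

Example 5 (all ones, E = 2047) is left out on purpose, since that pattern is reserved for Infinity and NaN rather than a power of two.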

When 0 < E < 2047, the number is known as a normalized number, represented by:

Number = (-1)^sign * 2^(E-1023) * (1.F)

but if E is 0, the number is known as a denormalized number, represented by:

Number = (-1)^sign * 2^(E-1022) * (0.F)
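Both formulas can be put into one small Python function. This is a sketch under a few assumptions: the function name decode is hypothetical, the fraction field is passed as a 52-bit integer, and Fraction is used so the arithmetic stays exact instead of being rounded.

```python
from fractions import Fraction

def decode(sign, E, fraction_bits):
    """Exact value of a finite double from its sign, exponent and fraction fields."""
    if not 0 <= E < 2047:
        raise ValueError("E = 2047 encodes Infinity or NaN, not a finite number")
    F = Fraction(fraction_bits, 2**52)                   # the digits after the binary point
    if E == 0:
        magnitude = Fraction(2) ** (-1022) * F           # denormalized: 0.F
    else:
        magnitude = Fraction(2) ** (E - 1023) * (1 + F)  # normalized: 1.F
    return -magnitude if sign else magnitude

print(decode(0, 1023, 0))                 # 1
print(float(decode(0, 1024, 1 << 51)))    # 3.0
```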

Now F is determined by the fraction bits:

// Sum over i = 12, 13, ..... , 63
F = sum(Bit(i) * 2^(11-i))

where Bit(i) refers to the ith bit of the number, so bit 12 contributes 2^-1 and bit 63 contributes 2^-52. Examples:

Bit representation of the fraction:
Bit no:     |  12  |  13  |  14  |  15  |  ... ... ... ...   |  62  |  63  |

Example 1:  |  0   |  0   |  0   |   0  |  0  ... ....   0   |  0   |  1   |
Example 2:  |  1   |  0   |  0   |   0  |  0  ... ....   0   |  0   |  0   |
Example 3:  |  1   |  1   |  1   |   1  |  1  ... ....   1   |  1   |  1   |

F value assuming 0 < E < 2047:
Example 1 => 1.F1 = 1 + 2^-52
Example 2 => 1.F2 = 1 + 2^-1
Example 3 => 1.F3 = 1 + 1 - 2^-52
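The three fraction examples can be reproduced in Python. In this sketch the first fraction bit (bit 12) is weighted by 2^-1 and the last one (bit 63) by 2^-52, which is what the values above imply; the helper name fraction_value is made up for the illustration.

```python
from fractions import Fraction

def fraction_value(fraction_bits):
    """Exact value of F for a 52-character string of fraction bits (bit 12 first)."""
    if len(fraction_bits) != 52:
        raise ValueError("expected exactly 52 fraction bits")
    total = Fraction(0)
    for i, bit in zip(range(12, 64), fraction_bits):
        total += int(bit) * Fraction(1, 2 ** (i - 11))   # bit i weighs 2^-(i-11)
    return total

print(1 + fraction_value("0" * 51 + "1") == 1 + Fraction(1, 2**52))   # True
print(1 + fraction_value("1" + "0" * 51))                             # 3/2
print(1 + fraction_value("1" * 52) == 2 - Fraction(1, 2**52))         # True
```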

"But when I type realmax('double') I get a really big number 1.7977e+308. How can this number be saved in only 64 bits?"

realmax('double')'s binary representation is

|  sign | exponent(E) (11 bit) |            fraction         (52 bit)                 |

   0            11111111110       1111111111111111111111111111111111111111111111111111

Which is

+2^1023 * (1 + (1-2^-52)) = 1.79769313486232e+308
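As a sanity check, the same bit pattern can be rebuilt in Python and compared with the formula. This is a sketch that assumes Python's float is an IEEE 754 double, in which case sys.float_info.max plays the role of realmax('double').

```python
import struct
import sys

# sign (1 bit) + exponent (11 bits) + fraction (52 bits), as written above
bits = "0" + "11111111110" + "1" * 52

# Turn the 64 bits into 8 bytes and reinterpret them as a double.
(value,) = struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))

# The same number from the normalized formula: 2^1023 * (1 + (1 - 2^-52))
formula = 2.0**1023 * (1 + (1 - 2.0**-52))

print(value)                                    # 1.7976931348623157e+308
print(value == formula == sys.float_info.max)   # True
```

So the 64 bits do not store 309 decimal digits. They store a 53-bit significand and a power of two, which is why most of the integers below realmax cannot be represented exactly.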

I took some definitions and examples from this Wikipedia page.
