Reputation: 753
So double precision takes 64 bits in MATLAB. I know that 0 or 1 will take one bit.
But when I type realmax('double') I get a really big number 1.7977e+308. How can this number be saved in only 64 bits?
Would appreciate any clarification. Thanks.
Upvotes: 0
Views: 292
Reputation: 1283
This is not a MATLAB-specific question. A 64-bit IEEE 754 double-precision binary floating-point number is stored in this format:
bit layout:
| bit 0 | bits 1 ... 11         | bits 12 ... 63     |
| sign  | exponent(E) (11 bit) | fraction (52 bit)  |
The first bit is the sign:
0 => +
1 => -
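To make the layout concrete, here is a minimal Python sketch that splits a double into its three fields. It assumes Python's `float` is an IEEE 754 double (true on common platforms); `split_double` is a name made up for this illustration.

```python
import struct

def split_double(x):
    """Return the (sign, exponent, fraction) fields of a double as integers."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # bit 0 of the layout (most significant)
    exponent = (bits >> 52) & 0x7FF     # the next 11 bits
    fraction = bits & ((1 << 52) - 1)   # the last 52 bits
    return sign, exponent, fraction

print(split_double(1.0))    # (0, 1023, 0)
print(split_double(-2.0))   # (1, 1024, 0)
```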
The next 11 bits are used for the representation of the exponent. Read as an unsigned integer, 11 bits give values from 0 to 2^11-1 = 2047. Wait... that does not sound good, because we also need negative exponents for very small numbers! To cover both large and small magnitudes, the so-called biased form is used, in which the exponent contributes the factor:
2^(E-1023)
where E is the unsigned integer stored in the exponent bits. For example, suppose the exponent bits look like this:
Bit representation of the exponent:
Bit no: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Example 1: | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Example 2: | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Example 3: | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Example 4: | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
Example 5: | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Base 10 representation:
Example 1 => E1: 1
Example 2 => E2: 32
Example 3 => E3: 515
Example 4 => E4: 2046
Example 5 => E5: 2047 => Infinity or NaN (**** Special case ****)
Biased form:
Example 1 => 2^(E1-1023) = 2^-1022 <= The smallest possible exponent
Example 2 => 2^(E2-1023) = 2^-991
Example 3 => 2^(E3-1023) = 2^-508
Example 4 => 2^(E4-1023) = 2^+1023 <= The largest possible exponent
Example 5 => 2^(E5-1023) = Infinity or NaN
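The five examples can be checked with a few lines of Python. This is only an illustrative sketch; `BIAS`, `examples` and `unbiased_exponent` are names invented here.

```python
BIAS = 1023

# Exponent bit patterns from the examples, bit 1 first.
examples = {
    "Example 1": "00000000001",
    "Example 2": "00000100000",
    "Example 3": "01000000011",
    "Example 4": "11111111110",
    "Example 5": "11111111111",
}

def unbiased_exponent(bits):
    """Return (E, E - BIAS); the second item is None for the reserved E = 2047."""
    E = int(bits, 2)  # read the 11 bits as an unsigned integer
    return E, (None if E == 2047 else E - BIAS)

for name, bits in examples.items():
    E, power = unbiased_exponent(bits)
    label = "Infinity or NaN" if power is None else f"2^{power}"
    print(name, E, label)
```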
When 0 < E < 2047, the number is called a normalized number and is given by:
Number = (-1)^sign * 2^(E-1023) * (1.F)
but if E is 0, the number is called a denormalized number and is given by:
Number = (-1)^sign * 2^(-1022) * (0.F)
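As a rough sketch, the two formulas can be turned into a small Python decoder and compared against the machine's own interpretation of the same bits. Here F is the 52-bit fraction field scaled by 2^-52. The names `to_bits` and `decode` are made up for this example, and exact rational arithmetic is used only to avoid rounding in the intermediate steps.

```python
import struct
from fractions import Fraction

def to_bits(x):
    """Reinterpret a double as an unsigned 64-bit integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits

def decode(bits):
    """Evaluate a 64-bit pattern using the normalized/denormalized formulas."""
    sign = bits >> 63
    E = (bits >> 52) & 0x7FF
    F = Fraction(bits & ((1 << 52) - 1), 1 << 52)  # value of 0.F
    if E == 2047:
        raise ValueError("E = 2047 is reserved for Infinity or NaN")
    if E == 0:
        magnitude = Fraction(2) ** -1022 * F              # denormalized: 0.F
    else:
        magnitude = Fraction(2) ** (E - 1023) * (1 + F)   # normalized: 1.F
    value = float(magnitude)
    return -value if sign else value

print(decode(to_bits(1.0)))   # 1.0
print(decode(1))              # 5e-324, the smallest denormalized number
```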
Now F is the value determined by the fraction bits:
// Sum over i = 12, 13, ..... , 63
F = sum(Bit(i) * 2^(11-i))
where Bit(i) refers to the ith bit of the number, so bit 12 contributes 2^-1 and bit 63 contributes 2^-52. Examples:
Bit representation of the fraction:
Bit no: | 12 | 13 | 14 | 15 | ... ... ... ... | 62 | 63 |
Example 1: | 0 | 0 | 0 | 0 | 0 ... .... 0 | 0 | 1 |
Example 2: | 1 | 0 | 0 | 0 | 0 ... .... 0 | 0 | 0 |
Example 3: | 1 | 1 | 1 | 1 | 1 ... .... 1 | 1 | 1 |
F value assuming 0 < E < 2047:
Example 1 => 1.F1 = 1 + 2^-52
Example 2 => 1.F2 = 1 + 2^-1
Example 3 => 1.F3 = 1 + 1 - 2^-52
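These three values can be reproduced exactly with Python's `fractions` module. The helper `fraction_value` below is a hypothetical name; it takes the 52 fraction bits as a string, bit 12 first.

```python
from fractions import Fraction

def fraction_value(bits52):
    """Exact value of F for a string of 52 fraction bits (bit 12 first)."""
    if len(bits52) != 52:
        raise ValueError("expected exactly 52 fraction bits")
    # The first fraction bit is worth 2^-1, the next 2^-2, ..., the last 2^-52.
    return sum(Fraction(int(b), 2 ** (k + 1)) for k, b in enumerate(bits52))

f1 = fraction_value("0" * 51 + "1")   # Example 1
f2 = fraction_value("1" + "0" * 51)   # Example 2
f3 = fraction_value("1" * 52)         # Example 3

print(f1 == Fraction(1, 2 ** 52))      # True
print(f2 == Fraction(1, 2))            # True
print(f3 == 1 - Fraction(1, 2 ** 52))  # True
```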
But when I type realmax('double') I get a really big number 1.7977e+308. How can this number be saved in only 64 bits?
The binary representation of realmax('double') is
| sign | exponent(E) (11 bit) | fraction (52 bit) |
0 11111111110 1111111111111111111111111111111111111111111111111111
Which is
+2^1023 x (1 + (1-2^-52)) = 1.79769313486232e+308
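The same bit pattern can be rebuilt and checked outside MATLAB. The Python sketch below assumes `float` is an IEEE 754 double, in which case `sys.float_info.max` plays the role of realmax('double').

```python
import struct
import sys

sign = "0"
exponent = "11111111110"   # E = 2046, the largest non-special exponent
fraction = "1" * 52        # all 52 fraction bits set

# Pack the 64 bits into 8 bytes and reinterpret them as a double.
bits = int(sign + exponent + fraction, 2)
(value,) = struct.unpack(">d", bits.to_bytes(8, "big"))

print(value)                               # 1.7976931348623157e+308
print(value == sys.float_info.max)         # True
print(value == (2 - 2**-52) * 2.0**1023)   # True
```

So the 64 bits do not store 309 decimal digits; they store a sign, a power of two and a 52-bit fraction.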
I took some definitions and examples from this Wikipedia page.
Upvotes: 3