Reputation: 20931
I have realized that I didn't pay due attention to the floating-point part of the IEEE 754 standard while sitting at my university desk. However, even if I'm not currently struggling with embedded work, I feel incompetent and not entitled to the engineer title for lacking some of these math calculations and not wholly grasping the standard.
What I know is that 0 and 255 are special exponent-field values used to express zero and infinity. There is an implicit 1 used to extend the 23-bit mantissa to 24 bits; e is treated as 1 (and the implicit bit is dropped) only when the exponent field is 000...0. If the exponent field is 111...1 and the mantissa is 000...0, then it's infinity, and if it is 111...1 and the mantissa is non-zero (XXX...X), then it's not a number.
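As a rough sketch of how I read those special cases (plain standard C++, just to check my understanding; the bit reinterpretation via memcpy is my own assumption of how to inspect a float):
#include <cstdint>
#include <cstring>
#include <cstdio>

// classify a float by its raw 8-bit exponent field and 23-bit mantissa field
const char* classify_bits(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);          // reinterpret the 32 bits
    std::uint32_t e = (u >> 23) & 0xFFu;    // exponent field
    std::uint32_t m = u & 0x7FFFFFu;        // mantissa field
    if (e == 0)   return m ? "denormal (no implicit 1)" : "zero";
    if (e == 255) return m ? "NaN" : "infinity";
    return "normal (implicit leading 1)";
}

int main()
{
    std::printf("%s\n", classify_bits(0.0f));   // zero
    std::printf("%s\n", classify_bits(1.0f));   // normal
}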
What I don't understand is:

- How can we mention -126 and 127, inclusively? How are the total possible 254 values sectioned as the inclusive values?
- Why is 127 selected as the bias value?
- Some sources explain the sectionization as [-126..127] but some as [-125...128]. It is really intricate and perplexing.
- How can we say the minimum is 2^{-126}, if not per the second aforementioned source? If it is, why not 2^{-125}? (I have not been able to get my brain to understand it till now, though struggling :)
- Isn't using the remainder operator with the bias value more logical than subtraction, i.e. 2^{e%127}? (the correction thanks to chux)

Upvotes: 1
Views: 1520
Reputation: 224267
How can we mention -126 and 127, inclusively? How are the total possible 254 values sectioned as the inclusive values?
IEEE 754-2008 3.3 says emin, the minimum exponent, for any format shall be 1−emax, where emax is the maximum exponent. Table 3.2 in that clause says emax for the 32-bit format (named “binary32”) shall be 127. So emin is 1−127 = −126.
There is no mathematical constraint that forces this. The relationship is chosen as a matter of preference. I recall there being a desire to have slightly more positive exponents than negative but do not recall the justification for this.
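Plugging in the binary32 numbers, this works out to:
emax = 127
emin = 1 - emax = 1 - 127 = -126
bias = emax = 127   (for the interchange formats, the bias equals emax)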
Why is 127 selected as the bias value?
Once the bounds above are selected, 127 is necessarily the value needed to encode them in eight bits (as codes 1-254 while leaving 0 and 255 as special codes).
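A quick sketch of that encoding (standard C++, my own illustration): the stored field is just E = e + 127, so e in [-126..127] lands exactly on codes 1..254, leaving 0 and 255 free:
#include <cstdio>

int main()
{
    const int bias = 127;
    for (int e = -126; e <= 127; ++e)   // every normal exponent value
    {
        int E = e + bias;               // stored 8-bit exponent field
        if (E < 1 || E > 254)           // never happens: 0 and 255 stay reserved
            std::printf("unexpected code %d\n", E);
    }
    std::printf("e=-126 -> E=%d, e=+127 -> E=%d\n", -126 + bias, 127 + bias); // E=1, E=254
}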
Some sources explain the sectionization as [-126..127] but some as [-125...128]. It is really intricate and perplexing.
Given bits of a binary32 that are the sign bit S, the eight exponent bits E (which are a binary representation of a number e), and the 23 significand bits F (which are a binary representation of a number f), and given 0 < e < 255, then the following are equivalent to each other:

(−1)^S • 2^(e−127) • (1 + f•2^−23)
(−1)^S • 2^(e−127) • 1.F   (reading “1.F” as a binary numeral)
(−1)^S • 2^(e−126) • (½ + f•2^−24)
(−1)^S • 2^(e−126) • .1F   (reading “.1F” as a binary numeral)
(−1)^S • 2^(e−150) • (2^23 + f)
(−1)^S • 2^(e−150) • 1F    (reading “1F” as a binary integer)
The difference between the first two is just that the first takes the significand bits F, treats them as a binary numeral to get a number f, then divides that number by 2^23 and adds 1, whereas the second uses the 23 significand bits F to write a 24-bit numeral “1.F”, which it then interprets as a binary numeral. These two methods produce the same value.
The difference between the first pair and the second pair is that the first prepares a significand in the half-open interval [1, 2), whereas the second prepares a significand in the half-open interval [½, 1) and adjusts the exponent to compensate. The product is the same.
The difference between the first pair and the third pair is also one of scaling. The third pair scales the significand so that it is an integer. The first form is most commonly seen in discussions of floating-point numbers, but the third form is useful for mathematical proofs because number theory generally works with integers. This form is also mentioned in IEEE 754 in passing, also in clause 3.3.
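As a concrete illustration of the [½, 1) scaling (my own example, assuming float is IEEE binary32): this is the convention C and C++ use in <cfloat>, which is likely where the -125..128 way of stating the exponent range comes from:
#include <cfloat>
#include <cmath>
#include <cstdio>

int main()
{
    // C/C++ model a float as s * 2^e with 0.5 <= s < 1,
    // so the exponent limits read one higher than the IEEE-style -126..127.
    std::printf("FLT_MIN_EXP = %d\n", FLT_MIN_EXP);  // -125 for IEEE binary32
    std::printf("FLT_MAX_EXP = %d\n", FLT_MAX_EXP);  // +128 for IEEE binary32
    // Same set of numbers either way: 2^(FLT_MIN_EXP-1) is exactly FLT_MIN = 2^-126.
    std::printf("%g %g\n", std::ldexp(1.0, FLT_MIN_EXP - 1), (double)FLT_MIN);
}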
How can we say the minimum is 2^{-126}, if not per the second aforementioned source? If it is, why not 2^{-125}? (I have not been able to get my brain to understand it till now, though struggling :)
The minimum positive normal value has S bit 0, E bits 00000001, and F bits 00000000000000000000000. In the first form, this represents +1 • 2^(1−127) • 1 = 2^−126. In the second form, it represents +1 • 2^(1−126) • ½ = 2^−126. In the third form, it represents +1 • 2^(1−150) • 2^23 = 2^−126. So the form is irrelevant; the values represented are the same.
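One way to check this (a small sketch of mine, assuming float is IEEE binary32 and the usual bit reinterpretation): build the pattern S=0, E=00000001, F=0 and compare it with 2^-126:
#include <cstdint>
#include <cstring>
#include <cmath>
#include <cfloat>
#include <cstdio>

int main()
{
    std::uint32_t bits = 0x00800000u;   // S=0, E=00000001, F=00...0
    float x;
    std::memcpy(&x, &bits, sizeof x);   // reinterpret the bits as a float
    // all three print the same value, about 1.17549e-38
    std::printf("%g %g %g\n", (double)x, std::ldexp(1.0, -126), (double)FLT_MIN);
}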
Isn't using the remainder operator with the bias value more logical than subtraction, i.e. 2^{e%127}?
No. That would cause the exponent field values 1 and 128 to map to the same value, and that would waste some encodings. There is no benefit to that.
Additionally, the encoding format is such that all positive floating-point numbers are in the same order as their encodings: increasing the encoding increases the value represented, and vice versa. This relationship would not hold with any sort of wrapped interpretation of the exponent field. (Unfortunately, this is flipped for negative numbers, so comparing the encodings of floating-point numbers as pure integers does not give the same results as comparing the floating-point numbers.)
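A sketch of that ordering property (mine, assuming the usual bit reinterpretation on a same-endian platform):
#include <cstdint>
#include <cstring>
#include <cstdio>

static std::uint32_t bits_of(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);      // the float's encoding as a 32-bit integer
    return u;
}

int main()
{
    float a = 1.5f, b = 2.25f;          // positive: value order matches encoding order
    std::printf("%08X < %08X : %d\n", (unsigned)bits_of(a), (unsigned)bits_of(b),
                bits_of(a) < bits_of(b));     // prints 1
    // negative: the encodings still increase, but the values decrease
    std::printf("%08X < %08X : %d\n", (unsigned)bits_of(-a), (unsigned)bits_of(-b),
                bits_of(-a) < bits_of(-b));   // prints 1, yet -a > -b
}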
Upvotes: 1
Reputation: 51903
Exponent range
For a 32-bit float the raw exponent rexp is 8 bits, <0,255>, and the bias is 127. Excluding the special cases {0,255} we get <1,254>; applying the bias:
expmin = 1-127 = -126
expmax = 254-127 = +127
Denormal values have no implicit 1, so for the minimal number the mantissa is 1, and if the exponent should point to the LSB of the mantissa then we need to shift a few bits more:
expmin = 0-127-(23-1) = -149
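That -149 can be checked directly (my own check, standard C++, assuming denormals are supported):
#include <cmath>
#include <cfloat>
#include <cstdio>

int main()
{
    float tiny = std::ldexp(1.0f, -149);                    // smallest positive denormal
    std::printf("%g\n", (double)tiny);                      // ~1.4013e-45
    std::printf("%g\n", (double)std::ldexp(1.0f, -150));    // rounds to 0: nothing smaller exists
#ifdef FLT_TRUE_MIN
    std::printf("%g\n", (double)FLT_TRUE_MIN);              // same value where the macro exists
#endif
}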
The normal max value has the maximal mantissa, and with the exponent shifted to point at the LSB of the 24-bit mantissa (127-23 = 104):
max = ((2^24)-1)*(2^104) = 2^128 - 2^104 ≈ 3.40e+38
so the real range (denormals included) of float
magnitudes is:
<2^-149 ,2^+128 )
<1.40e-45,3.40e+38>
In most specs and docs only the exponent range for normalized numbers is shown, so:
<2^-126 ,2^+127 >
<1.175e-38,1.701e38>
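The same limits as reported by std::numeric_limits (my own check; the comments assume IEEE binary32):
#include <limits>
#include <cstdio>

int main()
{
    typedef std::numeric_limits<float> fl;
    std::printf("min normal   = %g\n", (double)fl::min());        // 2^-126  ~ 1.17549e-38
    std::printf("min denormal = %g\n", (double)fl::denorm_min()); // 2^-149  ~ 1.4013e-45
    std::printf("max          = %g\n", (double)fl::max());        // (2-2^-23)*2^127 ~ 3.40282e+38
    // C++ states exponents for a [0.5,1) significand, hence the -1:
    std::printf("exp range    = <%d,%d>\n", fl::min_exponent - 1, fl::max_exponent - 1); // <-126,127>
}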
Here is a small C++/VCL example dissecting 32-bit and 64-bit floats:
//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
typedef unsigned __int32 U32;
typedef __int32 S32;
//---------------------------------------------------------------------------
// IEEE 754 double MSW masks
const U32 _f64_sig =0x80000000; // sign
const U32 _f64_exp =0x7FF00000; // exponent
const U32 _f64_exp_sig=0x40000000; // exponent sign
const U32 _f64_exp_bia=0x3FF00000; // exponent bias
const U32 _f64_exp_lsb=0x00100000; // exponent LSB
const U32 _f64_exp_pos= 20; // exponent LSB bit position
const U32 _f64_man =0x000FFFFF; // mantisa
const U32 _f64_man_msb=0x00080000; // mantisa MSB
const U32 _f64_man_bits= 52; // mantisa bits
const double _f64_lsb = 1.7e-308; // abs min number
// IEEE 754 single masks <2^-149,2^+128) <1.40e-45,3.40e+38>
const U32 _f32_sig =0x80000000; // sign
const U32 _f32_exp =0x7F800000; // exponent
const U32 _f32_exp_sig=0x40000000; // exponent sign
const U32 _f32_exp_bia=0x3F800000; // exponent bias
const U32 _f32_exp_lsb=0x00800000; // exponent LSB
const U32 _f32_exp_pos= 23; // exponent LSB bit position
const U32 _f32_man =0x007FFFFF; // mantisa
const U32 _f32_man_msb=0x00400000; // mantisa MSB
const U32 _f32_man_bits= 23; // mantisa bits
const float _f32_lsb = 3.4e-38;// abs min number
//---------------------------------------------------------------------------
void f64_disect(double x)
{
const int h=1; // may be platform dependent MSB/LSB order
const int l=0;
union _f64
{
double f; // 64bit floating point
U32 u[2]; // 2x32 bit uint
} f64;
AnsiString txt="";
U32 man[2];
S32 exp,bias;
char sign='+';
f64.f=x;
bias=_f64_exp_bia>>_f64_exp_pos;
if (f64.u[h]&_f64_sig) sign='-';
exp =(f64.u[h]&_f64_exp)>>_f64_exp_pos;
exp -=bias;
man[h]=f64.u[h]&_f64_man;
man[l]=f64.u[l];
if (exp==-bias ) // zero, denormalized
{
exp-=_f64_man_bits-1; // change exp pointing from msb to lsb (ignoring implicit bit)
txt=AnsiString().sprintf("%c%06X%08Xh>>%4i",sign,man[h],man[l],-exp);
}
else if (exp==+bias+1) // Inf,NaN
{
if ((man[h]|man[l])==0) txt=AnsiString().sprintf("%cInf ",sign);
else txt=AnsiString().sprintf("%cNaN ",sign);
man[h]=0; man[l]=0; exp=0;
}
else{
exp -=_f64_man_bits; // change exp pointing from msb to lsb
man[h]|=_f64_exp_lsb; // implicit msb mantisa bit for normalized numbers
txt=AnsiString().sprintf("%06X",man);
if (exp<0) txt=AnsiString().sprintf("%c%06X%08Xh>>%4i",sign,man[h],man[l],-exp);
else txt=AnsiString().sprintf("%c%06X%08Xh<<%4i",sign,man[h],man[l],+exp);
}
// reconstruct man,exp back to double
double y=double(man[l])*pow(2.0,exp);
y+=double(man[h])*pow(2.0,exp+32.0);
Form1->mm_log->Lines->Add(AnsiString().sprintf("%21.10lf = %s = %21.10lf",x,txt,y));
}
//---------------------------------------------------------------------------
void f32_disect(double x)
{
union _f32 // float bits access
{
float f; // 32bit floating point
U32 u; // 32 bit uint
} f32;
AnsiString txt="";
U32 man;
S32 exp,bias;
char sign='+';
f32.f=x;
bias=_f32_exp_bia>>_f32_exp_pos;
if (f32.u&_f32_sig) sign='-';
exp =(f32.u&_f32_exp)>>_f32_exp_pos;
exp-=bias;
man =f32.u&_f32_man;
if (exp==-bias ) // zero, denormalized
{
exp-=_f32_man_bits-1; // change exp pointing from msb to lsb (ignoring implicit bit)
txt=AnsiString().sprintf("%c%06Xh>>%3i",sign,man,-exp);
}
else if (exp==+bias+1) // Inf,NaN
{
if (man==0) txt=AnsiString().sprintf("%cInf ",sign);
else txt=AnsiString().sprintf("%cNaN ",sign);
man=0; exp=0;
}
else{
exp-=_f32_man_bits; // change exp pointing from msb to lsb
man|=_f32_exp_lsb; // implicit msb mantisa bit for normalized numbers
txt=AnsiString().sprintf("%06X",man);
if (exp<0) txt=AnsiString().sprintf("%c%06Xh>>%3i",sign,man,-exp);
else txt=AnsiString().sprintf("%c%06Xh<<%3i",sign,man,+exp);
}
// reconstruct man,exp back to float
float y=float(man)*pow(2.0,exp);
Form1->mm_log->Lines->Add(AnsiString().sprintf("%21.10f = %s = %21.10f",x,txt,y));
}
//---------------------------------------------------------------------------
//--- Builder: --------------------------------------------------------------
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
{
mm_log->Lines->Add("[Float]\r\n");
f32_disect(123*pow(2.0,-127-22)); // Denormalized
f32_disect(+0.0); // Zero
f32_disect(-0.0); // Zero
f32_disect(+0.0/0.0); // NaN
f32_disect(-0.0/0.0); // NaN
f32_disect(+1.0/0.0); // Inf
f32_disect(-1.0/0.0); // Inf
f32_disect(+123.456); // Normalized
f32_disect(-0.000123); // Normalized
mm_log->Lines->Add("\r\n[Double]\r\n");
f64_disect(123*pow(2.0,-127-22)); // Denormalized
f64_disect(+0.0); // Zero
f64_disect(-0.0); // Zero
f64_disect(+0.0/0.0); // NaN
f64_disect(-0.0/0.0); // NaN
f64_disect(+1.0/0.0); // Inf
f64_disect(-1.0/0.0); // Inf
f64_disect(+123.456); // Normalized
f64_disect(-0.000123); // Normalized
mm_log->Lines->Add("\r\n[Fixed]\r\n");
const int n=10;
float fx=12.345,fy=4.321,fm=1<<n;
int x=float(fx*fm);
int y=float(fy*fm);
mm_log->Lines->Add(AnsiString().sprintf("%7.3f + %7.3f = %8.3f = %8.3f",fx,fy,fx+fy,float(int((x+y) ))/fm));
mm_log->Lines->Add(AnsiString().sprintf("%7.3f - %7.3f = %8.3f = %8.3f",fx,fy,fx-fy,float(int((x-y) ))/fm));
mm_log->Lines->Add(AnsiString().sprintf("%7.3f * %7.3f = %8.3f = %8.3f",fx,fy,fx*fy,float(int((x*y)>>n))/fm));
mm_log->Lines->Add(AnsiString().sprintf("%7.3f / %7.3f = %8.3f = %8.3f",fx,fy,fx/fy,float(int((x/y)<<n))/fm
+float(int(((x%y)<<n)/y))/fm));
}
//---------------------------------------------------------------------------
Which might help you understand a bit more ... If you're interested then look also at this:
exponent bias
It was selected as the middle between the range edges:
bias = (0+255)/2 = 127
to have the range for positive and negative exponents as nearly the same as possible.
modulo
Using exp=rexp%127 will never give you negative values from an unsigned rexp, not to mention that division is a slow operation (at least it was at the time the spec was created)... That is why exp=rexp-bias is used instead.
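A tiny demonstration (mine) of why % never yields the needed negative exponents and also collapses distinct raw exponents, while subtraction keeps them apart:
#include <cstdio>

int main()
{
    unsigned rexp1 = 1, rexp2 = 128;    // two different raw exponent fields
    std::printf("%u %u\n", rexp1 % 127, rexp2 % 127);            // 1 and 1   -> ambiguous, never negative
    std::printf("%d %d\n", (int)rexp1 - 127, (int)rexp2 - 127);  // -126 and 1 -> distinct
}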
Upvotes: 1