Reputation: 20628
I have a question about the pack754()
function defined in Section 7.4 of Beej's Guide to Network Programming.
This function converts a floating point number f
into its IEEE 754 representation where bits
is the total number of bits to represent the number and expbits
is the number of bits used to represent only the exponent.
I am concerned with single-precision floating numbers only, so for this question, bits
is specified as 32
and expbits
is specified as 8
. This implies that 23
bits is used to store the significand (because one bit is sign bit).
My question is about this line of code.
significand = fnorm * ((1LL<<significandbits) + 0.5f);
What is the role of + 0.5f
in this code?
Here is a complete code that uses this function.
#include <stdio.h>
#include <stdint.h> // defines uintN_t types
#include <inttypes.h> // defines PRIx macros
uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
long double fnorm;
int shift;
long long sign, exp, significand;
unsigned significandbits = bits - expbits - 1; // -1 for sign bit
if (f == 0.0) return 0; // get this special case out of the way
// check sign and begin normalization
if (f < 0) { sign = 1; fnorm = -f; }
else { sign = 0; fnorm = f; }
// get the normalized form of f and track the exponent
shift = 0;
while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
fnorm = fnorm - 1.0;
// calculate the binary form (non-float) of the significand data
significand = fnorm * ((1LL<<significandbits) + 0.5f);
// get the biased exponent
exp = shift + ((1<<(expbits-1)) - 1); // shift + bias
// return the final answer
return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
int main(void)
{
float f = 3.1415926;
uint32_t fi;
printf("float f: %.7f\n", f);
fi = pack754(f, 32, 8);
printf("float encoded: 0x%08" PRIx32 "\n", fi);
return 0;
}
What purpose does + 0.5f
serve in this code?
Upvotes: 3
Views: 566
Reputation: 153507
The code is an incorrect attempt at rounding.
long double fnorm;
long long significand;
unsigned significandbits
...
significand = fnorm * ((1LL<<significandbits) + 0.5f); // bad code
The first clue of incorrectness is the f
of 0.5f
, which indicates float
, is a nonsensical introduction of specifying float
in a routine with long double f
and fnorm
. float
math has no application in the function.
Yet adding 0.5f
does not mean that the code is limited to float
math in (1LL<<significandbits) + 0.5f
. See FLT_EVAL_METHOD
which may allow higher precision intermediate results and have fooled the code author in testing.
A rounding attempt does make sense as the argument is long double
and the target representations are narrower. Adding 0.5
is a common approach - but it is not done right here. IMO, the lack of the author commenting here concerning 0.5f
hinted that the intent was "obvious" - not subtle, albeit incorrect.
As commented, moving the 0.5
is closer to being correct for rounding, but may mis-lead some into thinking the addition is done with float
math, (it is long double
math adding a long double
product to float
causes the 0.5f
to be promoted to long double
first).
// closer to rounding but may mislead
significand = fnorm * (1LL<<significandbits) + 0.5f;
// better
significand = fnorm * (1LL<<significandbits) + 0.5L; // or 0.5l or simply 0.5
To round, without calling the preferred <math.h>
rounds routines like rintl(), roundl(), nearbyintl(), llrintl()
, adding the explicit type 0.5 is still a weak attempt at rounding. It is weak because it rounds incorrectly with many cases. The +0.5 trick relies on that sum being exact.
Consider
long double product = fnorm * (1LL<<significandbits);
long long significand = product + 0.5; // double rounding?
product + 0.5
itself may go through a rounding before truncation/assignment to long long
- in effect double rounding.
Best to use the right tool in the C shed of standard library functions.
significand = llrintl(fnorm * (1ULL<<significandbits));
A corner case remains with this rounding is where significand
is now one too great and significand , exp
needs adjustment. As well identified by @Nayuki, code has other short-comings too. Also, it fails on -0.0
.
Upvotes: 3
Reputation: 18533
The + 0.5f
serves no purpose in the code, and may be harmful or misleading.
The expression (1LL<<significandbits) + 0.5f
results in a float
. But even for the small case of significandbits = 23
for single-precision floating-point, the expression evaluates to (float)(223 + 0.5), which rounds to exactly 223 (round half even).
Replacing + 0.5f
with + 0.0f
results in the same behavior. Heck, drop that term entirely, because fnorm
will cause the right-hand side argument of *
to be casted to long double
anyway. This would be a better way to rewrite the line: long long significand = fnorm * (long double)(1LL << significandbits);
Side note: This implementation of pack754()
handles zero correctly (and collapses negative zero to positive zero), but mishandles subnormal numbers (wrong bits), infinities (infinite loop), and NaN (wrong bits). It's best to not treat it as a reference model function.
Upvotes: 2