Reputation: 32143
At the end of the question, an code for calculating floating modulo operation (__ieee754_fmod
) from GNU C library is presented. I'm interested in the basic ideas behind this algorithm.
Of particular interest is the code marked / * fix point fmod * /
.
/* fix point fmod */
n = ix - iy;
while(n--) {
hz=hx-hy;
if(hz<0){hx = hx+hx;}
else {
if(hz==0) /* return sign(x)*0 */
return Zero[(uint64_t)sx>>63];
hx = hz+hz;
}
}
hz=hx-hy;
if(hz>=0) {hx=hz;}
How does fixed-point arithmetic apply to floating-point numbers?
How it can be changed to apply to floating-point numbers in decimal notation?
Does it have some common with this algorithm for finding the remainder ?
while(x >= y){
auto scaled_y = y;
while(scaled_y*2 < x)
scaled_y *= 2;
x -= scaled_y;
}
Simplified examples or links to detailed descriptions are encouraged.
From GNU C library source code:
typedef union
{
double value;
struct
{
uint32_t lsw;
uint32_t msw;
} parts;
uint64_t word;
} ieee_double_shape_type;
/* Get all in one, efficient on 64-bit machines. */
#ifndef EXTRACT_WORDS64
# define EXTRACT_WORDS64(i,d) \
do { \
ieee_double_shape_type gh_u; \
gh_u.value = (d); \
(i) = gh_u.word; \
} while (0)
#endif
/* Get all in one, efficient on 64-bit machines. */
#ifndef INSERT_WORDS64
# define INSERT_WORDS64(d,i) \
do { \
ieee_double_shape_type iw_u; \
iw_u.word = (i); \
(d) = iw_u.value; \
} while (0)
#endif
static const double one = 1.0, Zero[] = {0.0, -0.0,};
double __ieee754_fmod (double x, double y)
{
int32_t n,ix,iy;
int64_t hx,hy,hz,sx,i;
EXTRACT_WORDS64(hx,x);
EXTRACT_WORDS64(hy,y);
sx = hx&UINT64_C(0x8000000000000000); /* sign of x */
hx ^=sx; /* |x| */
hy &= UINT64_C(0x7fffffffffffffff); /* |y| */
/* purge off exception values */
if(__builtin_expect(hy==0
|| hx >= UINT64_C(0x7ff0000000000000)
|| hy > UINT64_C(0x7ff0000000000000), 0))
/* y=0,or x not finite or y is NaN */
return (x*y)/(x*y);
if(__builtin_expect(hx<=hy, 0)) {
if(hx<hy) return x; /* |x|<|y| return x */
return Zero[(uint64_t)sx>>63]; /* |x|=|y| return x*0*/
}
/* determine ix = ilogb(x) */
if(__builtin_expect(hx<UINT64_C(0x0010000000000000), 0)) {
/* subnormal x */
for (ix = -1022,i=(hx<<11); i>0; i<<=1) ix -=1;
} else ix = (hx>>52)-1023;
/* determine iy = ilogb(y) */
if(__builtin_expect(hy<UINT64_C(0x0010000000000000), 0)) { /* subnormal y */
for (iy = -1022,i=(hy<<11); i>0; i<<=1) iy -=1;
} else iy = (hy>>52)-1023;
/* set up hx, hy and align y to x */
if(__builtin_expect(ix >= -1022, 1))
hx = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hx);
else { /* subnormal x, shift x to normal */
n = -1022-ix;
hx<<=n;
}
if(__builtin_expect(iy >= -1022, 1))
hy = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hy);
else { /* subnormal y, shift y to normal */
n = -1022-iy;
hy<<=n;
}
/* fix point fmod */
n = ix - iy;
while(n--) {
hz=hx-hy;
if(hz<0){hx = hx+hx;}
else {
if(hz==0) /* return sign(x)*0 */
return Zero[(uint64_t)sx>>63];
hx = hz+hz;
}
}
hz=hx-hy;
if(hz>=0) {hx=hz;}
/* convert back to floating value and restore the sign */
if(hx==0) /* return sign(x)*0 */
return Zero[(uint64_t)sx>>63];
while(hx<UINT64_C(0x0010000000000000)) { /* normalize x */
hx = hx+hx;
iy -= 1;
}
if(__builtin_expect(iy>= -1022, 1)) { /* normalize output */
hx = ((hx-UINT64_C(0x0010000000000000))|((uint64_t)(iy+1023)<<52));
INSERT_WORDS64(x,hx|sx);
} else { /* subnormal output */
n = -1022 - iy;
hx>>=n;
INSERT_WORDS64(x,hx|sx);
x *= one; /* create necessary signal */
}
return x; /* exact output */
}
Upvotes: -1
Views: 487
Reputation: 222679
- How does fixed-point arithmetic apply to floating-point numbers?
See below:
/* fix point fmod */
At this point, I presume (having not examined the prior code thoroughly) ix
and iy
are the floating-point exponents of the inputs x
and y
, hx
and hy
are their significands, and |y
| ≤ |x
|. The comment “fix point fmod” refers to the fact that this code is working on the significands of the floating-point numbers. However, that is an inapt description; it is not truly a fixed-point fmod
, because it does use the exponents to do some scaling, as we will see.
n = ix - iy;
This sets n
to the difference in the exponents.
while(n--) {
This starts a loop dealing with the different scaling of the numbers due to the exponents.
hz=hx-hy;
This performs a trial subtraction; after testing the sign, we might or might not use this result. Note that, regardless of the difference in exponents between x
and y
, hy
is some multiple of y
relative to hx
. That is, even though the significands might not be aligned for a normal subtraction, the number represented by hz
has the same residue modulo y
that the number represented by hx
does, since all numbers that differ by a multiple of y
have the same residue.
if(hz<0){hx = hx+hx;}
If hz
is negative, we do not want to deal with hx-hy
yet. Instead, hx
is shifted left one bit, moving its scaling more toward that of hy
. (Recall we are in a loop on n
, so, ultimately, shifting hx
left n
bits will put hx
and hy
on the same scale.)
else {
if(hz==0) /* return sign(x)*0 */
return Zero[(uint64_t)sx>>63];
hx = hz+hz;
}
If hz
is zero, we haven proven x
is a multiple of y
, so the residue is zero, and that is returned, with an appropriate sign bit. Otherwise, hz
is positive. In this case, hx = hz+hz;
does two things: First, it replaces hx
with hz
, which is a valid substitution because hx
and hz
have the same residue modulo y
, but hz
is smaller, so we have reduced it somewhat. Second, it shifts hx
left one bit, preparing us for the next loop iteration.
}
This continues iterating on n
.
hz=hx-hy;
This is another trial subtraction; after testing the sign, we might or might not use the result. At this point, hx
has been reduced modulo y
by the preceding loop, and we are testing to see whether one final subtraction is needed or we are done.
if(hz>=0) {hx=hz;}
If hz
is less than zero, the loop completely reduced hx
; it produced a residue less than hy
, and we are done: hx
contains the residue to use. If hz
is greater than or equal to zero, hx
was not completely reduced, so we replace it by hz
, we contains hx
reduced by subtracting hy
. This is the final subtraction needed—we know there are no more because we are working in binary and the leading bit of hy
is set due to normalization, so hx/hy
must be less than two, which then implies hx-hy < hy
, and the goal of fmod
is to produce a residue less than |y
|.
- How it can be changed to apply to floating-point numbers in decimal notation?
The algorithm would have to be rewritten considerably to apply to floating-point numbers. In particular, the code above relies on binary in that only one subtraction is needed at each step; the integer part of the quotient hx
/hy
is always 0 or 1. With a decimal representation, accommodation would have to be made for the fact that the integer part of the quotient could range from 0 to 9, so various multiples of hy
would have to be subtracted.
- Does it have some common with this algorithm for finding the remainder ?
while(x >= y){ auto scaled_y = y; while(scaled_y*2 < x) scaled_y *= 2; x -= scaled_y;
Some. That code also subtract multiples of y
from x
, for the purpose of reducing x
modulo y
.
Upvotes: 3