alex_noname
alex_noname

Reputation: 32143

What is the mathematical basis of modulo fixed-point algorithm for IEEE 754 numbers

At the end of the question, an code for calculating floating modulo operation (__ieee754_fmod ) from GNU C library is presented. I'm interested in the basic ideas behind this algorithm.

Of particular interest is the code marked / * fix point fmod * /.

    /* fix point fmod */
    n = ix - iy;
    while(n--) {
        hz=hx-hy;
        if(hz<0){hx = hx+hx;}
        else {
            if(hz==0)                /* return sign(x)*0 */
                return Zero[(uint64_t)sx>>63];
            hx = hz+hz;
        }
    }
    hz=hx-hy;
    if(hz>=0) {hx=hz;}

  1. How does fixed-point arithmetic apply to floating-point numbers?

  2. How it can be changed to apply to floating-point numbers in decimal notation?

  3. Does it have some common with this algorithm for finding the remainder ?

while(x >= y){
    auto scaled_y = y;
    while(scaled_y*2 < x)
        scaled_y *= 2;
    x -= scaled_y;
}

Simplified examples or links to detailed descriptions are encouraged.

From GNU C library source code:

typedef union
{
    double value;
    struct
    {
        uint32_t lsw;
        uint32_t msw;
    } parts;
    uint64_t word;
} ieee_double_shape_type;


/* Get all in one, efficient on 64-bit machines.  */
#ifndef EXTRACT_WORDS64
# define EXTRACT_WORDS64(i,d)                                        \
do {                                                                \
  ieee_double_shape_type gh_u;                                        \
  gh_u.value = (d);                                                \
  (i) = gh_u.word;                                                \
} while (0)
#endif

/* Get all in one, efficient on 64-bit machines.  */
#ifndef INSERT_WORDS64
# define INSERT_WORDS64(d,i)                                        \
do {                                                                \
  ieee_double_shape_type iw_u;                                        \
  iw_u.word = (i);                                                \
  (d) = iw_u.value;                                                \
} while (0)
#endif


static const double one = 1.0, Zero[] = {0.0, -0.0,};
double __ieee754_fmod (double x, double y)
{
    int32_t n,ix,iy;
    int64_t hx,hy,hz,sx,i;
    EXTRACT_WORDS64(hx,x);
    EXTRACT_WORDS64(hy,y);
    sx = hx&UINT64_C(0x8000000000000000);        /* sign of x */
    hx ^=sx;                                /* |x| */
    hy &= UINT64_C(0x7fffffffffffffff);        /* |y| */
    /* purge off exception values */
    if(__builtin_expect(hy==0
                        || hx >= UINT64_C(0x7ff0000000000000)
                        || hy > UINT64_C(0x7ff0000000000000), 0))
        /* y=0,or x not finite or y is NaN */
        return (x*y)/(x*y);
    if(__builtin_expect(hx<=hy, 0)) {
        if(hx<hy) return x;        /* |x|<|y| return x */
        return Zero[(uint64_t)sx>>63];        /* |x|=|y| return x*0*/
    }
    /* determine ix = ilogb(x) */
    if(__builtin_expect(hx<UINT64_C(0x0010000000000000), 0)) {
        /* subnormal x */
        for (ix = -1022,i=(hx<<11); i>0; i<<=1) ix -=1;
    } else ix = (hx>>52)-1023;
    /* determine iy = ilogb(y) */
    if(__builtin_expect(hy<UINT64_C(0x0010000000000000), 0)) {        /* subnormal y */
        for (iy = -1022,i=(hy<<11); i>0; i<<=1) iy -=1;
    } else iy = (hy>>52)-1023;
    /* set up hx, hy and align y to x */
    if(__builtin_expect(ix >= -1022, 1))
        hx = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hx);
    else {                /* subnormal x, shift x to normal */
        n = -1022-ix;
        hx<<=n;
    }
    if(__builtin_expect(iy >= -1022, 1))
        hy = UINT64_C(0x0010000000000000)|(UINT64_C(0x000fffffffffffff)&hy);
    else {                /* subnormal y, shift y to normal */
        n = -1022-iy;
        hy<<=n;
    }
    /* fix point fmod */
    n = ix - iy;
    while(n--) {
        hz=hx-hy;
        if(hz<0){hx = hx+hx;}
        else {
            if(hz==0)                /* return sign(x)*0 */
                return Zero[(uint64_t)sx>>63];
            hx = hz+hz;
        }
    }
    hz=hx-hy;
    if(hz>=0) {hx=hz;}
    /* convert back to floating value and restore the sign */
    if(hx==0)                        /* return sign(x)*0 */
        return Zero[(uint64_t)sx>>63];
    while(hx<UINT64_C(0x0010000000000000)) {        /* normalize x */
        hx = hx+hx;
        iy -= 1;
    }
    if(__builtin_expect(iy>= -1022, 1)) {        /* normalize output */
        hx = ((hx-UINT64_C(0x0010000000000000))|((uint64_t)(iy+1023)<<52));
        INSERT_WORDS64(x,hx|sx);
    } else {                /* subnormal output */
        n = -1022 - iy;
        hx>>=n;
        INSERT_WORDS64(x,hx|sx);
        x *= one;                /* create necessary signal */
    }
    return x;                /* exact output */
}

Upvotes: -1

Views: 487

Answers (1)

Eric Postpischil
Eric Postpischil

Reputation: 222679

  1. How does fixed-point arithmetic apply to floating-point numbers?

See below:

    /* fix point fmod */

At this point, I presume (having not examined the prior code thoroughly) ix and iy are the floating-point exponents of the inputs x and y, hx and hy are their significands, and |y| ≤ |x|. The comment “fix point fmod” refers to the fact that this code is working on the significands of the floating-point numbers. However, that is an inapt description; it is not truly a fixed-point fmod, because it does use the exponents to do some scaling, as we will see.

    n = ix - iy;

This sets n to the difference in the exponents.

    while(n--) {

This starts a loop dealing with the different scaling of the numbers due to the exponents.

        hz=hx-hy;

This performs a trial subtraction; after testing the sign, we might or might not use this result. Note that, regardless of the difference in exponents between x and y, hy is some multiple of y relative to hx. That is, even though the significands might not be aligned for a normal subtraction, the number represented by hz has the same residue modulo y that the number represented by hx does, since all numbers that differ by a multiple of y have the same residue.

        if(hz<0){hx = hx+hx;}

If hz is negative, we do not want to deal with hx-hy yet. Instead, hx is shifted left one bit, moving its scaling more toward that of hy. (Recall we are in a loop on n, so, ultimately, shifting hx left n bits will put hx and hy on the same scale.)

        else {
            if(hz==0)                /* return sign(x)*0 */
                return Zero[(uint64_t)sx>>63];
            hx = hz+hz;
        }

If hz is zero, we haven proven x is a multiple of y, so the residue is zero, and that is returned, with an appropriate sign bit. Otherwise, hz is positive. In this case, hx = hz+hz; does two things: First, it replaces hx with hz, which is a valid substitution because hx and hz have the same residue modulo y, but hz is smaller, so we have reduced it somewhat. Second, it shifts hx left one bit, preparing us for the next loop iteration.

    }

This continues iterating on n.

    hz=hx-hy;

This is another trial subtraction; after testing the sign, we might or might not use the result. At this point, hx has been reduced modulo y by the preceding loop, and we are testing to see whether one final subtraction is needed or we are done.

    if(hz>=0) {hx=hz;}

If hz is less than zero, the loop completely reduced hx; it produced a residue less than hy, and we are done: hx contains the residue to use. If hz is greater than or equal to zero, hx was not completely reduced, so we replace it by hz, we contains hx reduced by subtracting hy. This is the final subtraction needed—we know there are no more because we are working in binary and the leading bit of hy is set due to normalization, so hx/hy must be less than two, which then implies hx-hy < hy, and the goal of fmod is to produce a residue less than |y|.

  1. How it can be changed to apply to floating-point numbers in decimal notation?

The algorithm would have to be rewritten considerably to apply to floating-point numbers. In particular, the code above relies on binary in that only one subtraction is needed at each step; the integer part of the quotient hx/hy is always 0 or 1. With a decimal representation, accommodation would have to be made for the fact that the integer part of the quotient could range from 0 to 9, so various multiples of hy would have to be subtracted.

  1. Does it have some common with this algorithm for finding the remainder ?
while(x >= y){
    auto scaled_y = y;
    while(scaled_y*2 < x)
        scaled_y *= 2;
    x -= scaled_y;

Some. That code also subtract multiples of y from x, for the purpose of reducing x modulo y.

Upvotes: 3

Related Questions