Range Reduction algorithm for simple precision float

Question

I am trying to implement a small library of mathematical functions for 32-bit floats (simple precision) as part of one of my java projects. When it comes to calculating the sine of very large arguments, I can't manage to implement the Payne-Hanek range-reduction algorithm.

I know there are many threads about this algorithm on stackoverflow, and I've looked at and analyzed various explanatory articles such as ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit (K. C. Ng and the members of the FP group of SunPro), so I think I understand it pretty well for double precision float, but after several days of reflection I still can't implement this algorithm in my 32-bits case. Is there a place where I could find a clear explanation of my particular case, and if not, a similar algorithm for my problem? (For the sinus of average argument x, I already use the Cody-Waite reduction, which however becomes very imprecise above 2^23 (about 10e9)).

Thanks in advance for your help, I'm still a programming novice and this is my first question on StackOverflow. I apologize for the possible mistakes in my English as it is not my first language.

EDIT :

In my implementation attempt, I've taken up the method described in ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit (https://redirect.cs.umbc.edu/~phatak/645/supl/Ng-ArgReduction.pdf), which I've tried to modify for 32-bit floats as follows (I won't re-explain the principle of the algorithm to avoid a lengthy message):

The result is a 23-bit mantissa for float x, with an exponent between -128 and 127. In the case of simple precision, I have considered that the float closest to a multiple of pi/2 y is at least 2^-31 (It might be 30 or 29 but I never explicitly calculated it). For our calculation, we must therefore store at least T bits of 2/pi with :

T = max_exponent + mantissa + bits_for_worst_reduction_case = 23 + 127 + 31 = 181.

Since these bits will be stored as int (32bits), I choose to directly store 192 of them (i.e. 6 blocks).however, I use this algorithm for floats greater than 2^23 (min_exponent = 23). Posing M the number of trailing zeros before the binary point, the only part used among all these stored bits is the bits between the M-1st and the M+bits_for_worst_reduction_case+min_exposant = M+54th

(the reason for using only these bits is very well described in the paper presented above, but quickly :

The bits preceding the M-2nd included are such that x * these bits = 4*N where N is an odd integer by definition of M.

The bits following the M+55th are such that x*these bits <= 2^32 (so it doesn't matter in our calculation, even in the worst-case scenario).

The only thing that remains is to calculate x * B. The method is explained in the article, with the difference that here, in order to obtain perfectly equal result, we need to separate x and B into x_i and b_j blocks of at most 10 bits in length, so that x_i * b_j is a 20 significant bit number and the final multiplication (which for each result block y_i requires 2 additions of x_i * b_j sub-blocks) is a n = 22 significant bit number (so that we have n<23 and those bits can fit in a float mantissa).

Now that we have the results block y_i, we can have the remainder r of x%(pi/2) with :

y_i = y_i%4
y = sum(y_i, 1
t = (y1-y) + sum(y_i, 2
k = round(y)
f = (y-k) + t
r = f*pi/2.

My attempt of implementation is below. There is a few mistakes :

I do not know how to compute the bits of 2OverPi=0.6366... . Should I put them as the int 6366... or as the bits of the float 0.6366...
I can't manage to compute the number of trailing zeros before the binary point of my float. My attempt is in the countTrailingZeros() method.


public static final float piOver2 = 1.57079637051f; //0x3fc90fdb;

    /* 192 bits de Pi/2 for reduction : */
    public static final int TwoOverPiBits[] = {
            0x00000000,
            0x28be60db,
            0x9391054a,
            0x7f09d5f4,
            0x7d4d3770,
            0x36d8a566,
            0x4f10e410
    };

    // return the exponant of a float in simple precision
    public int getExp(float x) {
        int bits = Float.floatToIntBits(x);
        int exp = ((bits >> 23) & 0xFF) - 127;
        return exp;
    }

    // return the mantissa of a float in simple precision
    public int getMant(float x) {
        int bits = Float.floatToIntBits(x);
        int mant = bits & 0x7fffff;
        return mant;
    }

    // return the sign of a float in simple precision
    public int getSgn(float x) {
        int bits = Float.floatToIntBits(x);
        int sgn = (bits >> 31) & 1;
        return sgn;
    }

    // Method to extract 55 bits from TwoOverPiBits for the index k and stock them in 10 bits blocks
    public int[] packBits(int k) {
        int numPackets = 6;
        int[] packets = new int[numPackets];

        int bitIndex = k; // Index of first bit
        for (int i = 0; i < numPackets; i++) {
            packets[i] = extract12Bits(bitIndex);
            bitIndex += 10;
        }

        return packets;
    }

    // Method to extract 10 bits from TwoOverPiBits at a given index
    public int extract12Bits(int bitIndex) {
        int intIndex = bitIndex / 32; // Index de l'entier dans le tableau
        int bitOffset = bitIndex % 32; // Décalage du bit dans l'entier

        int result;
        if (bitOffset <= 22) {
            // All bits are in the same TwoOverPiBits block
            result = (deuxSurPiBits[intIndex] >> bitOffset) & 0x3FF; // Masqk to extract 10 bits
        } else {
            // 10 bits are in two cosecutive TwoOverPiBits blocks
            int bitsFirstPart = deuxSurPiBits[intIndex] >>> bitOffset;
            int bitsSecondPart = deuxSurPiBits[intIndex + 1] & ((1 << (10 - (32 - bitOffset))) - 1);
            result = (bitsSecondPart << (32 - bitOffset)) | bitsFirstPart;
        }

        return result;
    }

    //Methode to separate a float in 4 10 bits blocks
    public int[] extract12BitPackets(float x) {

        int bits = Float.floatToIntBits(x);

        int[] packets = new int[4];

        // Extract the 4 blocks : 32 bits in 2-10-10-10
        // First block : 10 bits but only 2 that are informative : 1 sign bit and the first exposant bit
        // Second block : 7 exposant bits and 3 mantissa bits
        packets[1] = (bits >> 20) & 0x3FF;

        // third block : 10 mantissa bits
        packets[2] = (bits >> 10) & 0x3FF;

        // last block : 10 last mantissa bits
        packets[3] = (bits & 0x3FF);

        return packets;
    }
    
    //Method to count number of trailing zeros before the binary point
    public int countTrailingZeros(int number) {
        if (number == 0) return 32;

        int count = 0;
        while ((number & 1) == 0) {
            count++;
            number >>= 1;
        }
        return count;
    }


    public float payneHanek(float x) {

        if (x<0){
            return -payneHanek(-x);
        }
        int q;
        float f, r;
        int k = getExp(x);
        int mant = getMant(x);
        int M = (k - 23) + countTrailingZeros(mant);

        int[] bits552Overpi = packBits(M - 6); //We begin at M-6 but we then erase the first five bits with the &0x3FF
        int 2Overpibits5 = bits552Overpi[5];
        int 2Overpibits4 = bits552Overpi[4];
        int 2Overpibits3 = bits552Overpi[3];
        int 2Overpibits2 = bits552Overpi[2];
        int 2Overpibits1 = bits552Overpi[1];
        int 2Overpibits0 = bits552Overpi[0] & 0x3FF;

        //Separate x in 10 bits block
        int[] bits32x = extract12BitPackets(x);
        int xbits3 = bits32x[3];
        int xbits2 = bits32x[2];
        int xbits1 = bits32x[1];
        int xbits0 = bits32x[0];

        //x*2OverPi with the blocks
        int y8 = xbits3 * 2Overpibits5;
        int y7 = xbits3 * 2Overpibits4 + xbits2 * 2Overpibits5;
        int y6 = xbits3 * 2Overpibits3 + xbits2 * 2Overpibits4 + xbits1 * 2Overpibits5;
        int y5 = xbits3 * 2Overpibits2 + xbits2 * 2Overpibits3 + xbits1 * 2Overpibits4 + xbits0 * 2Overpibits5;
        int y4 = xbits3 * 2Overpibits1 + xbits2 * 2Overpibits2 + xbits1 * 2Overpibits3 + xbits0 * 2Overpibits4;
        int y3 = xbits3 * 2Overpibits0 + xbits2 * 2Overpibits1 + xbits1 * 2Overpibits2 + xbits0 * 2Overpibits3;
        int y2 = xbits2 * 2Overpibits0 + xbits1 * 2Overpibits1 + xbits0 * 2Overpibits2;
        int y1 = xbits1 * 2Overpibits0 + xbits0 * 2Overpibits1;
        int y0 = xbits0 * 2Overpibits0;


        //Modulo of each blocks : those variables are exact
        y8 = y8 % 4;
        y7 = y7 % 4;
        y6 = y6 % 4;
        y5 = y5 % 4;
        y4 = y4 % 4;
        y3 = y3 % 4;
        y2 = y2 % 4;
        y1 = y1 % 4;
        y0 = y0 % 4;

        //final result :
        float y = y0 + (y1 + (y2 + (y3 + (y4 + (y5 + (y6 + (y7 + y8)))))));
        //error :
        int error = (((((((y1 - y) + y2) + y3) + y4) + y5) + y6) + y7) + y8;
        q = (int) Math.rint(y);
        f = (y - q) + error;
        r = f * piSur2;
        
        //Calculate sin(r)....

Range Reduction algorithm for simple precision float

Answers (0)

Related Questions