atan2 approximation with 11bits in mantissa on x86(with SSE2) and ARM(with vfpv4 NEON)

Question

I'm trying to implement fast atan2(float) with 11 bits accuracy in mantissa. The atan2 implementation will be used for image processing. So it might be better to be implemented with SIMD instruction(The impl targeting x86(with SSE2) & ARM(with vpfv4 NEON)).

For now, I use chebyshev polynomial approximation(https://jp.mathworks.com/help/fixedpoint/examples/calculate-fixed-point-arctangent.html).

I'm willing to implement vectorized code manually. I will use SSE2(or later) & NEON wrapper library. I plan or tried these implementation option

vectorized Polynomial approximation
- chebyshev polynomial(now implemented)
scalar look up table ( + linear interpolation)
vectorized look up table (is this possible? I'm not familiar with NEON, so I don' know some instructions like VGATHER in NEON)
vectorized CORDIC method (this may be slow since it require 12+ rotation iterations to get 11bits accuracy)

else good idea?

njuffa · Accepted Answer

The first thing you would want to check is whether your compiler is able to vectorize atan2f (y,x) when applied to an array of floats. This usually requires at least a high optimization level such as -O3 and possibly specifying a relaxed "fast math" mode, in which errno handling, denormals and special inputs such as infinities and NaNs are largely ignored. With this approach, accuracy may be well in excess of what is required, but it may be hard to beat a carefully tuned library implementation with respect to performance.

The next thing to try is to write a simple scalar implementation with sufficient accuracy, and have the compiler vectorize it. Typically this means avoiding anything but very simple branches which can be converted to branchless code through if-conversion. An example of such code is the fast_atan2f() shown below. With the Intel compiler, invoked as icl /O3 /fp:precise /Qvec_report=2 my_atan2f.c, this is vectorized successfully: my_atan2f.c(67): (col. 9) remark: LOOP WAS VECTORIZED. Double checking the generated code through disassembly shows that fast_atan2f() has been inlined and vectorized using SSE instructions of the *ps flavor.

If all else fails, you could translate the code of fast_atan2() into platform-specific SIMD intrinsics by hand, which should not be all that hard to do. I do not have sufficient experience to do it quickly though.

I have used a very simple algorithm in this code, which is a simple argument reduction to produce a reduced argument in [0,1], followed by a minimax polynomial approximation, and a final step mapping the result back to the full circle. Core approximation was generated with the Remez algorithm and is evaluated using a 2nd-order Horner scheme. Fused-multiply add (FMA) can be used where available. Special cases involving infinities, NaNs, or 0/0 are not handled, for the sake of performance.

#include 
#include 
#include 

/* maximum relative error about 3.6e-5 */
float fast_atan2f (float y, float x)
{
    float a, r, s, t, c, q, ax, ay, mx, mn;
    ax = fabsf (x);
    ay = fabsf (y);
    mx = fmaxf (ay, ax);
    mn = fminf (ay, ax);
    a = mn / mx;
    /* Minimax polynomial approximation to atan(a) on [0,1] */
    s = a * a;
    c = s * a;
    q = s * s;
    r =  0.024840285f * q + 0.18681418f;
    t = -0.094097948f * q - 0.33213072f;
    r = r * s + t;
    r = r * c + a;
    /* Map to full circle */
    if (ay > ax) r = 1.57079637f - r;
    if (x <   0) r = 3.14159274f - r;
    if (y <   0) r = -r;
    return r;
}

/* Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007 */
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

float rand_float(void)
{
    volatile union {
        float f;
        unsigned int i;
    } cvt;
    do {
        cvt.i = KISS;
    } while (isnan(cvt.f) || isinf (cvt.f) || (fabsf (cvt.f) < powf (2.0f, -126)));
    return cvt.f;
}

int main (void)
{
    const int N = 10000;
    const int M = 100000;
    float ref, relerr, maxrelerr = 0.0f;
    float arga[N], argb[N], res[N];
    int i, j;

    printf ("testing atan2() with %d test vectors
", N*M);

    for (j = 0; j < M; j++) {
        for (i = 0; i < N; i++) {
            arga[i] = rand_float();
            argb[i] = rand_float();
        }

        // This loop should be vectorized
        for (i = 0; i < N; i++) {
            res[i] = fast_atan2f (argb[i], arga[i]);
        }

        for (i = 0; i < N; i++) {
            ref = (float) atan2 ((double)argb[i], (double)arga[i]);
            relerr = (res[i] - ref) / ref;
            if ((fabsf (relerr) > maxrelerr) && 
                (fabsf (ref) >= powf (2.0f, -126))) { // result not denormal
                maxrelerr = fabsf (relerr);
            }
        }
    };

    printf ("max rel err = % 15.8e

", maxrelerr);

    printf ("atan2(1,0)  = % 15.8e
", fast_atan2f(1,0));
    printf ("atan2(-1,0) = % 15.8e
", fast_atan2f(-1,0));
    printf ("atan2(0,1)  = % 15.8e
", fast_atan2f(0,1));
    printf ("atan2(0,-1) = % 15.8e
", fast_atan2f(0,-1));
    return EXIT_SUCCESS;
}

The output of the program above should look similar to the following:

testing atan2() with 1000000000 test vectors
max rel err =  3.53486939e-005

atan2(1,0)  =  1.57079637e+000
atan2(-1,0) = -1.57079637e+000
atan2(0,1)  =  0.00000000e+000
atan2(0,-1) =  3.14159274e+000

atan2 approximation with 11bits in mantissa on x86(with SSE2) and ARM(with vfpv4 NEON)

Answers (1)

Related Questions