Reputation: 113
I'm trying to implement fast atan2(float) with 11 bits accuracy in mantissa. The atan2 implementation will be used for image processing. So it might be better to be implemented with SIMD instruction(The impl targeting x86(with SSE2) & ARM(with vpfv4 NEON)).
For now, I use chebyshev polynomial approximation(https://jp.mathworks.com/help/fixedpoint/examples/calculate-fixed-point-arctangent.html).
I'm willing to implement vectorized code manually. I will use SSE2(or later) & NEON wrapper library. I plan or tried these implementation option
else good idea?
Upvotes: 5
Views: 1526
Reputation: 26195
The first thing you would want to check is whether your compiler is able to vectorize atan2f (y,x)
when applied to an array of float
s. This usually requires at least a high optimization level such as -O3
and possibly specifying a relaxed "fast math" mode, in which errno
handling, denormals and special inputs such as infinities and NaNs are largely ignored. With this approach, accuracy may be well in excess of what is required, but it may be hard to beat a carefully tuned library implementation with respect to performance.
The next thing to try is to write a simple scalar implementation with sufficient accuracy, and have the compiler vectorize it. Typically this means avoiding anything but very simple branches which can be converted to branchless code through if-conversion. An example of such code is the fast_atan2f()
shown below. With the Intel compiler, invoked as icl /O3 /fp:precise /Qvec_report=2 my_atan2f.c
, this is vectorized successfully: my_atan2f.c(67): (col. 9) remark: LOOP WAS VECTORIZED.
Double checking the generated code through disassembly shows that fast_atan2f()
has been inlined and vectorized using SSE instructions of the *ps
flavor.
If all else fails, you could translate the code of fast_atan2()
into platform-specific SIMD intrinsics by hand, which should not be all that hard to do. I do not have sufficient experience to do it quickly though.
I have used a very simple algorithm in this code, which is a simple argument reduction to produce a reduced argument in [0,1], followed by a minimax polynomial approximation, and a final step mapping the result back to the full circle. Core approximation was generated with the Remez algorithm and is evaluated using a 2nd-order Horner scheme. Fused-multiply add (FMA) can be used where available. Special cases involving infinities, NaNs, or 0/0
are not handled, for the sake of performance.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/* maximum relative error about 3.6e-5 */
float fast_atan2f (float y, float x)
{
float a, r, s, t, c, q, ax, ay, mx, mn;
ax = fabsf (x);
ay = fabsf (y);
mx = fmaxf (ay, ax);
mn = fminf (ay, ax);
a = mn / mx;
/* Minimax polynomial approximation to atan(a) on [0,1] */
s = a * a;
c = s * a;
q = s * s;
r = 0.024840285f * q + 0.18681418f;
t = -0.094097948f * q - 0.33213072f;
r = r * s + t;
r = r * c + a;
/* Map to full circle */
if (ay > ax) r = 1.57079637f - r;
if (x < 0) r = 3.14159274f - r;
if (y < 0) r = -r;
return r;
}
/* Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007 */
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
float rand_float(void)
{
volatile union {
float f;
unsigned int i;
} cvt;
do {
cvt.i = KISS;
} while (isnan(cvt.f) || isinf (cvt.f) || (fabsf (cvt.f) < powf (2.0f, -126)));
return cvt.f;
}
int main (void)
{
const int N = 10000;
const int M = 100000;
float ref, relerr, maxrelerr = 0.0f;
float arga[N], argb[N], res[N];
int i, j;
printf ("testing atan2() with %d test vectors\n", N*M);
for (j = 0; j < M; j++) {
for (i = 0; i < N; i++) {
arga[i] = rand_float();
argb[i] = rand_float();
}
// This loop should be vectorized
for (i = 0; i < N; i++) {
res[i] = fast_atan2f (argb[i], arga[i]);
}
for (i = 0; i < N; i++) {
ref = (float) atan2 ((double)argb[i], (double)arga[i]);
relerr = (res[i] - ref) / ref;
if ((fabsf (relerr) > maxrelerr) &&
(fabsf (ref) >= powf (2.0f, -126))) { // result not denormal
maxrelerr = fabsf (relerr);
}
}
};
printf ("max rel err = % 15.8e\n\n", maxrelerr);
printf ("atan2(1,0) = % 15.8e\n", fast_atan2f(1,0));
printf ("atan2(-1,0) = % 15.8e\n", fast_atan2f(-1,0));
printf ("atan2(0,1) = % 15.8e\n", fast_atan2f(0,1));
printf ("atan2(0,-1) = % 15.8e\n", fast_atan2f(0,-1));
return EXIT_SUCCESS;
}
The output of the program above should look similar to the following:
testing atan2() with 1000000000 test vectors
max rel err = 3.53486939e-005
atan2(1,0) = 1.57079637e+000
atan2(-1,0) = -1.57079637e+000
atan2(0,1) = 0.00000000e+000
atan2(0,-1) = 3.14159274e+000
Upvotes: 7