Reputation: 26085
For the simple and efficient implementation of fast math functions with reasonable accuracy, polynomial minimax approximations are often the method of choice. Minimax approximations are typically generated with a variant of the Remez algorithm. Various widely available tools such as Maple and Mathematica have built-in functionality for this. The generated coefficients are typically computed using high-precision arithmetic. It is well-known that simply rounding those coefficients to machine precision leads to suboptimal accuracy in the resulting implementation.
Instead, one searches for closely related sets of coefficients that are exactly representable as machine numbers to generate a machine-optimized approximation. Two relevant papers are:
Nicolas Brisebarre, Jean-Michel Muller, and Arnaud Tisserand, "Computing Machine-Efficient Polynomial Approximations", ACM Transactions on Mathematical Software, Vol. 32, No. 2, June 2006, pp. 236–256.
Nicolas Brisebarre and Sylvain Chevillard, "Efficient polynomial L∞-approximations", 18th IEEE Symposium on Computer Arithmetic (ARITH-18), Montpellier (France), June 2007, pp. 169–176.
An implementation of the LLL algorithm from the latter paper is available as the fpminimax() command of the Sollya tool. It is my understanding that all algorithms proposed for the generation of machine-optimized approximations are based on heuristics, and that it is therefore generally unknown what accuracy can be achieved by an optimal approximation. It is not clear to me whether the availability of FMA (fused multiply-add) for the evaluation of the approximation has an influence on the answer to that question; naively, it seems to me that it should.
I am currently looking at a simple polynomial approximation for arctangent on [-1, 1] that is evaluated in IEEE-754 single-precision arithmetic, using the Horner scheme and FMA. See the function atan_poly() in the C99 code below. For lack of access to a Linux machine at the moment, I did not use Sollya to generate these coefficients, but used my own heuristic that could be loosely described as a mixture of steepest descent and simulated annealing (to avoid getting stuck in local minima). The maximum error of my machine-optimized polynomial is very close to 1 ulp, but ideally I would like the maximum ulp error to be below 1 ulp.
I am aware that I could change my computation to increase the accuracy, for example by using a leading coefficient represented to more than single precision, but I would like to keep the code exactly as is (that is, as simple as possible), adjusting only the coefficients to deliver the most accurate result possible.
A "proven" optimal set of coefficients would be ideal, pointers to relevant literature are welcome. I did a literature search but could not find any paper that advances the state of the art meaningfully beyond Sollya's fpminimax()
, and none that examine the role of FMA (if any) in this issue.
// max ulp err = 1.03143
float atan_poly (float a)
{
    float r, s;
    s = a * a;
    r = 0x1.7ed1ccp-9f;
    r = fmaf (r, s, -0x1.0c2c08p-6f);
    r = fmaf (r, s, 0x1.61fdd0p-5f);
    r = fmaf (r, s, -0x1.3556b2p-4f);
    r = fmaf (r, s, 0x1.b4e128p-4f);
    r = fmaf (r, s, -0x1.230ad2p-3f);
    r = fmaf (r, s, 0x1.9978ecp-3f);
    r = fmaf (r, s, -0x1.5554dcp-2f);
    r = r * s;
    r = fmaf (r, a, a);
    return r;
}
// max ulp err = 1.52637
float my_atanf (float a)
{
    float r, t;
    t = fabsf (a);
    r = t;
    if (t > 1.0f) {
        r = 1.0f / r;
    }
    r = atan_poly (r);
    if (t > 1.0f) {
        // pi/2 - r; the exact product of the two constants represents pi/2
        // to better than single precision, and fmaf consumes that product
        // without an intermediate rounding
        r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r);
    }
    r = copysignf (r, a);
    return r;
}
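For reference, a maximum ulp error of this kind can be measured with a harness along the following lines. This is a minimal sketch rather than the exact program used for the numbers above; it defines one ulp by the exponent of the double-precision reference result, which may differ from other ulp conventions at powers of two:

#include <stdio.h>
#include <math.h>

extern float atan_poly (float a);

/* scan all floats in [-1, 1] and report the maximum ulp error of
   atan_poly() against the double-precision library atan()
*/
int main (void)
{
    double max_err = 0.0;
    float worst = 0.0f;
    for (float a = 1.0f; a >= -1.0f; a = nextafterf (a, -2.0f)) {
        const double ref = atan ((double)a);
        int e;
        frexp (ref, &e);                        /* ref = m * 2^e, 0.5 <= |m| < 1 */
        const double ulp = ldexp (1.0, e - 24); /* spacing of floats near ref */
        const double err = fabs (((double)atan_poly (a) - ref) / ulp);
        if (err > max_err) {
            max_err = err;
            worst = a;
        }
    }
    printf ("max ulp err = %.5f @ |a| = %15.8e\n", max_err, fabsf (worst));
    return 0;
}

(The full scan visits roughly 2^31 floats, so it takes a few minutes.)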
Upvotes: 25
Views: 4276
Reputation: 26085
I pondered the various ideas I received in comments and also ran a few experiments based on that feedback. In the end I decided that a refined heuristic search was the best way forward. I have now managed to reduce the maximum error for atanf_poly() to 1.01036 ulps, with just three arguments exceeding my stated goal of a 1 ulp error bound:
ulp = -1.00829 @ |a| = 9.80738342e-001 0x1.f62356p-1 (3f7b11ab)
ulp = -1.01036 @ |a| = 9.87551928e-001 0x1.f9a068p-1 (3f7cd034)
ulp = 1.00050 @ |a| = 9.99375939e-001 0x1.ffae34p-1 (3f7fd71a)
Based on the manner of generating the improved approximation, there is no guarantee that this is a best approximation; no scientific breakthrough here. As the ulp error of the current solution is not yet perfectly balanced, and since continuing the search keeps delivering better approximations (albeit at exponentially increasing time intervals), my guess is that a 1 ulp error bound is achievable, but at the same time we seem to be very close to the best machine-optimized approximation already.
The better quality of the new approximation is the result of a refined search process. I observed that all of the largest ulp errors in the polynomial occur close to unity, say in [0.75, 1.0] to be conservative. This allows for a fast scan for interesting coefficient sets whose maximum error is smaller than some bound, say 1.08 ulps. I can then test in detail and exhaustively all coefficient sets within a heuristically chosen hyper-cone anchored at that point. This second step searches for minimum ulp error as the primary goal, and maximum percentage of correctly rounded results as a secondary objective. By using this two-step process across all four cores of my CPU I was able to significantly speed up the search process: I have been able to check about 2^21 coefficient sets so far.
Based on the range of each coefficient across all "close" solutions, I now estimate that the total useful search space for this approximation problem is >= 2^24 coefficient sets, rather than the more optimistic number of 2^20 I threw out before. This seems like a feasible problem to solve for someone who is either very patient or has lots of computational horsepower at their disposal.
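The fast-scan step might be sketched as follows. This is an illustration of the idea rather than the actual search code, and ulp_error() is a hypothetical helper that measures the error of a candidate coefficient set at one argument, e.g. against the double-precision library atan() as in the harness sketched in the question:

#include <math.h>

/* Hypothetical helper: ulp error of the candidate polynomial (coefficients
   c[0] = C17 down to c[7] = C3) at argument a. */
extern double ulp_error (const float c[8], float a);

/* Fast pre-filter: reject a candidate coefficient set as soon as its error
   anywhere in [0.75, 1] exceeds the bound; only survivors are then examined
   exhaustively over the full interval. */
static int candidate_survives (const float c[8], double bound)
{
    for (float a = 0.75f; a <= 1.0f; a = nextafterf (a, 2.0f)) {
        if (ulp_error (c, a) > bound)
            return 0; /* early out */
    }
    return 1;
}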
My updated code is as follows:
// max ulp err = 1.01036
float atanf_poly (float a)
{
    float r, s;
    s = a * a;
    r = 0x1.7ed22cp-9f;
    r = fmaf (r, s, -0x1.0c2c2ep-6f);
    r = fmaf (r, s, 0x1.61fdf6p-5f);
    r = fmaf (r, s, -0x1.3556b4p-4f);
    r = fmaf (r, s, 0x1.b4e12ep-4f);
    r = fmaf (r, s, -0x1.230ae0p-3f);
    r = fmaf (r, s, 0x1.9978eep-3f);
    r = fmaf (r, s, -0x1.5554dap-2f);
    r = r * s;
    r = fmaf (r, a, a);
    return r;
}
// max ulp err = 1.51871
float my_atanf (float a)
{
    float r, t;
    t = fabsf (a);
    r = t;
    if (t > 1.0f) {
        r = 1.0f / r;
    }
    r = atanf_poly (r);
    if (t > 1.0f) {
        r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r); // pi/2 - r
    }
    r = copysignf (r, a);
    return r;
}
Update (after revisiting the issue two-and-a-half years later)
Using T. Myklebust's draft publication as a starting point, I found the arctangent approximation on [-1, 1] with the smallest error yet: a maximum error of 0.94528 ulp.
/* Based on: Tor Myklebust, "Computing accurate Horner form approximations
   to special functions in finite precision arithmetic", arXiv:1508.03211,
   August 2015. maximum ulp err = 0.94528
*/
float atanf_poly (float a)
{
    float r, s;
    s = a * a;
    r = 0x1.6d2086p-9f;               //  2.78569828e-3
    r = fmaf (r, s, -0x1.03f2ecp-6f); // -1.58660226e-2
    r = fmaf (r, s, 0x1.5beebap-5f);  //  4.24722321e-2
    r = fmaf (r, s, -0x1.33194ep-4f); // -7.49753043e-2
    r = fmaf (r, s, 0x1.b403a8p-4f);  //  1.06448799e-1
    r = fmaf (r, s, -0x1.22f5c2p-3f); // -1.42070308e-1
    r = fmaf (r, s, 0x1.997748p-3f);  //  1.99934542e-1
    r = fmaf (r, s, -0x1.5554d8p-2f); // -3.33331466e-1
    r = r * s;
    r = fmaf (r, a, a);
    return r;
}
Upvotes: 5
Reputation: 14205
The following function is a faithfully-rounded implementation of arctan on [0, 1]:
float atan_poly (float a) {
    float s = a * a, u = fmaf(a, -a, 0x1.fde90cp-1f); // u = K - a*a for a carefully chosen constant K (see below)
    float r1 = 0x1.74dfb6p-9f;
    float r2 = fmaf (r1, u, 0x1.3a1c7cp-8f);
    float r3 = fmaf (r2, s, -0x1.7f24b6p-7f);
    float r4 = fmaf (r3, u, -0x1.eb3900p-7f);
    float r5 = fmaf (r4, s, 0x1.1ab95ap-5f);
    float r6 = fmaf (r5, u, 0x1.80e87cp-5f);
    float r7 = fmaf (r6, s, -0x1.e71aa4p-4f);
    float r8 = fmaf (r7, u, -0x1.b81b44p-3f);
    float r9 = r8 * s;
    float r10 = fmaf (r9, a, a);
    return r10;
}
The following test harness will abort if the function atan_poly fails to be faithfully rounded on [1e-16, 1] and print "success" otherwise:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int checkit(float f) {
    double d = atan(f);
    float d1 = d, d2 = d;
    if (d1 < d) d2 = nextafterf(d1, 1.0/0.0);
    else d1 = nextafterf(d1, -1.0/0.0);
    float p = atan_poly(f);
    if (p != d1 && p != d2) return 0;
    return 1;
}

int main() {
    for (float f = 1; f > 1e-16; f = nextafterf(f, -1.0/0.0)) {
        if (!checkit(f)) abort();
    }
    printf("success\n");
    exit(0);
}
The problem with using s in every multiplication is that the polynomial's coefficients do not decay rapidly. Inputs close to 1 result in lots and lots of cancellation of nearly equal numbers, meaning you're trying to find a set of coefficients so that the accumulated roundoff at the end of the computation closely approximates the residual of arctan.
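To see the size of that cancellation, consider the Taylor coefficients 1/3, 1/5, 1/7, ... that these minimax coefficients stay close to: at a = 1 the polynomial tail has to produce arctan(1) - 1 = pi/4 - 1 ≈ -0.2146 out of terms as large as 1/3. The partial sums oscillate: -1/3 ≈ -0.3333, then -1/3 + 1/5 ≈ -0.1333, then ≈ -0.2762 after subtracting 1/7, then ≈ -0.1651 after adding 1/9, and so on; every partial sum is comparable in magnitude to the final answer, so each rounding error incurred along the way stays visible in the result.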
The constant 0x1.fde90cp-1f is a number close to 1 for which (arctan(sqrt(x)) - x) / x^3 is very close to the nearest float. That is, it's a constant that goes into the computation of u so that the cubic coefficient is almost completely determined. (For this program, the cubic coefficient must be either -0x1.b81b44p-3f or -0x1.b81b42p-3f.)
Alternating multiplications by s and u has the effect of reducing the effect of roundoff error in r_i upon r_{i+2} by a factor of at most 1/4, since s*u < 1/4 whatever a is. This gives considerable leeway in choosing the coefficients of fifth order and beyond.
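As a quick check of that bound: writing K = 0x1.fde90cp-1 ≈ 0.995918, we have u = K - s, so s*u = s*(K - s), which is maximized at s = K/2, where it equals K^2/4 ≈ 0.24796 < 1/4.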
I found the coefficients with the aid of two programs:

The first program plugs in a number of test points and writes down a system of linear inequalities: given a, one can compute the range of r8 that leads to a faithfully-rounded result. To get linear inequalities, I pretended r8 would be computed as a polynomial in the floats s and u in real-number arithmetic; the linear inequalities constrained this real-number r8 to lie in some interval. I used the Parma Polyhedra Library to handle these constraint systems.

The second program tests a candidate set of coefficients by running over all floats from 1 to 1e-8 in descending order and checking that atan_poly produces a faithful rounding of atan((double)x). If some x failed, it printed out that x and why it failed.

To get coefficients, I hacked the first program to fix c3, work out bounds on r7 for each test point, then get bounds on the higher-order coefficients. Then I hacked it to fix c3 and c5 and get bounds on the higher-order coefficients. I did this until I had all but the three highest-order coefficients, c13, c15, and c17.
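The first of those computations can be illustrated with a small sketch. This is a reconstruction of the idea, not Myklebust's actual program; it treats the final two operations as exact, whereas the real linear-inequality setup also has to account for their rounding:

#include <stdio.h>
#include <math.h>

/* For an argument a in (0, 1], compute a simplified admissible interval for
   r8: the result r10 must be one of the two floats bracketing atan(a), and
   with r9 = r8 * s and r10 = fmaf(r9, a, a) treated as exact real
   arithmetic, r10 = a + r8 * a^3, so r8 must lie in
   [(lo - a) / a^3, (hi - a) / a^3]. */
static void r8_interval (float a, double *r8_lo, double *r8_hi)
{
    const double d = atan ((double)a);
    float lo = (float)d, hi = (float)d;
    if ((double)lo < d) hi = nextafterf (lo,  1.0f / 0.0f);
    else                lo = nextafterf (hi, -1.0f / 0.0f);
    const double a3 = (double)a * (double)a * (double)a;
    *r8_lo = ((double)lo - (double)a) / a3;
    *r8_hi = ((double)hi - (double)a) / a3;
}

int main (void)
{
    double lo, hi;
    r8_interval (0.5f, &lo, &hi);
    printf ("r8 in [%a, %a]\n", lo, hi);
    return 0;
}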
I grew the set of test points in the second program until it either stopped printing anything out or printed "success". I needed surprisingly few test points to reject almost all wrong polynomials; I count 85 test points in the program.
Here I show some of my work selecting the coefficients. In order to get a faithfully-rounded arctan for my initial set of test points, assuming r1 through r8 are evaluated in real arithmetic (and rounded somehow unpleasantly but in a way I can't remember) but r9 and r10 are evaluated in float arithmetic, I need:
-0x1.b81b456625f15p-3 <= c3 <= -0x1.b81b416e22329p-3
-0x1.e71d48d9c2ca4p-4 <= c5 <= -0x1.e71783472f5d1p-4
0x1.80e063cb210f9p-5 <= c7 <= 0x1.80ed6efa0a369p-5
0x1.1a3925ea0c5a9p-5 <= c9 <= 0x1.1b3783f148ed8p-5
-0x1.ec6032f293143p-7 <= c11 <= -0x1.e928025d508p-7
-0x1.8c06e851e2255p-7 <= c13 <= -0x1.732b2d4677028p-7
0x1.2aff33d629371p-8 <= c15 <= 0x1.41e9bc01ae472p-8
0x1.1e22f3192fd1dp-9 <= c17 <= 0x1.d851520a087c2p-9
Taking c3 = -0x1.b81b44p-3, assuming r8 is also evaluated in float arithmetic:
-0x1.e71df05b5ad56p-4 <= c5 <= -0x1.e7175823ce2a4p-4
0x1.80df529dd8b18p-5 <= c7 <= 0x1.80f00e8da7f58p-5
0x1.1a283503e1a97p-5 <= c9 <= 0x1.1b5ca5beeeefep-5
-0x1.ed2c7cd87f889p-7 <= c11 <= -0x1.e8c17789776cdp-7
-0x1.90759e6defc62p-7 <= c13 <= -0x1.7045e66924732p-7
0x1.27eb51edf324p-8 <= c15 <= 0x1.47cda0bb1f365p-8
0x1.f6c6b51c50b54p-10 <= c17 <= 0x1.003a00ace9a79p-8
Taking c5 = -0x1.e71aa4p-4, assuming r7 is done in float arithmetic:
0x1.80e3dcc972cb3p-5 <= c7 <= 0x1.80ed1cf56977fp-5
0x1.1aa005ff6a6f4p-5 <= c9 <= 0x1.1afce9904742p-5
-0x1.ec7cf2464a893p-7 <= c11 <= -0x1.e9d6f7039db61p-7
-0x1.8a2304daefa26p-7 <= c13 <= -0x1.7a2456ddec8b2p-7
0x1.2e7b48f595544p-8 <= c15 <= 0x1.44437896b7049p-8
0x1.396f76c06de2ep-9 <= c17 <= 0x1.e3bedf4ed606dp-9
Taking c7 = 0x1.80e87cp-5, assuming r6 is done in float arithmetic:
0x1.1aa86d25bb64fp-5 <= c9 <= 0x1.1aca48cd5caabp-5
-0x1.eb6311f6c29dcp-7 <= c11 <= -0x1.eaedb032dfc0cp-7
-0x1.81438f115cbbp-7 <= c13 <= -0x1.7c9a106629f06p-7
0x1.36d433f81a012p-8 <= c15 <= 0x1.3babb57bb55bap-8
0x1.5cb14e1d4247dp-9 <= c17 <= 0x1.84f1151303aedp-9
Taking c9 = 0x1.1ab95ap-5, assuming r5 is done in float arithmetic:
-0x1.eb51a3b03781dp-7 <= c11 <= -0x1.eb21431536e0dp-7
-0x1.7fcd84700f7cfp-7 <= c13 <= -0x1.7ee38ee4beb65p-7
0x1.390fa00abaaabp-8 <= c15 <= 0x1.3b100a7f5d3cep-8
0x1.6ff147e1fdeb4p-9 <= c17 <= 0x1.7ebfed3ab5f9bp-9
I picked a point close to the middle of the range for c11 and randomly chose c13, c15, and c17.
EDIT: I've now automated this procedure. The following function is also a faithfully-rounded implementation of arctan on [0, 1]:
float c5  =  0x1.997a72p-3;
float c7  = -0x1.23176cp-3;
float c9  =  0x1.b523c8p-4;
float c11 = -0x1.358ff8p-4;
float c13 =  0x1.61c5c2p-5;
float c15 = -0x1.0b16e2p-6;
float c17 =  0x1.7b422p-9;

float juffa_poly (float a) {
    float s = a * a;
    float r1 = c17;
    float r2 = fmaf (r1, s, c15);
    float r3 = fmaf (r2, s, c13);
    float r4 = fmaf (r3, s, c11);
    float r5 = fmaf (r4, s, c9);
    float r6 = fmaf (r5, s, c7);
    float r7 = fmaf (r6, s, c5);
    float r8 = fmaf (r7, s, -0x1.5554dap-2f);
    float r9 = r8 * s;
    float r10 = fmaf (r9, a, a);
    return r10;
}
I find it surprising that this code even exists. For coefficients near these, you can get a bound on the distance between r10 and the value of the polynomial evaluated in real arithmetic on the order of a few ulps, thanks to the slow convergence of this polynomial when s is near 1. I had expected roundoff error to behave in a way that was fundamentally "untamable" simply by means of tweaking coefficients.
Upvotes: 6
Reputation: 39356
This, too, is not an answer, but an extended comment.
Recent Intel CPUs and some future AMD CPUs have AVX2. In Linux, look for the avx2 flag in /proc/cpuinfo to see if your CPU supports it.
AVX2 is an extension that allows us to construct and compute using 256-bit vectors -- for example, eight single-precision numbers, or four double-precision numbers -- instead of just scalars. It includes FMA3 support, meaning fused multiply-add for such vectors. Simply put, AVX2 allows us to evaluate eight polynomials in parallel, in pretty much the same time as we evaluate a single one using scalar operations.
The function error8() analyses one set of coefficients, using predefined values of x, comparing against precalculated values of atan(x), and returns the error in ULPs (below and above the desired result separately), as well as the number of results that match the desired floating-point value exactly. These are not needed for simply testing whether a set of coefficients is better than the currently best known set, but they allow different strategies for choosing which coefficients to test. (Basically, the maximum error in ULPs forms a surface, and we're trying to find the lowest point on that surface; knowing the "height" of the surface at each point allows us to make educated guesses as to which direction to go, i.e. how to change the coefficients.)
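To make those "educated guesses" concrete, here is a self-contained sketch of one such strategy. It is an illustration, not part of the program below: it uses a coarse grid instead of the full precalculated tables, and plain greedy descent instead of anything smarter.

#include <stdio.h>
#include <math.h>

/* Maximum ulp error of candidate coefficients c[8] = { C17, ..., C3 } over
   a coarse grid in [0.75, 1]; a real search would scan every float there. */
static float max_ulp_error (const float c[8])
{
    float maxe = 0.0f;
    for (int i = 0; i <= 1000; i++) {
        const float a = 0.75f + 0.25f * (float)i / 1000.0f;
        const float s = a * a;
        float r = c[0];
        for (int k = 1; k < 8; k++)
            r = fmaf (r, s, c[k]);
        r = fmaf (r * s, a, a);
        const double ref = atan ((double)a);
        int e;
        frexp (ref, &e);                        /* ref = m * 2^e, 0.5 <= |m| < 1 */
        const double ulp = ldexp (1.0, e - 24); /* spacing of floats near ref */
        const float err = (float)fabs (((double)r - ref) / ulp);
        if (err > maxe)
            maxe = err;
    }
    return maxe;
}

/* Greedy one-ULP descent on the error surface: nudge each coefficient up or
   down by one ULP and keep any nudge that lowers the maximum error. This
   stops in a local minimum; escaping those is what the simulated-annealing
   component mentioned in the question is for. */
static float descend (float c[8])
{
    float best = max_ulp_error (c);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int i = 0; i < 8; i++) {
            const float orig = c[i];
            const float cand[2] = { nextafterf (orig, -HUGE_VALF),
                                    nextafterf (orig,  HUGE_VALF) };
            float kept = orig;
            for (int k = 0; k < 2; k++) {
                c[i] = cand[k];
                const float e = max_ulp_error (c);
                if (e < best) {
                    best = e;
                    kept = cand[k];
                    improved = 1;
                }
            }
            c[i] = kept;
        }
    }
    return best;
}

int main (void)
{
    float c[8] = {  0x1.7ed1ccp-9f, -0x1.0c2c08p-6f,  0x1.61fdd0p-5f, -0x1.3556b2p-4f,
                    0x1.b4e128p-4f, -0x1.230ad2p-3f,  0x1.9978ecp-3f, -0x1.5554dcp-2f };
    printf ("local minimum on the coarse grid: %.5f ulp\n", descend (c));
    return 0;
}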
There are four precalculated tables used: known_x for the arguments, known_f for the correctly-rounded single-precision results, known_a for the double-precision "accurate" values (I'm just hoping the library atan() is precise enough for this; one should not rely on it without checking!), and known_m to scale the double-precision difference to ULPs. Given a desired range of arguments, the precalculate() function precalculates these tables using the library atan() function. (It also relies on IEEE-754 floating-point formats and on float and integer byte order being the same, but this is true on the CPUs this code runs on.)
Note that the known_x, known_f, and known_a arrays could be stored in binary files; the known_m contents are trivially derived from known_a. Using the library atan() without verifying it is not a good idea, but because mine matches njuffa's results, I didn't bother to look for a better reference atan().
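To see how the known_m scaling works, here is a worked example based on the frexp()/ldexp() calls in precalculate() below: for x = 1, atan(1) = pi/4 ≈ 0.785398, and frexp() reports exponent 0 since 0.785398 lies in [0.5, 1). precalculate() therefore stores m = ldexp(16777216.0, 0) = 2^24. Single-precision floats in [0.5, 1) are spaced exactly 2^-24 apart, so multiplying the difference between the computed and the accurate result by m converts it to ULPs, which is exactly what error8() does.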
For simplicity, here is the code in the form of an example program:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#include <errno.h>
/** poly8() - Compute eight polynomials in parallel.
 * @x - the arguments
 * @c - the coefficients.
 *
 * The first coefficients are for degree 17, the second
 * for degree 15, and so on, down to degree 3.
 *
 * The compiler should vectorize the expression using vfmaddXXXps
 * given an AVX2-capable CPU; for example, Intel Haswell,
 * Broadwell, Haswell E, Broadwell E, Skylake, or Cannonlake;
 * or AMD Excavator CPUs. Tested on Intel Core i5-4200U.
 *
 * Using GCC-4.8.2 and
 *     gcc -O2 -march=core-avx2 -mtune=generic
 * this code produces assembly (AT&T syntax)
 *     vmulps      %ymm0, %ymm0, %ymm2
 *     vmovaps     (%rdi), %ymm1
 *     vmovaps     %ymm0, %ymm3
 *     vfmadd213ps 32(%rdi), %ymm2, %ymm1
 *     vfmadd213ps 64(%rdi), %ymm2, %ymm1
 *     vfmadd213ps 96(%rdi), %ymm2, %ymm1
 *     vfmadd213ps 128(%rdi), %ymm2, %ymm1
 *     vfmadd213ps 160(%rdi), %ymm2, %ymm1
 *     vfmadd213ps 192(%rdi), %ymm2, %ymm1
 *     vfmadd213ps 224(%rdi), %ymm2, %ymm1
 *     vmulps      %ymm2, %ymm1, %ymm0
 *     vfmadd132ps %ymm3, %ymm3, %ymm0
 *     ret
 * if you omit the 'static inline'.
 */
static inline __v8sf poly8(const __v8sf x, const __v8sf *const c)
{
    const __v8sf xx = x * x;
    return (((((((c[0]*xx + c[1])*xx + c[2])*xx + c[3])*xx + c[4])*xx + c[5])*xx + c[6])*xx + c[7])*xx*x + x;
}
/** error8() - Calculate maximum error in ULPs
 * @x         - the arguments
 * @co        - { C17, C15, C13, C11, C9, C7, C5, C3 }
 * @f         - the correctly rounded results in single precision
 * @a         - the expected results in double precision
 * @m         - 16777216.0 raised to the same power of two as @a normalized
 * @n         - number of vectors to test
 * @max_under - pointer to store the maximum underflow (negative, in ULPs) to
 * @max_over  - pointer to store the maximum overflow (positive, in ULPs) to
 * Returns the number of correctly rounded float results, 0..8*n.
 */
size_t error8(const __v8sf *const x, const float *const co,
              const __v8sf *const f, const __v4df *const a, const __v4df *const m,
              const size_t n,
              float *const max_under, float *const max_over)
{
    const __v8sf c[8] = { { co[0], co[0], co[0], co[0], co[0], co[0], co[0], co[0] },
                          { co[1], co[1], co[1], co[1], co[1], co[1], co[1], co[1] },
                          { co[2], co[2], co[2], co[2], co[2], co[2], co[2], co[2] },
                          { co[3], co[3], co[3], co[3], co[3], co[3], co[3], co[3] },
                          { co[4], co[4], co[4], co[4], co[4], co[4], co[4], co[4] },
                          { co[5], co[5], co[5], co[5], co[5], co[5], co[5], co[5] },
                          { co[6], co[6], co[6], co[6], co[6], co[6], co[6], co[6] },
                          { co[7], co[7], co[7], co[7], co[7], co[7], co[7], co[7] } };
    __v4df min = { 0.0, 0.0, 0.0, 0.0 };
    __v4df max = { 0.0, 0.0, 0.0, 0.0 };
    __v8si eqs = { 0, 0, 0, 0, 0, 0, 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        const __v8sf v = poly8(x[i], c);
        const __v4df d0 = { v[0], v[1], v[2], v[3] };
        const __v4df d1 = { v[4], v[5], v[6], v[7] };
        const __v4df err0 = (d0 - a[2*i+0]) * m[2*i+0];
        const __v4df err1 = (d1 - a[2*i+1]) * m[2*i+1];
        /* The comparison yields all-ones (-1 as a signed integer) in each
           lane that matches the correctly-rounded result exactly, so
           subtracting accumulates per-lane counts of exact results. */
        eqs -= (__v8si)_mm256_cmp_ps(v, f[i], _CMP_EQ_OQ);
        min = _mm256_min_pd(min, err0);
        max = _mm256_max_pd(max, err1);
        min = _mm256_min_pd(min, err1);
        max = _mm256_max_pd(max, err0);
    }

    if (max_under) {
        if (min[0] > min[1]) min[0] = min[1];
        if (min[0] > min[2]) min[0] = min[2];
        if (min[0] > min[3]) min[0] = min[3];
        *max_under = min[0];
    }
    if (max_over) {
        if (max[0] < max[1]) max[0] = max[1];
        if (max[0] < max[2]) max[0] = max[2];
        if (max[0] < max[3]) max[0] = max[3];
        *max_over = max[0];
    }

    return (size_t)((unsigned int)eqs[0])
         + (size_t)((unsigned int)eqs[1])
         + (size_t)((unsigned int)eqs[2])
         + (size_t)((unsigned int)eqs[3])
         + (size_t)((unsigned int)eqs[4])
         + (size_t)((unsigned int)eqs[5])
         + (size_t)((unsigned int)eqs[6])
         + (size_t)((unsigned int)eqs[7]);
}
/** precalculate() - Allocate and precalculate tables for error8().
 * @x0   - First argument to precalculate
 * @x1   - Last argument to precalculate
 * @xptr - Pointer to a __v8sf pointer for the arguments
 * @fptr - Pointer to a __v8sf pointer for the correctly rounded results
 * @aptr - Pointer to a __v4df pointer for the comparison results
 * @mptr - Pointer to a __v4df pointer for the difference multipliers
 * Returns the vector count if successful,
 * 0 with errno set otherwise.
 */
size_t precalculate(const float x0, const float x1,
                    __v8sf **const xptr, __v8sf **const fptr,
                    __v4df **const aptr, __v4df **const mptr)
{
    const size_t align = 64;
    unsigned int i0, i1;
    size_t n, i, sbytes, dbytes;
    __v8sf *x = NULL;
    __v8sf *f = NULL;
    __v4df *a = NULL;
    __v4df *m = NULL;

    if (!xptr || !fptr || !aptr || !mptr) {
        errno = EINVAL;
        return (size_t)0;
    }

    /* Map the float bit patterns to integers whose ordering matches the
       floats' (flip all bits of negatives, flip just the sign bit of
       non-negatives), so that the count of representable values between
       x0 and x1 can be obtained by integer subtraction. */
    memcpy(&i0, &x0, sizeof i0);
    memcpy(&i1, &x1, sizeof i1);
    i0 ^= (i0 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
    i1 ^= (i1 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
    if (i1 > i0)
        n = (((size_t)i1 - (size_t)i0) | (size_t)7) + (size_t)1;
    else
    if (i0 > i1)
        n = (((size_t)i0 - (size_t)i1) | (size_t)7) + (size_t)1;
    else {
        errno = EINVAL;
        return (size_t)0;
    }

    sbytes = n * sizeof (float);
    if (sbytes % align)
        sbytes += align - (sbytes % align);

    dbytes = n * sizeof (double);
    if (dbytes % align)
        dbytes += align - (dbytes % align);

    if (posix_memalign((void **)&x, align, sbytes)) {
        errno = ENOMEM;
        return (size_t)0;
    }
    if (posix_memalign((void **)&f, align, sbytes)) {
        free(x);
        errno = ENOMEM;
        return (size_t)0;
    }
    if (posix_memalign((void **)&a, align, dbytes)) {
        free(f);
        free(x);
        errno = ENOMEM;
        return (size_t)0;
    }
    if (posix_memalign((void **)&m, align, dbytes)) {
        free(a);
        free(f);
        free(x);
        errno = ENOMEM;
        return (size_t)0;
    }

    if (x1 > x0) {
        float *const xp = (float *)x;
        float curr = x0;
        for (i = 0; i < n; i++) {
            xp[i] = curr;
            curr = nextafterf(curr, HUGE_VALF);
        }
        i = n;
        while (i-->0 && xp[i] > x1)
            xp[i] = x1;
    } else {
        float *const xp = (float *)x;
        float curr = x0;
        for (i = 0; i < n; i++) {
            xp[i] = curr;
            curr = nextafterf(curr, -HUGE_VALF);
        }
        i = n;
        while (i-->0 && xp[i] < x1)
            xp[i] = x1;
    }

    {
        const float *const xp = (const float *)x;
        float *const fp = (float *)f;
        double *const ap = (double *)a;
        double *const mp = (double *)m;
        for (i = 0; i < n; i++) {
            const float curr = xp[i];
            int temp;
            fp[i] = atanf(curr);
            ap[i] = atan((double)curr);
            (void)frexp(ap[i], &temp);
            mp[i] = ldexp(16777216.0, temp);
        }
    }

    *xptr = x;
    *fptr = f;
    *aptr = a;
    *mptr = m;
    errno = 0;
    return n/8;
}
static int parse_range(const char *const str, float *const range)
{
    float fmin, fmax;
    char dummy;

    if (sscanf(str, " %f %f %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %f:%f %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %f,%f %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %f/%f %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff %ff %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff:%ff %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff,%ff %c", &fmin, &fmax, &dummy) == 2 ||
        sscanf(str, " %ff/%ff %c", &fmin, &fmax, &dummy) == 2) {
        if (range) {
            range[0] = fmin;
            range[1] = fmax;
        }
        return 0;
    }

    if (sscanf(str, " %f %c", &fmin, &dummy) == 1 ||
        sscanf(str, " %ff %c", &fmin, &dummy) == 1) {
        if (range) {
            range[0] = fmin;
            range[1] = fmin;
        }
        return 0;
    }

    return errno = ENOENT;
}
static int fix_range(float *const f)
{
    if (f && f[0] > f[1]) {
        const float tmp = f[0];
        f[0] = f[1];
        f[1] = tmp;
    }
    return f && isfinite(f[0]) && isfinite(f[1]) && (f[1] >= f[0]);
}
static const char *f2s(char *const buffer, const size_t size, const float value, const char *const invalid)
{
    char format[32];
    float parsed;
    int decimals, length;

    for (decimals = 0; decimals <= 16; decimals++) {
        length = snprintf(format, sizeof format, "%%.%df", decimals);
        if (length < 1 || length >= (int)sizeof format)
            break;
        length = snprintf(buffer, size, format, value);
        if (length < 1 || length >= (int)size)
            break;
        if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
            return buffer;
        decimals++;
    }

    for (decimals = 0; decimals <= 16; decimals++) {
        length = snprintf(format, sizeof format, "%%.%dg", decimals);
        if (length < 1 || length >= (int)sizeof format)
            break;
        length = snprintf(buffer, size, format, value);
        if (length < 1 || length >= (int)size)
            break;
        if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
            return buffer;
        decimals++;
    }

    length = snprintf(buffer, size, "%a", value);
    if (length < 1 || length >= (int)size)
        return invalid;
    if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
        return buffer;

    return invalid;
}
int main(int argc, char *argv[])
{
    float xrange[2] = { 0.75f, 1.00f };
    float c17range[2], c15range[2], c13range[2], c11range[2];
    float c9range[2], c7range[2], c5range[2], c3range[2];
    float c[8];
    __v8sf *known_x;
    __v8sf *known_f;
    __v4df *known_a;
    __v4df *known_m;
    size_t known_n;

    if (argc != 10 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, " %s C17 C15 C13 C11 C9 C7 C5 C3 x\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "Each of the coefficients can be a constant or a range,\n");
        fprintf(stderr, "for example 0.25 or 0.75:1. x must be a non-empty range.\n");
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    if (parse_range(argv[1], c17range) || !fix_range(c17range)) {
        fprintf(stderr, "%s: Invalid C17 range or constant.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[2], c15range) || !fix_range(c15range)) {
        fprintf(stderr, "%s: Invalid C15 range or constant.\n", argv[2]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[3], c13range) || !fix_range(c13range)) {
        fprintf(stderr, "%s: Invalid C13 range or constant.\n", argv[3]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[4], c11range) || !fix_range(c11range)) {
        fprintf(stderr, "%s: Invalid C11 range or constant.\n", argv[4]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[5], c9range) || !fix_range(c9range)) {
        fprintf(stderr, "%s: Invalid C9 range or constant.\n", argv[5]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[6], c7range) || !fix_range(c7range)) {
        fprintf(stderr, "%s: Invalid C7 range or constant.\n", argv[6]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[7], c5range) || !fix_range(c5range)) {
        fprintf(stderr, "%s: Invalid C5 range or constant.\n", argv[7]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[8], c3range) || !fix_range(c3range)) {
        fprintf(stderr, "%s: Invalid C3 range or constant.\n", argv[8]);
        return EXIT_FAILURE;
    }
    if (parse_range(argv[9], xrange) || xrange[0] == xrange[1] ||
        !isfinite(xrange[0]) || !isfinite(xrange[1])) {
        fprintf(stderr, "%s: Invalid x range.\n", argv[9]);
        return EXIT_FAILURE;
    }

    known_n = precalculate(xrange[0], xrange[1], &known_x, &known_f, &known_a, &known_m);
    if (!known_n) {
        if (errno == ENOMEM)
            fprintf(stderr, "Not enough memory for precalculated tables.\n");
        else
            fprintf(stderr, "Invalid (empty) x range.\n");
        return EXIT_FAILURE;
    }

    fprintf(stderr, "Precalculated %lu arctangents to compare to.\n", 8UL * (unsigned long)known_n);
    fprintf(stderr, "\nC17 C15 C13 C11 C9 C7 C5 C3 max-ulps-under max-ulps-above correctly-rounded percentage cycles\n");
    fflush(stderr);

    {
        const double percent = 12.5 / (double)known_n;
        size_t rounded;
        char c17buffer[64], c15buffer[64], c13buffer[64], c11buffer[64];
        char c9buffer[64], c7buffer[64], c5buffer[64], c3buffer[64];
        char minbuffer[64], maxbuffer[64];
        float minulps, maxulps;
        unsigned long tsc_start, tsc_stop;

        for (c[0] = c17range[0]; c[0] <= c17range[1]; c[0] = nextafterf(c[0], HUGE_VALF))
        for (c[1] = c15range[0]; c[1] <= c15range[1]; c[1] = nextafterf(c[1], HUGE_VALF))
        for (c[2] = c13range[0]; c[2] <= c13range[1]; c[2] = nextafterf(c[2], HUGE_VALF))
        for (c[3] = c11range[0]; c[3] <= c11range[1]; c[3] = nextafterf(c[3], HUGE_VALF))
        for (c[4] = c9range[0]; c[4] <= c9range[1]; c[4] = nextafterf(c[4], HUGE_VALF))
        for (c[5] = c7range[0]; c[5] <= c7range[1]; c[5] = nextafterf(c[5], HUGE_VALF))
        for (c[6] = c5range[0]; c[6] <= c5range[1]; c[6] = nextafterf(c[6], HUGE_VALF))
        for (c[7] = c3range[0]; c[7] <= c3range[1]; c[7] = nextafterf(c[7], HUGE_VALF)) {
            tsc_start = __builtin_ia32_rdtsc();
            rounded = error8(known_x, c, known_f, known_a, known_m, known_n, &minulps, &maxulps);
            tsc_stop = __builtin_ia32_rdtsc();
            printf("%-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %lu %.3f %lu\n",
                   f2s(c17buffer, sizeof c17buffer, c[0], "?"),
                   f2s(c15buffer, sizeof c15buffer, c[1], "?"),
                   f2s(c13buffer, sizeof c13buffer, c[2], "?"),
                   f2s(c11buffer, sizeof c11buffer, c[3], "?"),
                   f2s(c9buffer, sizeof c9buffer, c[4], "?"),
                   f2s(c7buffer, sizeof c7buffer, c[5], "?"),
                   f2s(c5buffer, sizeof c5buffer, c[6], "?"),
                   f2s(c3buffer, sizeof c3buffer, c[7], "?"),
                   f2s(minbuffer, sizeof minbuffer, minulps, "?"),
                   f2s(maxbuffer, sizeof maxbuffer, maxulps, "?"),
                   rounded, (double)rounded * percent,
                   (unsigned long)(tsc_stop - tsc_start));
            fflush(stdout);
        }
    }

    return EXIT_SUCCESS;
}
The code does compile using GCC-4.8.2 on Linux, but might have to be modified for other compilers and/or OSes. (I'd be happy to include/accept edits fixing those, though; I just don't have Windows or ICC myself, so I can't check.)
To compile this, I recommend
gcc -Wall -O3 -fomit-frame-pointer -march=native -mtune=native example.c -lm -o example
Run without arguments to see usage; or
./example 0x1.7ed24ap-9f -0x1.0c2c12p-6f 0x1.61fdd2p-5f -0x1.3556b0p-4f 0x1.b4e138p-4f -0x1.230ae2p-3f 0x1.9978eep-3f -0x1.5554dap-2f 0.75:1
to check what it reports for njuffa's coefficient set, compared against the standard C library atan() function, with all possible x in [0.75, 1] considered.
Instead of a fixed coefficient, you can also use min:max to define a range to scan (covering all unique single-precision floating-point values in it). Each possible combination of the coefficients is tested.
Because I prefer decimal notation but need to keep the values exact, I use the f2s() function to display the floating-point values. It is a simple brute-force helper that uses the shortest formatting which yields the same value when parsed back to float.
For example,
./example 0x1.7ed248p-9f:0x1.7ed24cp-9f -0x1.0c2c10p-6f:-0x1.0c2c14p-6f 0x1.61fdd0p-5f:0x1.61fdd4p-5f -0x1.3556aep-4f:-0x1.3556b2p-4f 0x1.b4e136p-4f:0x1.b4e13ap-4f -0x1.230ae0p-3f:-0x1.230ae4p-3f 0x1.9978ecp-3f:0x1.9978f0p-3f -0x1.5554d8p-2f:-0x1.5554dcp-2f 0.75:1
computes all the 6561 (3^8) coefficient combinations within ±1 ULP of njuffa's set for x in [0.75, 1]. (Indeed, it shows that decreasing C17 by 1 ULP to 0x1.7ed248p-9f yields the exact same results.)
(That run took 90 seconds on a Core i5-4200U at 2.6 GHz -- pretty much in line with my estimate of 30 coefficient sets per second per GHz per core. While this code is not threaded, the key functions are thread-safe, so threading should not be too difficult. The Core i5-4200U is a laptop CPU and gets pretty hot even when stressing just one core, so I didn't bother.)
(I consider the above code to be in the public domain, or CC0-licensed where a public-domain dedication is not possible. In fact, I'm not sure it is creative enough to be copyrightable at all. Anyway, feel free to use it anywhere in any way you wish, as long as you don't blame me if it breaks.)
Questions? Enhancements? Edits to fix Linux/GCC'isms are welcome!
Upvotes: 2
Reputation: 80276
This is not an answer to the question, but is too long to fit in a comment:
your question is about the optimal choice of coefficients C3, C5, …, C17 in a polynomial approximation to arctangent where you have pinned C1 to 1 and C2, C4, …, C16 to 0.
The title of your question says you are looking for approximations on [-1, 1], and a good reason to pin the even coefficients to 0 is that doing so is necessary and sufficient for the approximation to be exactly an odd function. The code in your question "contradicts" the title by applying the polynomial approximation only on [0, 1].
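(To spell out the oddness argument: for p(x) = c1 x + c2 x^2 + c3 x^3 + ..., requiring p(-x) = -p(x) for all x gives 2 c2 x^2 + 2 c4 x^4 + ... = 0, hence c2 = c4 = ... = 0; conversely, a polynomial with only odd-degree terms trivially satisfies p(-x) = -p(x).)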
If you use the Remez algorithm to look for coefficients C2, C3, …, C8 to a polynomial approximation of arctangent on [0, 1] instead, you may end up with something like the values below:
#include <stdio.h>
#include <math.h>
float atan_poly (float a)
{
float r, s;
s = a;
// s = a * a;
r = -3.3507930064626076153585890630056286726807491543578e-2;
r = fmaf (r, s, 1.3859776280052980081098065189344699108643282883702e-1);
r = fmaf (r, s, -1.8186361916440430105127602496688553126414578766147e-1);
r = fmaf (r, s, -1.4583047494913656326643327729704639191810926020847e-2);
r = fmaf (r, s, 2.1335202878219865228365738728594741358740655881373e-1);
r = fmaf (r, s, -3.6801711826027841250774413728610805847547996647342e-3);
r = fmaf (r, s, -3.3289852243978319173749528028057608377028846413080e-1);
r = fmaf (r, s, -1.8631479933914856903459844359251948006605218562283e-5);
r = fmaf (r, s, 1.2917291732886065585264586294461539492689296797761e-7);
r = fmaf (r, a, a);
return r;
}
int main() {
    for (float x = 0.0f; x < 1.0f; x += 0.1f)
        printf("x: %f\n%a\n%a\n\n", x, atan_poly(x), atan(x));
}
This has roughly the same complexity as the code in your question: the number of multiplications is similar. Looking at this polynomial, there is no reason in particular to want to pin any coefficient to 0. If we wanted to approximate an odd function over [-1, 1] without pinning the even coefficients, they would automatically come out very small and subject to absorption, and then we would want to pin them to 0; but for this approximation over [0, 1], they don't, so we don't have to pin them.
It could have been better or worse than the odd polynomial in your question. It turns out that it is worse (see below). This quick-and-dirty application of LolRemez 0.2 (code included at the bottom of this answer) seems, however, good enough to raise the question of the choice of coefficients. I would in particular be curious what happens if you subject the coefficients in this answer to the same "mixture of steepest descent and simulated annealing" optimization step that you applied to get the coefficients in your question.
So, to summarize this remark-posted-as-an-answer: are you sure that you are looking for optimal coefficients C3, C5, …, C17? It seems to me that you are looking for the best sequence of single-precision floating-point operations that produces a faithful approximation to arctangent, and that this approximation does not have to be the Horner form of a degree-17 odd polynomial.
x: 0.000000   0x0p+0          0x0p+0
x: 0.100000   0x1.983e2cp-4   0x1.983e28938f9ecp-4
x: 0.200000   0x1.94442p-3    0x1.94441ff1e8882p-3
x: 0.300000   0x1.2a73a6p-2   0x1.2a73a71dcec16p-2
x: 0.400000   0x1.85a37ap-2   0x1.85a3770ebe7aep-2
x: 0.500000   0x1.dac67p-2    0x1.dac670561bb5p-2
x: 0.600000   0x1.14b1dcp-1   0x1.14b1ddf627649p-1
x: 0.700000   0x1.38b116p-1   0x1.38b113eaa384ep-1
x: 0.800000   0x1.5977a8p-1   0x1.5977a686e0ffbp-1
x: 0.900000   0x1.773388p-1   0x1.77338c44f8faep-1
This is the code that I linked to LolRemez 0.2 in order to optimize the relative accuracy of a degree-9 polynomial approximation of arctangent on [0, 1]:
#include "lol/math/real.h"
#include "lol/math/remez.h"
using lol::real;
using lol::RemezSolver;
real f(real const &y)
{
return (atan(y) - y) / y;
}
real g(real const &y)
{
return re (atan(y) / y);
}
int main(int argc, char **argv)
{
RemezSolver<8, real> solver;
solver.Run("1e-1000", 1.0, f, g, 50);
return 0;
}
Upvotes: 3