Mark Miles

Reputation: 706

Why is sinf() so slow when compiling my code on Linux/ARMv7?

The same C code that uses the math sin() function runs much slower when I compile my program for ARMv7 than it does on Windows. I'm compiling with -O2 -Wall -Wextra -mfpu=neon -mtune=cortex-a9 -march=armv7 -std=c++11 and my gcc is gcc (Ubuntu/Linaro 4.8.2-19ubuntu1) 4.8.2.

I don't think it's simply that sin() is too slow for real-time use, and I know that a good compromise for a faster sine function would be a lookup table, but what I'm experiencing here is probably an anomaly or a bug in the compiler, because it really takes far too long to run the sin() function.

My program creates a few wavetables at startup, and while it starts almost instantly on Windows, it takes roughly 25-30 seconds to start on Linux/ARM...

Here's the code where sinf() is used and where everything slows down.

for (int n = 0; n < 73; ++n)
{
    // Max number of harmonics
    int hrm = int(16000.f / twf[n]);

    // Set vectors
    basic_wf.assign(wavelength[n], 0);

    for (int i = 0; i < wavelength[n]; ++i)
    {
        // Add harmonics
        for (int h = 1; h < hrm; ++h)
        {
            const float harm = 0.14f * (sinf((float)i * FACTOR * twf[n] * (float)h) / (float)h);
            if (h % 2 == 0) basic_wf[i] -= harm;    // add even negative harmonic
            else basic_wf[i] += harm;               // add odd positive harmonic
        }
    }
}

Here I'm filling 73 tables with a sawtooth waveform, adding the required number of harmonics for each frequency. The lower the pitch of the note, the higher the number of harmonics (i.e. actual sin() calculations). This runs almost instantly on Windows... and takes a lifetime on my Linux box.

Upvotes: 0

Views: 641

Answers (2)

njuffa

Reputation: 26085

The code suggests, and your analysis in the comments confirmed, that the magnitude of the argument to sinf() can get quite large, certainly up to several thousand. The accurate argument reduction used in common library implementations of trig functions can be computationally intensive, and thus slow, for large arguments, especially when the hardware platform has no support for fused multiply-add (FMA) operations. This is likely a contributing factor in the low sinf() performance you observe.

You mentioned in a comment that the operands to sinf() include a factor of π. This indicates you would actually want to use sinpif(), where sinpi(x) = sin(x*π). The sinpi function was introduced by the IEEE-754 (2008) floating-point standard but has not made it into the language standards yet. A few tool chains offer it as an extension, however. The advantage of sinpi() is that it requires only a very simple argument reduction regardless of the magnitude of the argument, which can reduce execution time considerably. Since the multiplication by π is implicit, it can also offer improved accuracy over the discrete approach using sinf().

I am showing an exemplary C99 implementation of sinpif() below. Note that this code relies heavily on the standard math function fmaf() to achieve high processing speed and excellent accuracy. If your CPU does not have hardware support for the fused multiply-add (FMA) operation, this function will execute very slowly, since correct emulation of fmaf() is non-trivial. Since the code is written in a modular fashion, you would want to configure the compiler to apply the maximum amount of function inlining, or add appropriate inlining attributes to all constituent functions.

As you indicate that your hardware platform does not offer native support for FMA, you can replace each fmaf(a,b,c) with (a*b+c), at some loss in accuracy. According to my tests, the maximum ulp error then increases to 1.71364 ulps. This is still very good, but my_sinpif() is no longer faithfully rounded in that case, a property that is generally considered desirable.
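For illustration, that mechanical rewrite applied to the first polynomial helper shown below would look like this (just the substitution described above, not a separately tuned approximation):

/* cospif_poly with each fmaf(r, s, c) replaced by r * s + c (no FMA needed) */
float cospif_poly_nofma (float s)
{
    float r;
    r =           0x1.d98dcep-3f;   //  2.31227502e-1f
    r = r * s + (-0x1.55c4e8p+0f);  // -1.33503580e+0f
    r = r * s +   0x1.03c1d4p+2f;   //  4.05870533e+0f
    r = r * s + (-0x1.3bd3ccp+2f);  // -4.93480206e+0f
    r = r * s +   0x1.000000p+0f;   //  1.00000000e+0f
    return r;
}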

/* Argument reduction for sinpi, cospi, sincospi. Reduces to [-0.25, +0.25] */
float trig_red_pi_f (float a, int *i)
{
    float r;
    r = rintf (a + a);
    *i = (int)r;
    r = a - 0.5f * r;
    return r;
}

/* Approximate cos(pi*x) for x in [-0.25,0.25]. Maximum ulp error = 0.87440 */
float cospif_poly (float s)
{
    float r;
    r =              0x1.d98dcep-3f;   //  2.31227502e-1f
    r = fmaf (r, s, -0x1.55c4e8p+0f);  // -1.33503580e+0f
    r = fmaf (r, s,  0x1.03c1d4p+2f);  //  4.05870533e+0f
    r = fmaf (r, s, -0x1.3bd3ccp+2f);  // -4.93480206e+0f
    r = fmaf (r, s,  0x1.000000p+0f);  //  1.00000000e+0f
    return r;
}

/* Approximate sin(pi*x) for x in [-0.25,0.25]. Maximum ulp error = 0.96441 */
float sinpif_poly (float a, float s)
{
    float r;
    r =             -0x1.2dc6f8p-1f;   // -5.89408636e-1f
    r = fmaf (r, s,  0x1.46602ep+1f);  //  2.54981017e+0f
    r = fmaf (r, s, -0x1.4abbc0p+2f);  // -5.16770935e+0f
    r = r * s;
    r = fmaf (r, a, -0x1.777a5cp-24f * a); // PI_lo // -8.74227766e-8f
    r = fmaf (a, 0x1.921fb6p+1f, r);       // PI_hi //  3.14159274e+0f
    return r;
}

/* Compute sin(pi*x) and cos(pi*x) based on quadrant */
float sinpif_cospif_core (float a, int i)
{
    float r, s;
    s = a * a;
    r = (i & 1) ? cospif_poly (s) : sinpif_poly (a, s);
    if (i & 2) {
        r = 0.0f - r; // don't change "sign" of NaNs or create negative zeros
    }
    return r;
}

/* maximum ulp error = 0.96411 */
float my_sinpif (float a)
{
    float r;
    int i;
    r = trig_red_pi_f (a, &i);
    r = sinpif_cospif_core (r, i);
    /* IEEE-754: sinPi(+n) is +0 and sinPi(-n) is -0 for positive integers n */
    r = (a == truncf (a)) ? (a * 0.0f) : r;
    return r;
}
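As a sketch of how this could slot into your wavetable loop: if FACTOR indeed carries a factor of π (as you indicated in the comments), you can pull the π out and pass the rest to my_sinpif(). The names FACTOR_OVER_PI and SAMPLE_RATE below are only illustrative; substitute whatever your FACTOR actually contains.

// Hypothetical: assumes FACTOR == 2 * PI / SAMPLE_RATE in the original code.
// Then i * FACTOR * twf[n] * h == PI * (i * FACTOR_OVER_PI * twf[n] * h),
// so sinf() can be replaced by my_sinpif() with the factor of PI removed.
const float FACTOR_OVER_PI = 2.0f / SAMPLE_RATE;

for (int h = 1; h < hrm; ++h)
{
    const float harm = 0.14f * (my_sinpif ((float)i * FACTOR_OVER_PI * twf[n] * (float)h) / (float)h);
    if (h % 2 == 0) basic_wf[i] -= harm;    // add even negative harmonic
    else basic_wf[i] += harm;               // add odd positive harmonic
}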

Upvotes: 1

Lutz Lehmann

Reputation: 25992

You could do what Napier and co. already did to compute logarithm tables -- or, more precisely, tables of the powers of 1.000001 or similar.

If you need a vector of values sin(k*w), then compute c1000 = cos(1000*w) and s1000 = sin(1000*w), set

c[0] = 1; s[0] = 0;
c[1000]=c1000; s[1000] = s1000;

and then iteratively

c[1000*(k+1)] = c1000*c[1000*k]-s1000*s[1000*k];
s[1000*(k+1)] = c1000*s[1000*k]+s1000*c[1000*k];

and then fill the gaps using c1 = cos(w) and s1 = sin(w), again via the trigonometric addition identities, stepping 1000 entries forward from each anchor (or, if you prefer, 500 forward and 500 backward). This multi-level approach should keep the accumulation of floating-point error small enough.

On "big" processors there should be no large difference, since the cost of 2 multiplications and of a sincos evaluation should be comparable. In your case the multiplicative procedure should give you some advantage.
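A minimal sketch of this in C: coarse anchors every stride entries computed via the cos(stride*w)/sin(stride*w) recurrence, and the gaps then filled with the step-1 recurrence. The function name and parameters are only illustrative:

#include <math.h>

/* Fill s[k] = sin(k*w) and c[k] = cos(k*w) for k = 0 .. n-1 using the
   angle-addition recurrence, restarting from coarse anchors every 'stride'
   entries to limit the accumulation of rounding error. */
void fill_sin_cos_table (float *s, float *c, int n, float w, int stride)
{
    const float c1 = cosf (w);                   /* step-1 rotation      */
    const float s1 = sinf (w);
    const float cs = cosf ((float)stride * w);   /* coarse-step rotation */
    const float ss = sinf ((float)stride * w);

    /* coarse anchors: (c, s)[k + stride] from (c, s)[k] */
    c[0] = 1.0f;
    s[0] = 0.0f;
    for (int k = 0; k + stride < n; k += stride) {
        c[k + stride] = cs * c[k] - ss * s[k];
        s[k + stride] = cs * s[k] + ss * c[k];
    }

    /* fill the gaps, stepping forward from the previous entry */
    for (int k = 1; k < n; ++k) {
        if (k % stride != 0) {
            c[k] = c1 * c[k - 1] - s1 * s[k - 1];
            s[k] = c1 * s[k - 1] + s1 * c[k - 1];
        }
    }
}

For one of your wavetables this would be called as something like fill_sin_cos_table(s, c, wavelength[n], w, 1000), with w the phase increment per sample.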

Upvotes: 0
