pattonalexo

Reputation: 21

Looking for a way to speed up a function

I'm trying to speed up a big block of code spread across many files and found that one function accounts for about 70% of the total time, because it is called 477+ million times.

The pointer par can only point to one of two presets, either

par[0] = 0.057;
par[1] = 2.87;
par[2] = -3.;
par[3] = -0.03;
par[4] = -3.05;
par[5] = -3.5; 

OR

par[0] = 0.043;
par[1] = 2.92;
par[2] = -3.21;
par[3]= -0.065;
par[4] = -3.00;
par[5] = -2.65;

So I've tried hard-coding the numbers depending on which preset is active, but have failed to find any significant time savings.

The pow and exp functions are called on nearly every invocation and take about 40% and 20% of the total time respectively, so only 10% of the total time is used by the parts of this function that aren't pow or exp. Speeding those up would probably help the most, but none of the exponents passed to pow are integers except -4, and I don't know whether 1./(x*x*x*x) is faster than pow(x, -4).

double Signal::Param_RE_Tterm_approx(double Tterm, double *par) {

    double value = 0.;

    // time after Che angle peak
    if (Tterm > 0.) {

        if ( fabs(Tterm/ *par) >= 1.e-2) {
            value += -1./(*par)*exp(-1.*Tterm/(*par));

        }
        else {
            value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
        }

        if ( fabs(Tterm* *(par+1)) >= 1.e-2) {
            value += *(par+2)* *(par+1)*pow( 1.+*(par+1)*Tterm, *(par+2)-1. );

        }
        else {
            value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
        }

    }

    // time before Che angle peak
    else {

        if ( fabs(Tterm/ *(par+3)) >= 1.e-2 ) {
            value += -1./ *(par+3) *exp(-1.*Tterm/ *(par+3));

        }
        else {
             value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
        }

        if ( fabs(Tterm* *(par+4)) >= 1.e-2 ) {
            value += *(par+5)* *(par+4) *pow( 1.+ *(par+4)*Tterm, *(par+5)-1. );

        }
        else {
             value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
        }
    }

    return value * 1.e9;

}

Upvotes: 2

Views: 202

Answers (4)

Alex Reinking

Reputation: 19816

If you want to explore batching and more optimization opportunities, such as fusing in computations that depend on these values, try Halide.

I've rewritten your program in Halide here:

#include <Halide.h>
using namespace Halide;

class ParamReTtermApproxOpt : public Generator<ParamReTtermApproxOpt>
{
public:
    Input<Buffer<float>> tterm{"tterm", 1};
    Input<Buffer<float>> par{"par", 1};
    Input<int> ncpu{"ncpu"};

    Output<Buffer<float>> output{"output", 1};

    Var x;
    Func par_inv;

    void generate() {
        // precompute 1 / par[x]
        par_inv(x) = fast_inverse(par(x));

        // after che peak
        Expr after_che_peak = tterm(x) > 0;

        Expr first_term = -par_inv(0) * fast_exp(-tterm(x) * par_inv(0));
        Expr second_term = par(2) * par(1) * fast_pow(1 + par(1) * tterm(x), par(2) - 1);

        // before che peak

        Expr third_term = -par_inv(3) * fast_exp(-tterm(x) * par_inv(3));
        Expr fourth_term = par(5) * par(4) * fast_pow(1 + par(4) * tterm(x), par(5) - 1);

        // final value

        output(x) = 1.e9f * select(after_che_peak, first_term + second_term,
                                                   third_term + fourth_term);
    }

    void schedule() {
        par_inv.bound(x, 0, 6);
        par_inv.compute_root();

        Var xo, xi;
        // break x into two loops, one for ncpu tasks
        output.split(x, xo, xi, output.extent() / ncpu)
        // mark the task loop parallel
              .parallel(xo)
        // vectorize each thread's computation for 8-wide vector lanes
              .vectorize(xi, 8);

        output.print_loop_nest();
    }
};

HALIDE_REGISTER_GENERATOR(ParamReTtermApproxOpt, param_re_tterm_approx_opt)

I can run 477,000,000 iterations in slightly over one second on my Surface Book (with ncpu=4). Batching is hugely important here since it enables vectorization.

Note that the equivalent program written with double arithmetic is much slower (20x) than the float version. Halide doesn't supply fast_ versions for doubles, though, so this might not be quite apples-to-apples. Regardless, I would check whether you really need the extra precision.

Upvotes: 0

Rick James

Reputation: 142208

(Partial optimization:)

The longest expression has

  • Common subexpressions
  • A polynomial evaluated the costly way (rather than by Horner's rule).

Pre-define these (perhaps add them to par[]):

a = par[5]*par[4];
b =   (par[5]-1.);
c = b*(par[5]-2.)/2.;
d = c*(par[5]-3.)/3.;

Then, for example, the longest expression becomes:

e = par[4]*Tterm;
value += a*(((d*e + c)*e + b)*e + 1.);

And simplify the rest.

If the expressions are curve-fitting approximations, why not do the same with

value += -1./(*par)*exp(-1.*Tterm/(*par));

You should also ask whether all 477M iterations are needed.

Upvotes: 0

Yakk - Adam Nevraumont

Reputation: 275220

I first rewrote it to be a bit easier to follow:

#include <math.h> 

double Param_RE_Tterm_approx(double Tterm, double const* par) {
  double value = 0.;

  if (Tterm > 0.) {
    // time after Che angle peak

    if ( fabs(Tterm/ par[0]) >= 1.e-2) {
      value += -1./(par[0])*exp(-1.*Tterm/(par[0]));
    } else {
      value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
    }

    if ( fabs(Tterm* par[1]) >= 1.e-2) {
      value += par[2]* par[1]*pow( 1.+par[1]*Tterm, par[2]-1. );
    } else {
      value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
    }

  } else {
    // time before Che angle peak

    if ( fabs(Tterm/ par[3]) >= 1.e-2 ) {
      value += -1./ par[3] *exp(-1.*Tterm/ par[3]);
    } else {
       value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
    }

    if ( fabs(Tterm* par[4]) >= 1.e-2 ) {
      value += par[5]* par[4] *pow( 1.+ par[4]*Tterm, par[5]-1. );

    } else {
       value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
    }
  }

  return value * 1.e9;
}

We can then look at its structure.

There are two main branches -- Tterm negative (before) and positive (after). These correspond to using 0,1,2 or 3,4,5 in the par array.

Then in each branch we do two things to add to value. In both, small arguments use a polynomial and large arguments use an exponential/power expression.

As a guess, this is because the polynomial is a decent approximation of the exponential for small values -- the error is acceptable. What you should do is confirm that guess: take the Taylor series expansion of the "big" power/exponential expression and see whether it agrees with the polynomial. Or check numerically.
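The numerical check is straightforward; a small sketch (p stands in for par[0], function names are illustrative):

```cpp
#include <cmath>

// Compare the exp() branch against its 3rd-order Taylor polynomial
// near the 1e-2 switchover used in the original code.
double exact_term(double T, double p) {
    return -1. / p * std::exp(-T / p);
}

double taylor_term(double T, double p) {
    return -1. / p * (1. - T / p + T * T / (p * p * 2.)
                      - T * T * T / (p * p * p * 6.));
}
```

At |Tterm/par[0]| = 1e-2 the truncation error of the cubic is on the order of (1e-2)^4 / 24, i.e. a few parts in 10^10, which is presumably the "acceptable error" built into the original code.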

If it is the case, this means that this equation has a known amount of error that is acceptable. Quite often there are faster versions of exp or pow that have a known amount of max error; consider using those.

If this isn't the case, there still could be an acceptable amount of error, but the Taylor series approximation can give you "in code" information about what is an acceptable amount of error.

A next step I'd take is to tear the 8 pieces of this equation apart. There is positive/negative, the first and second value+= in each branch, and then the polynomial/exponential case.

I'm guessing exp takes about half the time of pow simply because pow is more expensive per call (each invocation makes one call to each), but profiling the 8 pieces separately might reveal something interesting, like "all of our time is actually in the Tterm > 0. case" or what have you.

Now examine the call sites. Is there a pattern in the Tterm values you pass to this function? That is, do you tend to pass Tterm in roughly sorted order? If so, you can hoist the test for which branch to take out of this function, and process the data in batches.

Simply doing it in batches and compiling with optimization and inlining the bodies of the functions might make a surprising amount of difference; compilers are getting better at vectorizing work.
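The batched shape would look something like this (approx_one is a simplified stand-in for the real function; the point is the loop structure, not the math):

```cpp
#include <cmath>
#include <cstddef>

// Simplified stand-in for the real per-element function.
inline double approx_one(double T, const double* par) {
    return (T > 0.) ? -std::exp(-T / par[0]) / par[0]
                    : -std::exp(-T / par[3]) / par[3];
}

// Batched driver: a tight loop over contiguous arrays gives the compiler a
// chance to inline and vectorize, unlike 477M scattered call sites.
void approx_batch(const double* Tterm, double* out, std::size_t n,
                  const double* par) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = approx_one(Tterm[i], par);
}
```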

If that doesn't work, you can start threading things off. On a modern computer you can have 4-60 threads solving this problem independently, and this problem looks like you'd get nearly linear speedup. A basic threading library, like TBB, would be good for this kind of task.
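In the spirit of that suggestion, a sketch using only std::thread (TBB's parallel_for would handle the chunking for you; the loop body is again a simplified stand-in for the real function):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Each element is independent, so the array splits cleanly across threads.
void parallel_apply(const double* in, double* out, std::size_t n,
                    const double* par, unsigned nthreads) {
    auto work = [=](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            out[i] = -std::exp(-in[i] / par[0]) / par[0];  // stand-in body
    };
    std::vector<std::thread> pool;
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = std::min<std::size_t>(std::size_t(t) * chunk, n);
        std::size_t hi = std::min(lo + chunk, n);
        if (lo < hi) pool.emplace_back(work, lo, hi);
    }
    for (auto& th : pool) th.join();
}
```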

For the next step up, if you are getting large batches of data and need to do a lot of processing, you can push it onto a GPU and solve it there. Sadly, GPU<->RAM bandwidth is limited, so just doing this one function's math on the GPU while shuttling data back and forth with RAM won't gain you much, if anything. But if more work than just this function can move to the GPU, it might be worth it.

Upvotes: 1

gimme_danger

Reputation: 798

"Only 10% of the total time is used by the parts of this function that aren't pow or exp."

If your function's performance bottleneck is the exp() and pow() calls, consider using vector instructions in your calculations. All modern processors support at least the SSE2 instruction set, so this approach should give at least a ~2x speedup, because your calculation can be easily vectorized.

I recommend this C++ vectorization library, which contains all the standard mathematical functions (such as exp and pow) and lets you write code in an OOP style without resorting to assembly language. I have used it several times and it should work well for your problem.

If you have a GPU, you should also consider trying the CUDA framework because, again, your problem can be vectorized perfectly. Moreover, if this function is called 477+ million times, a GPU may eliminate the bottleneck entirely.

Upvotes: 0
