Reputation: 495
I have a function like this in C (in pseudo-ish code, dropping the unimportant parts):
int func(int s, int x, int *a, int *r) {
    int i;
    // do some stuff
    for (i = 0; i < a_really_big_int; ++i) {
        if (s)
            r[i] = x ^ i;
        else
            r[i] = x ^ a[i];
        // and maybe a couple other ways of computing r
        // that are equally fast individually
    }
    // do some other stuff
}
This code gets called so often that this loop is actually a speed bottleneck in the code. I am wondering a couple of things:
Since the switch s is a constant in the function, will good compilers optimize the loop so that the branch isn't slowing things down all the time?
If not, what is a good way to optimize this code?
====
EDIT: Here is an update with a fuller example:
int func(int s,
         int start, int stop, int stride,
         double *x, double *b,
         int *a, int *flips, int *signs, int i_max,
         double *c)
{
    int i, k, st;
    for (k = start; k < stop; k += stride) {
        b[k] = 0;
        for (i = 0; i < i_max; ++i) {
            /* this is the code in question */
            if (s) st = k ^ flips[i];
            else   st = a[k] ^ flips[i];
            /* done with code in question */
            b[k] += x[st] * (__builtin_popcount(st & signs[i]) % 2 ? -c[i] : c[i]);
        }
    }
}
EDIT 2:
In case anyone is curious, I ended up refactoring the code, hoisting the whole inner for loop (over i_max) out, making the really_big_int loop much simpler and hopefully easy to vectorize! (It also avoids redoing a bunch of extra logic a zillion times.)
Upvotes: 2
Views: 3793
Reputation: 7802
Assuming x86_64, you can ensure that the pointers are aligned to 16 bytes and use intrinsics. If the code only runs on systems with AVX2, you could use the __m256i variants (and similarly for AVX-512).
int func(int s, int x, const __m128i* restrict a, __m128i* restrict r) {
    size_t i = 0, max = a_really_big_int / 4;
    __m128i xv = _mm_set1_epi32(x);
    // do some stuff
    if (s) {
        // _mm_set_epi32 takes lanes high-to-low, so this is {0,1,2,3}
        __m128i iv = _mm_set_epi32(3, 2, 1, 0);
        __m128i four = _mm_set1_epi32(4);
        for ( ; i < max; ++i, iv = _mm_add_epi32(iv, four)) {
            r[i] = _mm_xor_si128(xv, iv);
        }
    } else { /* not (s) */
        for ( ; i < max; ++i) {
            r[i] = _mm_xor_si128(xv, a[i]);
        }
    }
    // do some other stuff
}
Upvotes: 1
Reputation: 153303
Micro-optimizations
Usually they are not worth the time - reviewing larger issues is more effective.
Yet to micro-optimize, trying a variety of approaches and then profiling them to find the best can yield modest improvements.
In addition to @wallyk's and @kabanus's fine answers, some simplistic compilers benefit from a loop that counts down to 0.
// for (i = 0; i < a_really_big_int; ++i) {
for (i = a_really_big_int; i--; ) {
[edit 2nd optimization]
OP added a more complete example. One of the issues is that the compiler cannot assume that the memory pointed to by b and the other pointers does not overlap. This prevents certain optimizations.
Assuming they do not in fact overlap, use restrict on b to allow optimizations. const helps too, for weaker compilers that do not deduce it. restrict on the others may also benefit, again, if the referenced data does not overlap.
// int func(int s, int start, int stop, int stride, double *x,
// double *b, int *a, int *flips,
// int *signs, int i_max, double *c) {
int func(int s, int start, int stop, int stride, const double * restrict x,
double * restrict b, const int * restrict a, const int * restrict flips,
const int * restrict signs, int i_max, double *c) {
Upvotes: 3
Reputation: 19221
Although the if statement will be optimized away by any decent compiler (unless you ask the compiler not to optimize), I would consider writing the optimization in explicitly (just in case you compile without optimizations).
In addition, although the compiler might optimize away the sign-flipping conditional, I would consider optimizing it manually, either using any available builtin or using bitwise operations, i.e.:
b[k] += x[st] * c[i] *
        (1 - 2 * (__builtin_popcount(st & signs[i]) & 1));
This takes the last bit of popcount (1 == odd, 0 == even), maps it to -1 or 1 with 1 - 2*bit, and multiplies c[i] by it, which gives the same result as the conditional -c[i] : c[i].
This avoids instruction jumps in cases where the sign-flipping conditional isn't optimized.
P.S.
Note that the ^ operator cannot be applied to a double operand, so XORing a bit mask directly into c[i] would require type punning; the ±1 multiply keeps the trick simple and portable.
Upvotes: 0
Reputation: 25895
All your commands are quick O(1) commands in the loop. The if is definitely optimized, and so is your for+if, as long as all your commands are of the form r[i] = something_quick. The question may boil down to just how big a_really_big_int really needs to be.
A quick int main that just goes from INT_MIN to INT_MAX, summing into a long variable, takes ~10 seconds for me on the Ubuntu subsystem on Windows. Your commands may multiply this by a few, which quickly gets to a minute. Bottom line, this may not be avoidable if you really are iterating a ton.
If the r[i] are calculated independently, this would be a classic use case for threading / multi-processing.
EDIT:
I think % 2 is optimized anyway by the compiler, but if not, take care: x & 1 is a much faster odd/even check.
Upvotes: 1
Reputation: 57764
One obvious way to optimize the code is to pull the conditional outside the loop:
if (s)
    for (i = 0; i < a_really_big_int; ++i) {
        r[i] = x ^ i;
    }
else
    for (i = 0; i < a_really_big_int; ++i) {
        r[i] = x ^ a[i];
    }
A shrewd compiler might be able to change that into r[] assignments of more than one element at a time.
Upvotes: 5