Reputation: 2960
I'm an R developer who uses C for algorithmic purposes, and I have a question about why a C loop that looks like it should be slower is actually faster than the alternative approach.
In R, our Boolean type can actually have three values, `true`, `false`, and `na`, and we represent this using an `int` at the C level.
I'm looking into a vectorized `&&` operation (yes, we already have this in R, but bear with me) that also handles the `na` case. The scalar results would look like this:
F && F == F
F && T == F
F && N == F
T && F == F
T && T == T
T && N == N
N && F == F
N && T == N
N && N == N
Note that it works like `&&` in C, except that `na` values propagate when combined with anything except `false`, in which case we "know" that `&&` can never be true, so we return `false`.
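For concreteness, here is a scalar sketch of that table (a hypothetical helper, not part of my real code, using the same `false = 0`, `true = 1`, `na = INT_MIN` int coding as the loops below):

#include <limits.h>

// Scalar three-valued AND: false = 0, true = 1, na = INT_MIN.
static int and3(int a, int b) {
    if (a == 0 || b == 0) return 0;                    // false dominates, even over na
    if (a == INT_MIN || b == INT_MIN) return INT_MIN;  // otherwise na propagates
    return 1;                                          // both true
}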
Now to the implementation. Assume we have two vectors, `v_out` and `v_x`, and we'd like to perform the vectorized `&&` on them. We are allowed to overwrite `v_out` with the result. One option is:
// Option 1
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
        // Done
    } else if (elt_x == 0) {
        v_out[i] = 0;
    } else if (elt_out == na) {
        // Done
    } else if (elt_x == na) {
        v_out[i] = na;
    }
}
And another option is:
// Option 2
for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];

    if (elt_out == 0) {
        continue;
    }

    int elt_x = v_x[i];

    if (elt_x == 0) {
        v_out[i] = 0;
    } else if (elt_out == na) {
        // Done
    } else if (elt_x == na) {
        v_out[i] = na;
    }
}
I sort of expected the second option to be faster, because it avoids accessing `v_x[i]` when it isn't required. But in fact it was twice as slow when compiled with `-O2`!
With the script below, I get the following timing results. Note that I am on a Mac and compile with Clang. At `-O0` the results seem reasonable: the two options are about the same. But at `-O2`, Option 1 is twice as fast!
Option 1, `clang -O0`: 0.110560
Option 2, `clang -O0`: 0.107710
Option 1, `clang -O2`: 0.032223
Option 2, `clang -O2`: 0.070557
What is going on here? My best guess is that it has something to do with the fact that in Option 1, `v_x[i]` is always accessed linearly, which is extremely fast. But in Option 2, `v_x[i]` is essentially accessed randomly (sort of), because it might access `v_x[10]` but then not need another element from `v_x` until `v_x[120]`, and because that access isn't linear, it is probably much slower.
Reproducible script:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>
int main() {
    srand(123);

    int size = 1e7;
    int na = INT_MIN;

    int* v_out = (int*) malloc(size * sizeof(int));
    int* v_x = (int*) malloc(size * sizeof(int));

    // Generate random numbers between 1-3
    // 1 -> false
    // 2 -> true
    // 3 -> na
    for (int i = 0; i < size; ++i) {
        int elt_out = rand() % 3 + 1;

        if (elt_out == 1) {
            v_out[i] = 0;
        } else if (elt_out == 2) {
            v_out[i] = 1;
        } else {
            v_out[i] = na;
        }

        int elt_x = rand() % 3 + 1;

        if (elt_x == 1) {
            v_x[i] = 0;
        } else if (elt_x == 2) {
            v_x[i] = 1;
        } else {
            v_x[i] = na;
        }
    }

    clock_t start = clock();

    // Option 1
    for (int i = 0; i < size; ++i) {
        int elt_out = v_out[i];
        int elt_x = v_x[i];

        if (elt_out == 0) {
            // Done
        } else if (elt_x == 0) {
            v_out[i] = 0;
        } else if (elt_out == na) {
            // Done
        } else if (elt_x == na) {
            v_out[i] = na;
        }
    }

    // // Option 2
    // for (int i = 0; i < size; ++i) {
    //     int elt_out = v_out[i];
    //
    //     if (elt_out == 0) {
    //         continue;
    //     }
    //
    //     int elt_x = v_x[i];
    //
    //     if (elt_x == 0) {
    //         v_out[i] = 0;
    //     } else if (elt_out == na) {
    //         // Done
    //     } else if (elt_x == na) {
    //         v_out[i] = na;
    //     }
    // }

    clock_t end = clock();
    double time = (double) (end - start) / CLOCKS_PER_SEC;

    free(v_out);
    free(v_x);

    printf("%f\n", time);

    return 0;
}
Based on a few questions in the comments, here are a few points of clarifications for future readers:
I am on a 2018 15-inch MacBook Pro with a 2.9 GHz 6-core Intel i9-8950HK (Coffee Lake).
My particular Clang version that I tested with is Apple clang version 13.1.6 (clang-1316.0.21.2.5) with Target: x86_64-apple-darwin21.6.0.
I am restricted by R to use `int` as the data type (even though there are more efficient options) and the following coding: `false = 0`, `true = 1`, `na = INT_MIN`. The reproducible example that I provided respects this.
The original question was not actually a request to make the code run faster. I just wanted to get an idea of what the difference was between my two if/else approaches. That said, some answers have shown that branchless approaches can be much faster, and I really appreciate the explanations that those users have provided! That has greatly influenced the final version of the implementation I am working on.
Upvotes: 62
Views: 7058
Reputation: 15164
Let’s take a look at what these code samples compile to, on Clang 15.0.0 with `-std=c17 -O3 -march=x86-64-v3`. Other compilers will generate slightly different code; it’s finicky.
Factoring out your code snippets into functions, we get
#include <limits.h>
#include <stddef.h>
#define na INT_MIN
int* filter1( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for ( size_t i = 0; i < size; ++i) {
        int elt_out = v_out[i];
        int elt_x = v_x[i];

        if (elt_out == 0) {
            // Done
        } else if (elt_x == 0) {
            v_out[i] = 0;
        } else if (elt_out == na) {
            // Done
        } else if (elt_x == na) {
            v_out[i] = na;
        }
    }

    return v_out;
}
int* filter2( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for (int i = 0; i < size; ++i) {
        int elt_out = v_out[i];

        if (elt_out == 0) {
            continue;
        }

        int elt_x = v_x[i];

        if (elt_x == 0) {
            v_out[i] = 0;
        } else if (elt_out == na) {
            // Done
        } else if (elt_x == na) {
            v_out[i] = na;
        }
    }

    return v_out;
}
Your option 1, `filter1` here, compiles to a vectorized loop on Clang 15. (GCC 12 has trouble with it.) The loop body compiles to:
.LBB0_8: # =>This Inner Loop Header: Depth=1
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi - 32]
vmovdqu ymm4, ymmword ptr [rdx + 4*rsi]
vpcmpeqd ymm5, ymm3, ymm0
vpcmpeqd ymm6, ymm4, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm4, ymm4, ymm2
vpand ymm3, ymm3, ymm4
vpandn ymm4, ymm5, ymm6
vpandn ymm5, ymm5, ymm7
vpand ymm3, ymm5, ymm3
vpand ymm5, ymm3, ymm2
vpor ymm3, ymm3, ymm4
vpmaskmovd ymmword ptr [r10 + 4*rsi - 32], ymm3, ymm5
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi]
vmovdqu ymm4, ymmword ptr [rdx + 4*rsi + 32]
vpcmpeqd ymm5, ymm3, ymm0
vpcmpeqd ymm6, ymm4, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm4, ymm4, ymm2
vpand ymm3, ymm3, ymm4
vpandn ymm4, ymm5, ymm6
vpandn ymm5, ymm5, ymm7
vpand ymm3, ymm5, ymm3
vpand ymm5, ymm3, ymm2
vpor ymm3, ymm3, ymm4
vpmaskmovd ymmword ptr [r10 + 4*rsi], ymm3, ymm5
add rsi, 16
add r9, -2
jne .LBB0_8
So we see the compiler optimized the loop to a series of SIMD comparisons (`vpcmpeqd` instructions) to generate a bitmask that it will then use to do conditional moves with `vpmaskmovd`. This looks more complex than it really is, because it’s partly unrolled to do two consecutive updates per iteration.
You will note that there are no branches, other than the test at the bottom of the loop for whether we are at the end of the array. However, because of conditional moves, we can sometimes get a cache miss on a load or a store. That is what I think sometimes happened in my tests.
Now let’s look at option 2:
.LBB1_8: # =>This Inner Loop Header: Depth=1
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi - 32]
vpcmpeqd ymm4, ymm3, ymm0
vpxor ymm5, ymm4, ymm1
vpmaskmovd ymm5, ymm5, ymmword ptr [r11 + 4*rsi - 32]
vpcmpeqd ymm6, ymm5, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm5, ymm5, ymm2
vpand ymm3, ymm3, ymm5
vpandn ymm5, ymm4, ymm6
vpandn ymm4, ymm4, ymm7
vpand ymm3, ymm4, ymm3
vpand ymm4, ymm3, ymm2
vpor ymm3, ymm3, ymm5
vpmaskmovd ymmword ptr [r10 + 4*rsi - 32], ymm3, ymm4
vmovdqu ymm3, ymmword ptr [r10 + 4*rsi]
vpcmpeqd ymm4, ymm3, ymm0
vpxor ymm5, ymm4, ymm1
vpmaskmovd ymm5, ymm5, ymmword ptr [r11 + 4*rsi]
vpcmpeqd ymm6, ymm5, ymm0
vpxor ymm7, ymm6, ymm1
vpcmpgtd ymm3, ymm3, ymm2
vpcmpeqd ymm5, ymm5, ymm2
vpand ymm3, ymm3, ymm5
vpandn ymm5, ymm4, ymm6
vpandn ymm4, ymm4, ymm7
vpand ymm3, ymm4, ymm3
vpand ymm4, ymm3, ymm2
vpor ymm3, ymm3, ymm5
vpmaskmovd ymmword ptr [r10 + 4*rsi], ymm3, ymm4
add rsi, 16
add r9, -2
jne .LBB1_8
Similar code on this compiler, but slightly longer. One difference is a conditional move from the `v_x` vector.
However, that is with `-march=x86-64-v3`. If you don’t tell it it’s allowed to use AVX2 instructions, such as `vpmaskmovd`, Clang 15.0.0 will give up on vectorizing this version of the algorithm at all.
For comparison, we could refactor this code, taking advantage of the fact that the updated value of `v_out[i]` will always be equal to either `v_out[i]` or `v_x[i]`:
int* filter3( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for ( size_t i = 0; i < size; ++i) {
        const int elt_out = v_out[i];
        const int elt_x = v_x[i];

        v_out[i] = (elt_out == 0)  ? elt_out :
                   (elt_x == 0)    ? elt_x :
                   (elt_out == na) ? elt_out :
                   (elt_x == na)   ? elt_x :
                                     elt_out;
    }

    return v_out;
}
And this gets us some very different code:
.LBB2_7: # =>This Inner Loop Header: Depth=1
vmovdqu ymm6, ymmword ptr [rax + 4*rsi]
vmovdqu ymm4, ymmword ptr [rax + 4*rsi + 32]
vmovdqu ymm3, ymmword ptr [rax + 4*rsi + 64]
vmovdqu ymm2, ymmword ptr [rax + 4*rsi + 96]
vmovdqu ymm7, ymmword ptr [rdx + 4*rsi]
vmovdqu ymm8, ymmword ptr [rdx + 4*rsi + 32]
vmovdqu ymm9, ymmword ptr [rdx + 4*rsi + 64]
vmovdqu ymm5, ymmword ptr [rdx + 4*rsi + 96]
vpcmpeqd ymm10, ymm6, ymm0
vpcmpeqd ymm11, ymm4, ymm0
vpcmpeqd ymm12, ymm3, ymm0
vpcmpeqd ymm13, ymm2, ymm0
vpcmpeqd ymm14, ymm7, ymm0
vpor ymm10, ymm10, ymm14
vpcmpeqd ymm14, ymm8, ymm0
vpor ymm11, ymm11, ymm14
vpcmpeqd ymm14, ymm9, ymm0
vpor ymm12, ymm12, ymm14
vpcmpeqd ymm14, ymm5, ymm0
vpcmpeqd ymm7, ymm7, ymm1
vblendvps ymm7, ymm6, ymm1, ymm7
vpor ymm13, ymm13, ymm14
vpcmpeqd ymm6, ymm6, ymm1
vpandn ymm6, ymm10, ymm6
vpandn ymm7, ymm10, ymm7
vpcmpeqd ymm8, ymm8, ymm1
vblendvps ymm8, ymm4, ymm1, ymm8
vpcmpeqd ymm4, ymm4, ymm1
vpcmpeqd ymm9, ymm9, ymm1
vblendvps ymm9, ymm3, ymm1, ymm9
vpandn ymm4, ymm11, ymm4
vpandn ymm8, ymm11, ymm8
vpcmpeqd ymm3, ymm3, ymm1
vpandn ymm3, ymm12, ymm3
vpandn ymm9, ymm12, ymm9
vpcmpeqd ymm5, ymm5, ymm1
vblendvps ymm5, ymm2, ymm1, ymm5
vpcmpeqd ymm2, ymm2, ymm1
vpandn ymm2, ymm13, ymm2
vpandn ymm5, ymm13, ymm5
vblendvps ymm6, ymm7, ymm1, ymm6
vblendvps ymm4, ymm8, ymm1, ymm4
vblendvps ymm3, ymm9, ymm1, ymm3
vblendvps ymm2, ymm5, ymm1, ymm2
vmovups ymmword ptr [rax + 4*rsi], ymm6
vmovups ymmword ptr [rax + 4*rsi + 32], ymm4
vmovups ymmword ptr [rax + 4*rsi + 64], ymm3
vmovups ymmword ptr [rax + 4*rsi + 96], ymm2
add rsi, 32
cmp r11, rsi
jne .LBB2_7
Although this looks longer, it is updating four vectors on each iteration, and it is in fact blending the `v_out` and `v_x` vectors with a bitmask. The GCC 12.2 version of this loop follows similar logic with one update per iteration, so it is more concise:
.L172:
vmovdqu ymm3, YMMWORD PTR [rcx+rax]
vpcmpeqd ymm0, ymm2, YMMWORD PTR [rsi+rax]
vpcmpeqd ymm1, ymm3, ymm2
vpcmpeqd ymm6, ymm3, ymm4
vpcmpeqd ymm0, ymm0, ymm2
vpcmpeqd ymm1, ymm1, ymm2
vpand ymm0, ymm0, ymm1
vpcmpeqd ymm1, ymm4, YMMWORD PTR [rsi+rax]
vpor ymm1, ymm1, ymm6
vpand ymm6, ymm0, ymm1
vpandn ymm1, ymm1, ymm0
vpxor ymm0, ymm0, ymm5
vpblendvb ymm0, ymm3, ymm2, ymm0
vpblendvb ymm0, ymm0, ymm3, ymm1
vpblendvb ymm0, ymm0, ymm4, ymm6
vmovdqu YMMWORD PTR [rcx+rax], ymm0
add rax, 32
cmp rdx, rax
jne .L172
This, as you see, is roughly as tight as a rolled-up version of 1 and 3 that did one update per iteration, but some optimizers seem to have less trouble with it. A similar version, whose code differs mainly in register allocations, would be:
int* filter4( const size_t size,
              int v_out[size],
              const int v_x[size] )
{
    for ( size_t i = 0; i < size; ++i) {
        const int elt_out = v_out[i];
        const int elt_x = v_x[i];

        v_out[i] = (elt_out == 0)  ? 0 :
                   (elt_x == 0)    ? 0 :
                   (elt_out == na) ? na :
                   (elt_x == na)   ? na :
                                     elt_out;
    }

    return v_out;
}
What seems to have happened is that your compiler was able to vectorize your version 1, but not your version 2, on the settings you were using. If it can vectorize both, they perform similarly.
In 2022, a compiler with aggressive optimization settings can turn any of these loops into vectorized branchless code, at least if you enable AVX2. If you do, the second version is capable, as you thought, of loading from `v_x` conditionally. (This does lead to a big observable difference when you initialize `v_out` to all-zeroes.) Compilers in 2022 seem to do better with the single assignment statements of versions 3 and 4 than the `if` blocks of 1 and 2: they vectorize on some targets and settings on which 1 and 2 do not, and even when all four do, Clang 15.0.0 unrolls 3 and 4 more aggressively than 1 and 2.
With AVX-512 instructions enabled, the compiler can optimize all four versions to similar branchless code, and there isn't any significant difference in performance. With other targets (in particular, `-O3 -march=x86-64-v2` and `-O3 -march=x86-64-v3`), Clang 15.0.0 does significantly better with versions 3 and 4 than with 1 and 2.
However, if you are willing to change the behavior of the function for some inputs, you can remove the comparisons and conditional moves for further speedup, as in Peter Cordes’ and Karl Knechtel's answers. Here, I wanted to compare like to like.
In my testing, which version was faster depended heavily on what the input values were initialized to. With the same random seed you used, `filter1` was slightly faster than the other three, but with truly randomized data, any of the four could be faster.
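If anyone wants to reproduce this, a harness along these lines (a hypothetical helper, not my exact test setup) can time the variants with `clock()` the same way the question's script does; each variant should be given its own freshly filled copy of `v_out`, since the functions overwrite it:

#include <stddef.h>
#include <time.h>

// Time one filter variant; f can be any of filter1..filter4 above.
static double time_filter( int* (*f)(size_t, int*, const int*),
                           size_t size, int* v_out, const int* v_x )
{
    clock_t start = clock();
    f(size, v_out, v_x);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}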
Upvotes: 12
Reputation: 365812
If you want fast vectorized code, don't do short-circuit evaluation, and don't branch in general. You want the compiler to be able to do 16 or 32 elements at once with SIMD operations, using 8-bit elements. (Compilers can optimize `if`s to branchless code if it's safe to do the work unconditionally, including dereferences, and there aren't side effects. This is called if-conversion, and is typically necessary for auto-vectorization of code like this.)
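As a sketch of what if-conversion looks like (hypothetical functions, not the question's code): both loops below compute the same thing, but the second reads `b[i]` and writes `out[i]` unconditionally, which is the shape the vectorizer needs. The compiler may only perform this rewrite itself if it can prove the extra accesses are safe and have no side effects.

// Branchy form: b[i] is only read, and out[i] only written, when out[i] != 0.
void branchy(int *out, const int *b, int n) {
    for (int i = 0; i < n; i++)
        if (out[i] != 0)
            out[i] = b[i];
}

// If-converted form: every b[i] is read and every out[i] is rewritten,
// selecting between the old and new value without a branch.
void branchless(int *out, const int *b, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (out[i] != 0) ? b[i] : out[i];
}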
And you don't want the compiler to worry that it's not allowed to touch some memory because the C abstract machine doesn't. E.g., if all the `v_out[i]` elements are false, `v_x` might be a NULL pointer without causing UB! So the compiler can't invent read access to objects that the C logic doesn't read at all.
If `v_x` were truly an array, not just a pointer, the compiler would know that it's readable and would be allowed to invent accesses to it by doing if-conversion of the short-circuit logic to branchless. But if its cost heuristics don't see a really big benefit (like auto-vectorization), it might choose not to. Branchy code will often in practice be slower with a random mix of trues and falses (and NAs).
As you can see in the compiler's assembly output (Clang 15 `-O2` on Compiler Explorer), option 1 auto-vectorizes with SIMD, branchlessly handling 4 optional-bools in parallel (with just SSE2, more with `-march=native`). (Thanks to Richard in comments for cooking up the Compiler Explorer link; it's probably reflective of what Apple Clang will do to your real code in `main`.)
Your 3-state bool that supports an NA state can be implemented with 2 bits, in a way that bitwise AND does your `&&` operation. You can store arrays of it as one per `unsigned char`, or packed 4 per char to quadruple your throughput for vectorized operations, at the cost of slower access. (Or in general `CHAR_BIT/2` per `char`, but on mainstream C implementations for x86 that's 4.)
F = 00
N = 10 (in binary, so C `0b10` aka `2`)
T = 11
Convert to `bool` with `val & 1`. Convert from `bool` with `0b11 * b` or something to broadcast the low bit to both positions.
`F & anything = 0` because F is all-zero bits. `N&N == N`; that's trivially true for any bit-pattern.
The "clever" part is that `N&T = T&N = N`, since the set bits in T are a superset of those in N.
This also works for `||` with bitwise `|`: `F|N == N` and `F|T == T` because `0|x == x`. Also `x|x == x` for any same input, so we're still fine there. `N = 0b10` won't set the low bit when ORing, but will clear it when ANDing.
I forgot you said C instead of C++, so the bare-bones C++ class wrapper below (only sufficient to demo a few test callers) might not be relevant, but a loop doing `c1[i] &= c2[i];` in plain C for `unsigned char *c1, *c2` will auto-vectorize exactly the same way, as in the following sketch.
struct NBool{ // Nullable bool, should probably rename to optional bool
    unsigned char val;
    static const unsigned char F = 0b00;
    static const unsigned char T = 0b11;
    static const unsigned char N = 0b10;   // N&T = N; N&N = N; N&F = F

    auto operator &=(NBool rhs){   // define && the same way if you want, as non-short-circuiting
        val &= rhs.val;
        return *this;
    }
    operator bool() { return val & 1; }

    constexpr NBool(unsigned char x) : val(x) {};
    constexpr NBool& operator=(const NBool &) = default;
};
#include <stdint.h>
#include <stdlib.h>
bool test(NBool a){
    return a;
}

bool test2(NBool a){
    NBool b = NBool::F;
    return a &= b;   // return false
}

void foo(size_t len, NBool *a1, NBool *a2 )
{
    for (std::size_t i = 0 ; i < len ; i++){
        a1[i] &= a2[i];
    }
}
(I think "Nullable" isn't really correct terminology for something that can have be NaN / NA; it's always safe to read, and it's not a reference in the first place. Maybe optional_bool, like C++ std::optional
which is a value that may or may not be present.)
This compiles on Compiler Explorer with GCC and Clang. Clang auto-vectorizes fairly nicely with an unrolled loop doing `vandps`. (A bit of a weird choice by clang; on `-march=haswell`, `vpand` has better throughput.) But it is still limited by 1/clock store and 2/clock load anyway; this very much bottlenecks on load/store with such low computational intensity, even if the data is hot in L1d cache.
(Intel's optimization manual says that even though Skylake's peak L1d bandwidth is 2 loads + 1 store per clock (96 bytes with 32-byte vectors), the sustained bandwidth is more like 84 bytes per clock.)
It can still come relatively close to 32 bytes ANDed per clock cycle, with AVX. So that's 32 NBool `&` operations, or 128 per clock if you pack 4 NBools per byte.
Packing NBools down to a packed bitmap of 1-bit bools could be done with `pslld xmm, 7` / `pmovmskb` to extract the low bit of each byte (after shifting it to the high bit).
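A rough sketch of that idea with SSE2 intrinsics (hypothetical helper; assumes one NBool per byte with the truth value in the low bit):

#include <emmintrin.h>
#include <stdint.h>

// Pack 16 byte-sized NBools into a 16-bit bool bitmap.
static uint16_t pack16_nbool(const unsigned char *p) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    v = _mm_slli_epi32(v, 7);                // move each byte's low bit into its high bit
    return (uint16_t)_mm_movemask_epi8(v);   // collect the high bit of every byte
}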
If stored 4 per byte, some SIMD bit-manipulation is in order to pack to bools, perhaps `vpshufb` as a 4-bit LUT to pack pairs of NBools down to a pair of bools in the bottom of a nibble, then combine? Or use scalar BMI2 `pext` to extract every other bit from 64 bits, if you're on Zen 3 or Haswell or later, for fast `pext`.
Upvotes: 70
Reputation: 61643
Note that it works like && in C except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.
Rather than represent the values as a strict enumeration, allow a numeric value of either `2` or `3` to represent `na` (you can check this when displaying, or have a normalization step after all the number-crunching; see the sketch after the loop below). This way, no conditional logic (and thus no costly branch prediction) is needed: we simply logical-or the bit in the 2s place (regardless of the operator), and logical-and (or whichever operator) the bit in the 1s place.
int is_na(int value) {
    return value & 2;
}

void r_and_into(unsigned* v_out, unsigned* v_x, int size) {
    for (int i = 0; i < size; ++i) {
        unsigned elt_out = v_out[i];
        unsigned elt_x = v_x[i];
        // this can probably be micro-optimized somehow....
        v_out[i] = (elt_out & elt_x & 1) | ((elt_out | elt_x) & 2);
    }
}
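A sketch of the normalization step mentioned above (hypothetical function; it just collapses the two possible na codes to a single canonical value):

void r_normalize(unsigned* v, int size) {
    for (int i = 0; i < size; ++i) {
        if (v[i] & 2) {
            v[i] = 2;  // 2 and 3 both mean na; pick 2 as the canonical code
        }
    }
}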
If we are forced to use `INT_MIN` to represent the N/A value, we can start by observing what that looks like in two's complement: it has exactly one bit set (the sign bit, which would be most-significant in unsigned values). Thus, we can use that bit value instead of `2` with the same kind of unconditional logic, and then correct any (`INT_MIN | 1`) results to `INT_MIN`:
#include <limits.h>

const unsigned MSB_FLAG = (unsigned)INT_MIN;

void r_and_into(int* v_out, int* v_x, int size) {
    for (int i = 0; i < size; ++i) {
        unsigned elt_out = (unsigned)v_out[i];
        unsigned elt_x = (unsigned)v_x[i];
        elt_out = (elt_out & elt_x & 1) | ((elt_out | elt_x) & MSB_FLAG);
        // if the high bit is set, clear the low bit,
        // i.e., AND the low bit with the negation of the high bit.
        v_out[i] = (int)(elt_out & ~(elt_out >> 31));
    }
}
(All these casts might not be necessary, but I think it is good practice to use unsigned types for bitwise manipulations. They should all get optimized away anyway.)
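A quick smoke test of the `INT_MIN` variant (assuming the second `r_and_into` above is in the same file):

#include <limits.h>
#include <stdio.h>

int main(void) {
    int a[3] = { 1,       INT_MIN, 0 };
    int b[3] = { INT_MIN, INT_MIN, 1 };
    r_and_into(a, b, 3);
    // Expected: INT_MIN, INT_MIN, 0  (T && N == N, N && N == N, F && T == F)
    for (int i = 0; i < 3; ++i) {
        printf("%d\n", a[i]);
    }
    return 0;
}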
Upvotes: 18
Reputation: 181714
Why is this seemingly slower C loop actually twice as fast as the other way?
At a high level, it is a quirk of the compiler and execution environment you are using. Unless array `v_x` is declared `volatile`, the compiler is free to interpret the two variations on your code exactly the same way.
I sort of expected the second option to be faster because it avoids accessing `v_x[i]` when it isn't required.
And if the compiler's optimizer judged that to be true, then it could make use of that judgement to conditionally avoid reading `v_x[i]` in conjunction with the first code.
But at a lower level, if the compiler generates code that indeed does conditionally avoid reading `v_x[i]` in option 2 but not in option 1, then you are probably observing the effects of branch misprediction in the option 2 case. It is entirely plausible that it is cheaper on average to read `v_x[i]` unconditionally than to suffer large numbers of branch misprediction penalties involving whether it should be read.
One of the takeaways is that on modern hardware, branches can be a lot more expensive than one might expect, especially when they are difficult for the CPU to predict. Where the same computation can be performed via a branchless approach, that may yield a performance win in practice, usually at a cost in source code clarity. @KarlKnechtel's answer presents a possible branchless variation on the computation you are trying to perform (branchless but for the `for` loop condition, which is pretty predictable).
Upvotes: 23