Malte Schwerhoff

Reputation: 12852

Explanation for GCC compiler optimisation's adverse performance effect?

Please note: this question is neither about code quality and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC: which compiler optimisation costs performance here, and why.

The program

The following code counts the number of Fibonacci primes up to m:

int main() {
  unsigned int m = 500000000u;
  unsigned int i = 0u;
  unsigned int a = 1u; 
  unsigned int b = 1u; 
  unsigned int c = 1u;
  unsigned int count = 0u;

  while (a + b <= m) {
    for (i = 2u; i < a + b; ++i) {
      c = (a + b) % i;

      if (c == 0u) {
        i = a + b;
        // break;
      }
    }
    
    if (c != 0u) {
      count = count + 1u;
    } 
  
    a = a + b;
    b = a - b;
  }

  return count; // Just to "output" (and thus use) count
}

When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.

I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.

Additional observations

  1. An explicit break in the innermost if makes the -O1 code fast again:

    if (c == 0u) {
      i = a + b; // Not actually needed any more
      break;
    }
    
  2. As does "inlining" the loop's progress expression:

    for (i = 2u; i < a + b; ) { // No ++i any more
      c = (a + b) % i;
    
      if (c == 0u) {
        i = a + b;
        ++i;
      } else {
        ++i;
      }
    }
    

Questions

  1. Which optimisation does/could explain the performance drop?

  2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?

  3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?

Upvotes: 4

Views: 421

Answers (3)

amonakov

Reputation: 2409

The important thing at play here is loop-carried data dependencies.

Look at the machine code of the slow variant of the innermost loop. I'm showing the -O2 assembly here; -O1 is less optimized, but has similar data dependencies overall:

.L4:
        xorl    %edx, %edx
        movl    %esi, %eax
        divl    %ecx
        testl   %edx, %edx
        cmove   %esi, %ecx
        addl    $1, %ecx
        cmpl    %ecx, %esi
        ja      .L4

See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of the loop counter.

Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.

Adjusting the program to compute the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to the latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
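
For reference, here is a minimal sketch of how such a division count can be obtained at the source level. The unsigned long long counter and the printf reporting are my additions and not part of the original program; each executed % should correspond to one div instruction in the -O1/-O2 code, since the divisor is not a compile-time constant.

    #include <cstdio>

    int main() {
      unsigned int m = 500000000u;
      unsigned int i = 0u;
      unsigned int a = 1u;
      unsigned int b = 1u;
      unsigned int c = 1u;
      unsigned int count = 0u;
      unsigned long long divs = 0ull; // counts how often the % is evaluated

      while (a + b <= m) {
        for (i = 2u; i < a + b; ++i) {
          c = (a + b) % i;
          ++divs;

          if (c == 0u) {
            i = a + b;
          }
        }

        if (c != 0u) {
          count = count + 1u;
        }

        a = a + b;
        b = a - b;
      }

      std::printf("count = %u, divisions = %llu\n", count, divs);
      return 0;
    }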

In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
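
To see the latency-bound versus throughput-bound distinction in isolation, here is a micro-benchmark sketch; the constants, the volatile seed and the loop bodies are illustrative assumptions of mine and are not taken from the original program. The first loop feeds each division's result into the next divisor (a dependency chain), the second performs independent divisions; on typical x86 hardware the chained loop should be noticeably slower, although the exact numbers depend on the CPU.

    #include <chrono>
    #include <cstdio>

    int main() {
      const unsigned long long n = 100000000ull; // divisions per loop

      // Latency-bound: each division's divisor depends on the previous remainder.
      volatile unsigned int seed = 123456789u;   // unknown to the compiler
      unsigned int x = seed;
      auto t0 = std::chrono::steady_clock::now();
      for (unsigned long long k = 0; k < n; ++k) {
        // Keep the divisor nonzero and below the dividend, so every iteration is a real division.
        x = 1000000u + (1000000007u % (x | 1u)) / 2u;
      }
      auto t1 = std::chrono::steady_clock::now();

      // Throughput-bound: the divisions are independent of each other.
      unsigned int sum = 0u;
      auto t2 = std::chrono::steady_clock::now();
      for (unsigned long long k = 0; k < n; ++k) {
        sum += 1000000007u % (static_cast<unsigned int>(k) + 1000000u);
      }
      auto t3 = std::chrono::steady_clock::now();

      std::printf("chained divisions:     %.2fs (x = %u)\n",
                  std::chrono::duration<double>(t1 - t0).count(), x);
      std::printf("independent divisions: %.2fs (sum = %u)\n",
                  std::chrono::duration<double>(t3 - t2).count(), sum);
      return 0;
    }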

Finally, to give specific answers to your questions:

1. Which optimisation does/could explain the performance drop?

As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.

2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?

Perhaps you can imagine the optimization as transforming the code to:

    for (i = 2u; i < a + b; ++i) {
      c = (a + b) % i;
      i = (c != 0) ? i : a + b; 
    }

where the ternary operator is evaluated on the CPU such that the new value of i is not known until c has been computed.

3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?

In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.

Upvotes: 3

rodrigo

Reputation: 98516

I think the problem is -fif-conversion, which instructs the compiler to use a CMOV instead of TEST/JZ for some comparisons. And CMOV is known for not being so great in the general case.

There are two points in the disassembly, that I know of, affected by this flag:

First, the if (c == 0u) { i = a + b; } in line 13 is compiled to:

test   edx,edx //edx is c
cmove  ecx,esi //esi is (a + b), ecx is i

Second, the if (c != 0u) { count = count + 1u; } is compiled to:

cmp    eax,0x1   //eax is c
sbb    r8d,0xffffffff //r8d is count, but what???

Nice trick! It is subtracting -1 from count, but with borrow, and the carry flag is only set if c is less than 1, which, c being unsigned, means c == 0. Thus, if eax is 0 it subtracts -1 from count but then subtracts 1 again: count does not change. If eax is not 0, then it only subtracts -1, which increments the variable.

Now, this avoids branches, but at the cost of missing the obvious optimization: if c == 0u, you could jump directly to the next while iteration. That optimization is so easy that it is even done at -O0.
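
Expressed in C++, the cmp/sbb pair computes a branch-free equivalent of the if. The helper below is only a sketch of the idea (the function name and the asserts are mine), not GCC's literal output:

    #include <cassert>

    // Branch-free equivalent of `if (c != 0u) { count = count + 1u; }`.
    // cmp eax, 0x1 sets the carry (borrow) flag exactly when c < 1u, i.e. c == 0u;
    // sbb r8d, 0xffffffff then computes count - (-1) - borrow = count + 1 - borrow.
    unsigned int branchless_count(unsigned int count, unsigned int c) {
      unsigned int borrow = (c < 1u) ? 1u : 0u;
      return count + 1u - borrow; // same as: count + (c != 0u ? 1u : 0u)
    }

    int main() {
      assert(branchless_count(5u, 0u) == 5u); // c == 0: count unchanged
      assert(branchless_count(5u, 3u) == 6u); // c != 0: count incremented
      return 0;
    }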

Upvotes: 3

Aziz

Reputation: 20765

I believe this is caused by the "conditional move" instruction (CMOVcc) that the compiler generates to replace branching when using -O1 and -O2.

When using -O0, the statement if (c == 0u) is compiled to a jump:

    cmp     DWORD PTR [rbp-16], 0
    jne     .L4

With -O1 and -O2:

    test    edx, edx
    cmove   ecx, esi

while -O3 produces a jump (similar to -O0):

    test    edx, edx
    je      .L5

There is a known bug in gcc where "using conditional moves instead of compare and branch result in almost 2x slower code".

As rodrigo suggested in his comment, using the flag -fno-if-conversion tells gcc not to replace branching with conditional moves, hence preventing this performance issue.

Upvotes: 2
