Reputation: 12852
Please note: this question is neither about code quality and ways to improve the code, nor about the (in)significance of the runtime differences. It is about GCC, which compiler optimisation costs performance here, and why.
The following code counts the number of Fibonacci primes up to m:
int main() {
    unsigned int m = 500000000u;
    unsigned int i = 0u;
    unsigned int a = 1u;
    unsigned int b = 1u;
    unsigned int c = 1u;
    unsigned int count = 0u;
    while (a + b <= m) {
        for (i = 2u; i < a + b; ++i) {
            c = (a + b) % i;
            if (c == 0u) {
                i = a + b;
                // break;
            }
        }
        if (c != 0u) {
            count = count + 1u;
        }
        a = a + b;
        b = a - b;
    }
    return count; // Just to "output" (and thus use) count
}
When compiled with g++.exe (Rev2, Built by MSYS2 project) 9.2.0 and no optimisations (-O0), the resulting binary executes (on my machine) in 1.9s. With -O1 and -O3 it takes 3.3s and 1.7s, respectively.
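(For reference, the timings can be reproduced with something along these lines; this is my own invocation sketch, not from the question. Since the program "outputs" via its return value, the shell's exit status, truncated to 8 bits, serves as the result.)

g++ -O1 main.cpp -o fib
time ./fib ; echo $?    # exit status = count modulo 256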
I've tried to make sense of the resulting binaries by looking at the assembly code (godbolt.org) and the corresponding control-flow graph (hex-rays.com/products/ida), but my assembler skills don't suffice.
An explicit break in the innermost if makes the -O1 code fast again:
            if (c == 0u) {
                i = a + b; // Not actually needed any more
                break;
            }
As does "inlining" the loop's progress expression:
        for (i = 2u; i < a + b; ) { // No ++i any more
            c = (a + b) % i;
            if (c == 0u) {
                i = a + b;
                ++i;
            } else {
                ++i;
            }
        }
1. Which optimisation does/could explain the performance drop?
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
Upvotes: 4
Views: 421
Reputation: 2409
The important thing at play here is loop-carried data dependencies.
Look at the machine code of the slow variant of the innermost loop. I'm showing -O2 assembly here; -O1 is less optimized but has similar data dependencies overall:
.L4:
        xorl    %edx, %edx       # zero the upper half of the dividend
        movl    %esi, %eax       # %esi holds a + b
        divl    %ecx             # remainder (a + b) % i lands in %edx
        testl   %edx, %edx       # c == 0 ?
        cmove   %esi, %ecx       # if so, i = a + b (the if-converted store)
        addl    $1, %ecx         # ++i
        cmpl    %ecx, %esi       # loop while i < a + b
        ja      .L4
See how the increment of the loop counter in %ecx depends on the previous instruction (the cmov), which in turn depends on the result of the division, which in turn depends on the previous value of the loop counter. Effectively there is a chain of data dependencies on computing the value in %ecx that spans the entire loop, and since the time to execute the loop dominates, the time to compute that chain decides the execution time of the program.
Adjusting the program to compute the number of divisions reveals that it executes 434044698 div instructions. Dividing the number of machine cycles taken by the program by this number gives 26 cycles in my case, which corresponds closely to the latency of the div instruction plus about 3 or 4 cycles from the other instructions in the chain (the chain is div-test-cmov-add).
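A minimal sketch of that kind of instrumentation, assuming a plain counter is accurate enough (the divs variable is my own addition, not the answerer's; note the extra counter may itself perturb the generated code, so the timing should still be taken from the original program):

#include <cstdio>

int main() {
    unsigned int m = 500000000u;
    unsigned int a = 1u;
    unsigned int b = 1u;
    unsigned int c = 1u;
    unsigned int count = 0u;
    unsigned long long divs = 0ull;        // one increment per executed %
    while (a + b <= m) {
        for (unsigned int i = 2u; i < a + b; ++i) {
            ++divs;
            c = (a + b) % i;
            if (c == 0u) {
                i = a + b;
            }
        }
        if (c != 0u) {
            count = count + 1u;
        }
        a = a + b;
        b = a - b;
    }
    std::printf("count = %u, divisions = %llu\n", count, divs);
    return 0;
}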
In contrast, the -O3 code does not have this chain of dependencies, making it throughput-bound rather than latency-bound: the time to execute the -O3 variant is determined by the time to compute 434044698 independent div instructions.
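The latency-versus-throughput distinction can be seen in a toy pair of loops like this (my own sketch; the function names and the divisor constant are made up, and how much the independent version wins depends on how well pipelined the divider on a given CPU is):

#include <cstdint>

// Latency-bound: each dividend depends on the previous remainder, so the
// divisions form a serial chain and every iteration pays the full div latency.
std::uint64_t chained(std::uint64_t x, long n) {
    for (long k = 0; k < n; ++k) {
        x = x % 1000000007u + 3u;          // next iteration needs this result
    }
    return x;
}

// Throughput-bound: the dividends are independent of earlier remainders,
// so consecutive divisions can overlap in the pipeline.
std::uint64_t independent(std::uint64_t x, long n) {
    std::uint64_t s = 0u;
    for (long k = 0; k < n; ++k) {
        s += (x + static_cast<std::uint64_t>(k)) % 1000000007u; // fresh dividend
    }
    return s;
}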
Finally, to give specific answers to your questions:
1. Which optimisation does/could explain the performance drop?
As another answer mentioned, this is if-conversion creating a loop-carried data dependency where originally there was a control dependency. Control dependencies may be costly too, when they correspond to unpredictable branches, but in this case the branch is easy to predict.
2. Is it possible to explain what triggers the optimisation in terms of the C++ code (i.e. without a deep understanding of GCC's internals)?
Perhaps you can imagine the optimization transforming the code to
        for (i = 2u; i < a + b; ++i) {
            c = (a + b) % i;
            i = (c != 0) ? i : a + b;
        }
where the ternary operator is evaluated on the CPU in such a way that the new value of i is not known until c has been computed.
3. Similarly, is there a high-level explanation for why the alternatives (additional observations) apparently prevent the rogue optimisation?
In those variants the code is not eligible for if-conversion, so the problematic data dependency is not introduced.
Upvotes: 3
Reputation: 98516
I think the problem is -fif-conversion, which instructs the compiler to do a CMOV instead of TEST/JZ for some comparisons. And CMOV is known for being not so great in the general case.
There are two points in the disassembly (that I know of) that are affected by this flag:
First, the if (c == 0u) { i = a + b; } in line 13 is compiled to:
    test edx,edx    //edx is c
    cmove ecx,esi   //esi is (a + b), ecx is i
Second, the if (c != 0u) { count = count + 1u; } is compiled to:
    cmp eax,0x1         //eax is c
    sbb r8d,0xffffffff  //r8d is count, but what???
Nice trick! It is subtracting -1 from count, but with carry, and the carry is only set if c is less than 1, which, being unsigned, means 0. Thus, if eax is 0 it subtracts -1 from count but then subtracts 1 again because of the carry: count does not change. If eax is not 0, then it subtracts -1, which increments the variable.
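In C++ terms, the cmp/sbb pair computes something like the following (my own paraphrase of the machine code, not a line from the original program):

// cmp eax, 1 sets the carry flag exactly when c < 1, i.e. when c == 0;
// sbb r8d, -1 then computes count - (-1) - carry = count + 1 - (c == 0).
count = count + 1u - (c == 0u ? 1u : 0u); // net effect: count += (c != 0u)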
Now, this avoids branches, but at the cost of missing the obvious optimization that if c == 0u you could jump directly to the next while iteration. This one is so easy that it is even done in -O0.
Upvotes: 3
Reputation: 20765
I believe this is caused by the "conditional move" instruction (CMOVcc) that the compiler generates to replace branching when using -O1 and -O2.
When using -O0, the statement if (c == 0u) is compiled to a jump:
    cmp DWORD PTR [rbp-16], 0
    jne .L4
With -O1 and -O2:
    test edx, edx
    cmove ecx, esi
while -O3 produces a jump (similar to -O0):
    test edx, edx
    je .L5
There is a known bug in gcc where "using conditional moves instead of compare and branch result in almost 2x slower code".
As rodrigo suggested in his comment, using the flag -fno-if-conversion tells gcc not to replace branching with conditional moves, hence preventing this performance issue.
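For example, an invocation along these lines (my own sketch; -fno-if-conversion is documented in the GCC manual):

g++ -O1 -fno-if-conversion main.cpp -o fib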
Upvotes: 2