Reputation: 21
This page recommends "loop unrolling" as an optimization:
Loop overhead can be reduced by reducing the number of iterations and replicating the body of the loop.
Example:
In the code fragment below, the body of the loop can be replicated once and the number of iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++) g ();
Below is the code fragment after loop unrolling.
for (i = 0; i < 100; i += 2) { g (); g (); }
With GCC 5.2, loop unrolling isn't enabled unless you use -funroll-loops
(it's not enabled in either -O2
or -O3
). I've inspected the assembly to see if there's a significant difference.
g++ -std=c++14 -O3 -funroll-loops -c -Wall -pedantic -pthread main.cpp && objdump -d main.o
Version 1:
0: ba 64 00 00 00 mov $0x64,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
# ... etc ...
a1: 83 c1 01 add $0x1,%ecx
a4: 83 ea 0a sub $0xa,%edx
a7: 89 0d 00 00 00 00 mov %ecx,0x0(%rip) # ad <main+0xad>
ad: 0f 85 55 ff ff ff jne 8 <main+0x8>
b3: 31 c0 xor %eax,%eax
b5: c3 retq
Version 2:
0: ba 32 00 00 00 mov $0x32,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
11: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 17 <main+0x17>
17: 8b 0d 00 00 00 00 mov 0x0(%rip),%ecx # 1d <main+0x1d>
1d: 83 c1 01 add $0x1,%ecx
# ... etc ...
143: 83 c7 01 add $0x1,%edi
146: 83 ea 0a sub $0xa,%edx
149: 89 3d 00 00 00 00 mov %edi,0x0(%rip) # 14f <main+0x14f>
14f: 0f 85 b3 fe ff ff jne 8 <main+0x8>
155: 31 c0 xor %eax,%eax
157: c3 retq
Version 2 produces more iterations. What am I missing?
Upvotes: 2
Views: 119
Reputation: 15134
It doesn’t produce more iterations; you’ll notice that the loop that calls g()
twice runs half as many times. (What if you have to call g()
an odd number of times? Look up Duff’s Device.)
In your listings, you’ll notice that the assembly-language instruction jne 8 <main+0x8>
appears once in both. This tells the processor to go back to the start of the loop. In the original loop, this instruction will run 99 times. In the rolled loop, it will only run 49 times. Imagine if the body of the loop is something very short, just one or two instructions. These jumps might be a third or even half of the instructions in the most performance-critical part of your program! (And there is even a useful loop with zero instructions: BogoMIPS. But the article about optimizing that was a joke.)
So, unrolling the loop trades speed for code size, right? Not so fast. Maybe you’ve made your unrolled loop so big that the code at the top of the loop is no longer in the cache, and the CPU needs to fetch it. In the real world, the only way to know if it helps is to profile your program.
Upvotes: 1
Reputation: 57678
Yes, there are cases where loop unrolling will make the code more efficient.
The theory is reduce the less overhead (branching to top of loop and incrementing loop counter).
Most processors hate branch instructions. They love data processing instructions. For every iteration, there is a minimum of one branch instruction. By "duplicating" a set of code, the number of branches is reduced and the data processing instructions is increased between branches.
Many modern compilers have optimization settings to perform loop unrolling.
Upvotes: 4