Does this optimization even matter?

This page recommends "loop unrolling" as an optimization:

Loop overhead can be reduced by reducing the number of iterations and replicating the body of the loop.

Example:

In the code fragment below, the body of the loop can be replicated once and the number of iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++)
  g ();
Below is the code fragment after loop unrolling.
for (i = 0; i < 100; i += 2)
{
  g ();
  g ();
}

With GCC 5.2, loop unrolling isn't enabled unless you use -funroll-loops (it's not enabled in either -O2 or -O3). I've inspected the assembly to see if there's a significant difference.

g++ -std=c++14 -O3 -funroll-loops -c -Wall -pedantic -pthread main.cpp && objdump -d main.o

Version 1:

   0:   ba 64 00 00 00          mov    $0x64,%edx
   5:   0f 1f 00                nopl   (%rax)
   8:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # e <main+0xe>
   e:   83 c0 01                add    $0x1,%eax
  # ... etc ...
  a1:   83 c1 01                add    $0x1,%ecx
  a4:   83 ea 0a                sub    $0xa,%edx
  a7:   89 0d 00 00 00 00       mov    %ecx,0x0(%rip)        # ad <main+0xad>
  ad:   0f 85 55 ff ff ff       jne    8 <main+0x8>
  b3:   31 c0                   xor    %eax,%eax
  b5:   c3                      retq

Version 2:

   0:   ba 32 00 00 00          mov    $0x32,%edx
   5:   0f 1f 00                nopl   (%rax)
   8:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # e <main+0xe>
   e:   83 c0 01                add    $0x1,%eax
  11:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 17 <main+0x17>
  17:   8b 0d 00 00 00 00       mov    0x0(%rip),%ecx        # 1d <main+0x1d>
  1d:   83 c1 01                add    $0x1,%ecx
  # ... etc ...
 143:   83 c7 01                add    $0x1,%edi
 146:   83 ea 0a                sub    $0xa,%edx
 149:   89 3d 00 00 00 00       mov    %edi,0x0(%rip)        # 14f <main+0x14f>
 14f:   0f 85 b3 fe ff ff       jne    8 <main+0x8>
 155:   31 c0                   xor    %eax,%eax
 157:   c3                      retq

Version 2 produces more iterations. What am I missing?

Upvotes: 2

Answers (2)

Davislor

Reputation: 15154

It doesn’t produce more iterations; you’ll notice that the loop that calls g() twice runs half as many times. (What if you have to call g() an odd number of times? Look up Duff’s Device.)

In your listings, you’ll notice that the assembly-language instruction jne 8 <main+0x8> appears once in both. This tells the processor to go back to the start of the loop. In the original loop, this instruction will run 99 times. In the rolled loop, it will only run 49 times. Imagine if the body of the loop is something very short, just one or two instructions. These jumps might be a third or even half of the instructions in the most performance-critical part of your program! (And there is even a useful loop with zero instructions: BogoMIPS. But the article about optimizing that was a joke.)

So, unrolling the loop trades speed for code size, right? Not so fast. Maybe you’ve made your unrolled loop so big that the code at the top of the loop is no longer in the cache, and the CPU needs to fetch it. In the real world, the only way to know if it helps is to profile your program.

Upvotes: 1

Thomas Matthews

Reputation: 57749

Yes, there are cases where loop unrolling will make the code more efficient.

The theory is reduce the less overhead (branching to top of loop and incrementing loop counter).

Most processors hate branch instructions. They love data processing instructions. For every iteration, there is a minimum of one branch instruction. By "duplicating" a set of code, the number of branches is reduced and the data processing instructions is increased between branches.

Many modern compilers have optimization settings to perform loop unrolling.

Upvotes: 4

Does this optimization even matter?

Answers (2)

Related Questions