Reputation: 21996

Loop unrolling optimization, how does this work

Consider this C code:

int sum=0;
for(int i=0;i<5;i++)
    sum+=i;

This could be translated into (pseudo-)assembly like this (without loop unrolling):

% pseudo-code assembly
ADDI $R10, #0   % sum
ADDI $R11, #0   % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
BNE $R11, #5 LOOP

My first question is which of these two forms the compiler produces when it unrolls the loop. Either one addition per iteration:

ADDI $R10, #0   % sum
ADDI $R10, #0
ADDI $R10, #1
ADDI $R10, #2
ADDI $R10, #3
ADDI $R10, #4

or a single addition of the precomputed total:

ADD $R10, #10

Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?

Also, can a branch instruction stall the pipeline? Do I have to write the loop this way:

% pseudo-code assembly
ADDI $R10, #0   % sum
ADDI $R11, #0   % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
NOP   % is this necessary to avoid the pipeline blocking?
NOP
NOP
NOP
BNE $R11, #5 LOOP

That is, to keep the branch from interrupting the fetch-decode-execute-memory-write-back cycle?

Upvotes: 2

Answers (4)

Mike Kwan

Reputation: 24477

This is more for demonstration of what a compiler is capable of, rather than what every compiler would do. The source:

#include <stdio.h>

int main(void)
{
    int i, sum = 0;

    for(i=0; i<5; i++) {
        sum+=i;
    }

    printf("%d\n", sum);
    return 0;
}

Note the printf I have added. If the variable is not used, the compiler will optimize out the entire loop.

Compiling with -O0 (No optimization)

gcc -Wall -O0 -S -c lala.c:

.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)
    jle .L3

The loop runs in a 'dumb' way, with -8(%rbp) holding the variable i and -4(%rbp) holding sum.

Compiling with -O1 (Optimization level 1)

gcc -Wall -O1 -S -c lala.c:

movl    $10, %edx

The loop has been completely removed and replaced with the equivalent value.
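As a rough check on where the constant 10 comes from: the sum 0 + 1 + ... + (n - 1) equals n(n - 1)/2, which is 10 for n = 5. The sketch below compares the original loop with that closed form. The function names are made up for illustration, and this is not a claim about how gcc actually derives the constant.

```c
/* Straightforward loop, as in the source above, with the bound as a parameter. */
int sum_loop(int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Closed form for 0 + 1 + ... + (n - 1).
   Valid for n >= 0 as long as n * (n - 1) fits in an int. */
int sum_closed_form(int n)
{
    return n * (n - 1) / 2;
}
```

Calling either function with 5 gives the value that the -O1 output loads directly.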

When unrolling, the compiler works out how many iterations the loop will run and rewrites it to do the same work in fewer iterations. For example, the loop body might be duplicated so that it appears twice, which halves the number of branches. Such a case in C:

int i = 0, sum = 0;

sum += i;
i++;

for(; i<5;i++) {
    sum+=i;
    i++;
    sum+=i;
}

Notice that one iteration had to be pulled out of the loop. This is because 5 is an odd number, so the work cannot simply be halved by duplicating the body. In this case the loop body runs only twice. The assembly code produced by -O0:

    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    jmp .L2
.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)
    jle .L3

Complete unrolling in C:

for(i=0; i<5;i++) {
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
}

This time the loop is actually entered only once. The assembly produced with -O0:

.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)
    jle .L3

Upvotes: 10

Jeff Mercado

Reputation: 134591

At the basic level, loop unrolling simply means copying the body of the loop multiple times, as appropriate. The compiler may apply other optimizations as well (such as substituting the precomputed result of a calculation), but that would not be considered unrolling the loop; it is potentially replacing it altogether. Ultimately, what happens depends on the compiler and the flags used.

The C code (unrolled only) would look more like this:

int sum = 0;
int i = 0;
for ( ; i < (5 & ~(4-1)); i += 4) /* unrolling 4 iterations */
{
    sum+=(i+0);
    sum+=(i+1);
    sum+=(i+2);
    sum+=(i+3);
}
for ( ; i < 5; i++)
{
    sum+=i;
}

There are plenty of opportunities for the compiler to optimize this further; unrolling is just one step.
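The same pattern can be sketched with the trip count as a parameter instead of the literal 5. The function name sum_unrolled4 is made up for illustration, and the bit mask assumes n is non-negative.

```c
/* Sums 0 .. n-1 with the main loop unrolled four times.
   Assumes n >= 0, so that clearing the two low bits rounds n down
   to a multiple of 4. */
int sum_unrolled4(int n)
{
    int sum = 0;
    int i = 0;
    int limit = n & ~(4 - 1);   /* largest multiple of 4 that is <= n */

    /* Main loop: four copies of the body per branch. */
    for (; i < limit; i += 4) {
        sum += i + 0;
        sum += i + 1;
        sum += i + 2;
        sum += i + 3;
    }

    /* Remainder loop: handles the 0 to 3 leftover iterations. */
    for (; i < n; i++)
        sum += i;

    return sum;
}
```

For n = 5 the main loop runs once (i = 0..3) and the remainder loop runs once (i = 4).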

Upvotes: 2

LeleDumbo

Reputation: 9340

"So my first question is how is this code translated using loop unrolling, between these two ways"

This kind of optimization is usually implemented at the AST level rather than at the output code (e.g. assembly) level. Loop unrolling can be done when the number of iterations is fixed and known at compile time. For instance, suppose I have this AST:

Program
|
+--For
   |
   +--Var
   |  |
   |  +--Variable i
   |
   +--Start
   |  |
   |  +--Constant 1
   |
   +--End
   |  |
   |  +--Constant 3
   |
   +--Statements
      |
      + Print i

The compiler knows that For's Start and End are constants, and can therefore simply copy the Statements once per iteration, replacing every occurrence of Var with its value in that iteration. The AST above would be translated to:

Program
|
+--Print 1
|
+--Print 2
|
+--Print 3
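The transformation on the tree can be sketched in C. Everything here is hypothetical and heavily simplified: struct for_node stands in for the For node, the End bound is treated as inclusive (as the trees above suggest), and instead of building new AST nodes the function just records the constant that replaces Var in each copy of the body.

```c
#include <stddef.h>

/* Hypothetical stand-in for a For node whose bounds are constants. */
struct for_node {
    int start;   /* the Constant under Start */
    int end;     /* the Constant under End, treated as inclusive */
};

/* Writes into out[] the value substituted for the loop variable in each
   copy of the body. Returns the number of copies, or 0 if the body never
   runs or the copies would not fit in out[] (loop is left rolled). */
size_t unroll_for(const struct for_node *loop, int *out, size_t cap)
{
    if (loop->end < loop->start)
        return 0;

    /* Compute the trip count in a wider type to avoid int overflow. */
    size_t count = (size_t)((long long)loop->end - loop->start) + 1;
    if (count > cap)
        return 0;

    for (size_t k = 0; k < count; k++)
        out[k] = loop->start + (int)k;

    return count;
}
```

With start = 1 and end = 3 this yields three copies with the values 1, 2 and 3, matching the three Print nodes.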

"Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?"

Yes, if the compiler implements such a feature. It is actually an improvement over the case above. In your example, after unrolling, the compiler can see that the l-value is the same in every statement while the r-values are all constants. It can therefore apply peephole optimization combined with constant folding to produce a single addition. If the peephole optimization also takes the declaration into account, the code can be reduced further to a single move instruction.
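Written out as C source, the three stages described here might look like the following. The function names are made up for illustration; a real compiler performs these steps on its internal representation, not on source text.

```c
/* Stage 1: loop fully unrolled, loop variable replaced by its constant value. */
int sum_unrolled(void)
{
    int sum = 0;
    sum += 0;
    sum += 1;
    sum += 2;
    sum += 3;
    sum += 4;
    return sum;
}

/* Stage 2: the five constant additions folded into a single addition. */
int sum_folded(void)
{
    int sum = 0;
    sum += 10;
    return sum;
}

/* Stage 3: the initialization and the addition merged into one assignment,
   which corresponds to a single move instruction. */
int sum_single_move(void)
{
    int sum = 10;
    return sum;
}
```

All three functions return the same value, which is what allows the compiler to replace one with the next.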

Upvotes: 2

Jens Gustedt

Reputation: 78973

No general answer is possible here: the result varies between compilers, between versions of the same compiler, and between compiler flags. Use the appropriate option of your compiler to look at the generated assembly. With gcc and relatives this is the -S option.

Upvotes: 0
