JAN
JAN

Reputation: 21885

Loop unrolling & optimization

Given the code :

for (int i = 0; i < n; ++i) 
{ 
  A(i) ; 
  B(i) ; 
  C(i) ; 
}

And the optimization version :

for (int i = 0; i < (n - 2); i+=3) 
{ 
  A(i) 
  A(i+1) 
  A(i+2) 
  B(i) 
  B(i+1) 
  B(i+2) 
  C(i) 
  C(i+1) 
  C(i+2)
}

Something is not clear to me : which is better ? I can't see anything that works any faster using the other version . Am I missing something here ?

All I see is that each instruction is depending on the previous instruction , meaning that I need to wait that the previous instruction would finish in order to start the one after ...

Thanks

Upvotes: 5

Views: 13975

Answers (5)

P.P
P.P

Reputation: 121427

Loop unrolling is used to reduce the number of jump & branch instructions which could potentially make the loop faster but will increase the size of the binary. Depending on the implementation and platform, either could be faster.

Upvotes: 4

Brady
Brady

Reputation: 10357

Generally its not a good idea to try to "invent" optimizations, unless you have hard evidence that you will gain an increase, because many times you may end up introducing a degradation. Typically the best way to obtain such evidence is with a good profiler. I would test both versions of this code with a profiler to see the difference.

Also, many times loop unrolling isnt very protable, as mentioned previously, it depends greatly on the platform, compiler, etc.

You can additionally play with the compiler options. An interesting gcc option is "-floop-optimize", that you get automatically with "-O, -O2, -O3, and -Os"

EDIT Additionally, look at the "-funroll-loops" compiler option.

Upvotes: 0

Baldy
Baldy

Reputation: 2002

As long as the functions A(), B() and C() don't modify the same datasets, the second verion provides more parallelization options.

In the first version, the three functions could run simultaneously, assuming no interdependencies. In the second version, all three functions could be run with all three datasets at the same time, assuming you had enough execution units to do so and again, no interdependencies.

Upvotes: 2

Nathaniel Ford
Nathaniel Ford

Reputation: 21249

In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.

In the first case, it's something like:

LOCATION_FLAG;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false

In the second it's something like:

LOCATION_FLAG;
DO_SOMETHING;
DO_SOMETHING;
DO_SOMETHING;
TEST FOR LOOP COMPLETION;//Jumps to LOCATION_FLAG if false

You can see in the latter case, the overhead of testing and jumping is only 1 instruction per 3. In the first it's 1 instruction per 1; so it happens a lot more often.

Therefore, if you have invariants you can rely on (an array of mod 3, to use your example) then it is more efficient to unwind loops because the underlying assembly is written more directly.

Upvotes: 9

Johan Kotlinski
Johan Kotlinski

Reputation: 25759

Well, whether this code is "better" or "worse" totally depends on implementations of A, B and C, which values of n you expect, which compiler you are using and which hardware you are running on.

Typically the benefit of loop unrolling is that the overhead of doing the loop (that is, increasing i and comparing it with n) is reduced. In this case, could be reduced by a factor of 3.

Upvotes: 3

Related Questions