Reputation: 321
I am optimizing C for-loops, and I just read up on unrolling and accumulators. If the data in the loop are not dependent on each other, unrolling with multiple accumulators really takes advantage of parallelism, and the code finishes faster.
So my naive thought was, why not add more accumulators and unroll more times?
I did this, and noticed diminishing returns in the reduction of average cycles per element.
My question is why?
A: Is it because we are running out of registers to work with simultaneously, and information needs to be stored in memory?
B: Or is it because the 'cleanup loop' has to process more elements after the unrolled loop?
Is it a combination of A and B?
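For reference, here is a minimal sketch of the kind of transformation I mean (2x unrolling with two accumulators plus a cleanup loop; the plain array sum and the name `sum_unrolled_2x2` are just illustrative, not my actual code):

```c
#include <stddef.h>

/* Illustrative only: 2x unrolling with two accumulators over a plain array sum. */
double sum_unrolled_2x2(const double *a, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i;

    /* Main unrolled loop: the two accumulators carry independent
       dependency chains, so their adds can overlap in the pipeline. */
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }

    /* Cleanup loop: handles the leftover element when n is not a
       multiple of the unroll factor (this is the loop I mean in B). */
    for (; i < n; i++)
        acc0 += a[i];

    return acc0 + acc1;
}
```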
Upvotes: 2
Views: 1147
Reputation: 25865
I'm not sure if I'm just stating the obvious here, but the main reason you're seeing diminishing returns from unrolling is simply that you've largely eliminated the overhead of the loop, and the remaining CPU time is spent almost entirely on the "useful" work that you're doing.
The benefit of unrolling is that you're eliminating the overhead of the loop itself -- that is, index increment, comparisons, branching, &c. -- not that it makes the useful work of the loop any faster. When you've reached the point where the loop overhead is mostly eliminated, it should be obvious that you aren't going to see further improvements from more unrolling.
On the other hand, there are certainly some aspects of further unrolling that make performance worse, such as registers spilling to memory, the I-cache working less efficiently, the loop being too large for the trace cache (on processors that have one), &c.
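To make that concrete, here is a rough sketch using a plain array sum (just for illustration, not your actual loop): in the rolled version every useful add pays an index increment, a compare and a branch, while the 4x unrolled version shares that overhead across four adds, so there is little left for further unrolling to remove.

```c
#include <stddef.h>

/* Rolled: per element you pay one add (useful work) plus roughly one
   increment, one compare and one branch (the loop overhead). */
double sum_rolled(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* 4x unrolled: the same increment/compare/branch overhead is now shared
   by four useful adds; once overhead is this small a fraction of the
   work, unrolling further buys almost nothing. */
double sum_unrolled_4x(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    /* Cleanup for the leftover elements. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```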
Upvotes: 3
Reputation: 11448
More likely A. I ran into this not long ago. I asked myself the same question, and the conclusion I reached was that I had run out of registers, so no more fast accumulators were available. The cleanup code that processes the remaining elements not covered by the unrolled loop runs for much less time than the main unrolled loop, so B contributes very little.
Upvotes: 1