Reputation: 321
I am optimizing C for-loops, and I just read up on unrolling and accumulators. If the data in the loop are not dependent on each other, unrolling with multiple accumulators really takes advantage of parallelism, and the code finishes faster.
So my naive thought was, why not add more accumulators and unroll more times?
I did this, and noticed diminishing returns in the reduction of average cycles per element.
My question is why?
A: Is it because we are running out of registers to work with simultaneously, and information needs to be stored in memory?
B: Or is it because the 'cleanup loop' has to process more elements after the unrolled loop?
Is it a combination of A and B?
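For reference, here is a minimal sketch of the kind of transformation I mean (2x unrolling with two accumulators plus a cleanup loop; the plain array sum and the name `sum_unrolled_2x2` are just illustrative, not my actual code):

```c
#include <stddef.h>

/* Illustrative only: 2x unrolling with two accumulators over a plain array sum. */
double sum_unrolled_2x2(const double *a, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i;

    /* Main unrolled loop: the two accumulators carry independent
       dependency chains, so their adds can overlap in the pipeline. */
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }

    /* Cleanup loop: handles the leftover element when n is not a
       multiple of the unroll factor (this is the loop I mean in B). */
    for (; i < n; i++)
        acc0 += a[i];

    return acc0 + acc1;
}
```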
Upvotes: 2
Views: 1147
Reputation: 25865
I'm not sure if I'm just stating the obvious here, but the main reason you're seeing diminishing returns from unrolling is simply that you've largely eliminated the overhead of the loop, and the remaining CPU time is spent almost entirely on the "useful" work that you're doing.
The benefit of unrolling is that you're eliminating the overhead of the loop itself -- that is, index increment, comparisons, branching, &c. -- not that it makes the useful work of the loop any faster. When you've reached the point where the loop overhead is mostly eliminated, it should be obvious that you aren't going to see further improvements from more unrolling.
On the other hand, there are certainly some aspects of further unrolling that make performance worse, such as registers spilling to memory, the I-cache working less efficiently, the loop being too large for the trace cache (on processors that have one), &c.
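To make that concrete, here is a rough sketch using a plain array sum (just for illustration, not your actual loop): in the rolled version every useful add pays an index increment, a compare and a branch, while the 4x unrolled version shares that overhead across four adds, so there is little left for further unrolling to remove.

```c
#include <stddef.h>

/* Rolled: per element you pay one add (useful work) plus roughly one
   increment, one compare and one branch (the loop overhead). */
double sum_rolled(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* 4x unrolled: the same increment/compare/branch overhead is now shared
   by four useful adds; once overhead is this small a fraction of the
   work, unrolling further buys almost nothing. */
double sum_unrolled_4x(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    /* Cleanup for the leftover elements. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```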
Upvotes: 3
Reputation: 11448
More likely A. I ran into this not long ago. I asked myself the same question, and the conclusion I reached was that I had run out of registers, so no more fast accumulators were available. The cleanup code that processes the remaining elements not covered by the unrolled loop runs for much less time than the main unrolled loop, so B contributes very little.
Upvotes: 1