Reputation: 2017
Which way is better, and why?
Case 1:

for (i = 0; i < 100; i++)
    for (j = 0; j < 10; j++)
        printf("Hello");

Case 2:

for (i = 0; i < 10; i++)
    for (j = 0; j < 100; j++)
        printf("Hello");
Upvotes: 1
Views: 1983
Reputation: 213842
Generally speaking, neither form is better or faster. The compiler might even optimize both versions into code which only uses one loop, in which case both versions will yield identical machine code.
EDIT
I compiled both versions with gcc -O3, and they produced identical (though cryptic) machine code (x86-64):
0x00402CF0 push %rsi
0x00402CF1 push %rbx
0x00402CF2 sub $0x28,%rsp
0x00402CF6 mov $0xa,%esi
0x00402CFB callq 0x4022f0 <__main>
0x00402D00 mov $0x64,%ebx
0x00402D05 lea 0x12f4(%rip),%rcx # 0x404000
0x00402D0C callq 0x402ba8 <printf>
0x00402D11 sub $0x1,%ebx
0x00402D14 jne 0x402d05 <main+21>
0x00402D16 sub $0x1,%esi
0x00402D19 jne 0x402d00 <main+16>
0x00402D1B xor %eax,%eax
0x00402D1D add $0x28,%rsp
0x00402D21 pop %rbx
0x00402D22 pop %rsi
0x00402D23 retq
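For reference, a listing in this format can be produced from the compiled binary with a disassembler, for example:

gcc -O3 test.c -o test
objdump -d test

(objdump prints AT&T syntax by default, which matches the listing above; GDB's disassemble command gives similar output.)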
Code used for benchmarking, compiled with gcc -std=c11 -pedantic-errors -Wall -Wextra -O3:
#include <stdio.h>

#define I 100 // only change these 2 constants between builds
#define J 10

int main (void)
{
    for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++)
            printf("Hello");
    return 0;
}
Efficiency problems only arise when you do things like this:

// BAD, enforces poor data cache utilization:
// the inner loop jumps across rows, so consecutive writes
// are far apart in memory
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        array[j][i] = something;

// BAD, same problem: the inner loop varies the row index
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        array[i][j] = something;

// GOOD, optimized for the data cache:
// the inner loop walks consecutive addresses within one row
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        array[i][j] = something;
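If you want to see this effect for yourself, you can time the two traversal orders. A minimal sketch (the size N, the fill helper, and the final printf that keeps the stores observable are my additions; exact numbers vary by machine, and some compilers can interchange such loops at high optimization levels):

#include <stdio.h>
#include <time.h>

#define N 4096

static int array[N][N];

// Fill the matrix either row by row (consecutive addresses,
// cache-friendly) or column by column (large stride, cache-hostile)
// and return the elapsed CPU time in seconds.
static double fill (int by_rows)
{
    clock_t start = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            if (by_rows)
                array[i][j] = i + j;
            else
                array[j][i] = i + j;
        }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main (void)
{
    printf("row-major:    %f s\n", fill(1));
    printf("column-major: %f s\n", fill(0));
    printf("%d\n", array[1][2]); // read back so the stores can't be removed
    return 0;
}

On a typical desktop machine the column-major version comes out noticeably slower, because nearly every write touches a different cache line.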
Upvotes: 2
Reputation: 244752
Assuming we're talking about the case where the code is compiled with optimization enabled, since it is meaningless to talk about "efficiency" or "performance" when optimizations are disabled…
These should compile to identical object code. All the loop bounds are compile-time constants, so the compiler can determine exactly how many times the body of the loop will be executed, fold everything into a single loop, and emit that code. If it wanted to (and it wouldn't, because that would waste an enormous amount of code space for no significant gain in speed), it could even emit 10,000 sequential calls to the printf function. That is just loop unrolling taken to its extreme, and almost any optimizing compiler does some amount of unrolling nowadays.
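To illustrate, the folded single-loop form a compiler could emit is equivalent to this (my sketch, not actual compiler output):

for (int k = 0; k < 1000; k++) // 100 * 10 == 10 * 100 == 1000 iterations
    printf("Hello");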
In the real world, compilers don't perform magic (and they aren't generally tuned to optimize dumb code or recognize its patterns), so these snippets will actually compile to slightly different object code.
Looking at GCC's output, it applies the standard peephole optimizations to the loops, but it doesn't merge them; it compiles the loops exactly the way you wrote them. You can see that the code for Test1 and Test2 is basically identical, except that Test1 goes around 100 times in the outer loop and 10 times in the inner loop, while Test2 does exactly the opposite. This is just a matter of moving different constants into registers.
MSVC follows the same strategy when generating code. Its basic pattern optimizations for the loop structures are slightly different than GCC's, but the code is morally equivalent. The only difference between Test1 and Test2 is whether the outer loop spins from 0 to 100 or from 0 to 10.
What about performance? Well, the only right way to answer this question would be to compile both samples and measure; in fact, that's the only way you ever reach an objective answer to a performance question. But you'll immediately run into a problem if you try: the printf call in the loop will massively dominate the time taken by everything else, making your benchmark results noisy and meaningless. You'd need to put something else inside the loop, something that doesn't swamp the effect you're trying to measure, yet has side effects that prevent the compiler from trivially optimizing it away. This is why such microbenchmarks are exceedingly difficult to do correctly. They are also not particularly interesting; what you should be benchmarking is real code, and this is not real code. So I won't even bother trying to get meaningful benchmark numbers out of it.
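(If you did want to try, one common trick is to replace the call with a cheap side effect the optimizer is not allowed to delete, such as a write to a volatile object. A hypothetical harness, with the names test1, test2, and sink of my choosing, might look like:

#include <stdio.h>

volatile unsigned long sink; // volatile: every increment is an observable
                             // side effect, so neither loop can be removed

void test1 (void)
{
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 10; j++)
            sink++;
}

void test2 (void)
{
    for (int i = 0; i < 10; i++)
        for (int j = 0; j < 100; j++)
            sink++;
}

int main (void)
{
    test1();
    test2();
    printf("%lu\n", sink); // 2000 total increments
    return 0;
}

You would then time test1 and test2 separately, over many repetitions.)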
The only thing I'll allow myself to do is noodle a bit about the conceptual performance implications of the code generated for the two functions. I would guess that making the larger loop the inner loop (i.e., Test2) would be slightly faster. Why? Well, once the code gets loaded into the instruction cache, it is executed 100 times in rapid sequence, with the branch predictor successfully predicting the target of the branch in nearly all cases. This is as efficient as it gets for a tight loop. In the other case, you'd only get to do 10 iterations under these optimal conditions before falling out of the inner loop and having to start over, which risks evicting code from the instruction cache. You'd have to test, or really study the specifics of the generated code, to see whether that's actually a possibility; it depends on the exact size of the code and how much cache your processor has available, but it's a theoretical concern.
Switching gears, let's see what Clang generates. Interesting! The code looks quite different for the two test functions. For Test1, Clang has unrolled the inner loop completely and emitted 10 back-to-back calls to the printf function, wrapped inside a loop that spins 100 times. Again, that's consistent with the C code you originally wrote, but since the inner loop had so few iterations, Clang's optimizer decided that unrolling it was probably a performance win. And it's probably right. What happened with Test2? Well, sort of the same thing; Clang just unrolled it differently because you wrote the original code differently. It has unrolled the outer loop, giving 10 back-to-back code sequences that each loop from 0 to 100.
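In C terms, the two shapes Clang produced correspond roughly to the following (my paraphrase of the generated code, not a literal decompilation):

// Test1: inner loop fully unrolled, outer loop kept
for (int i = 0; i < 100; i++)
{
    printf("Hello"); printf("Hello"); printf("Hello"); printf("Hello");
    printf("Hello"); printf("Hello"); printf("Hello"); printf("Hello");
    printf("Hello"); printf("Hello");
}

// Test2: outer loop fully unrolled, leaving 10 copies of the inner loop
for (int j = 0; j < 100; j++) printf("Hello");
for (int j = 0; j < 100; j++) printf("Hello");
// ... 8 more identical copies ...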
Continuing with our theme of breaking the cardinal rule of performance analysis, we'll skip benchmarking the output and just think about it conceptually. The first thing that jumps out is that Test2 requires a lot more code: it takes more than twice as many bytes to encode those instructions (321 vs. 141 bytes). Smaller code isn't always faster, of course, but here, where there isn't otherwise a clear winner, I'd be inclined to err in the direction of smaller code. The only thing that might change that analysis is if the body of the unrolled loop in Test1 were too large to fit in the cache. The loop bodies in Test2 are much smaller, even though the overall code is larger, so they're virtually guaranteed to stay hot in the cache, and having the code in the cache is far better for performance. Hmm, I guess we won't be able to tell without benchmarking after all.
Upvotes: 2