Reputation: 2017
Which way is better, and why?
Case 1:

for (i = 0; i < 100; i++)
    for (j = 0; j < 10; j++)
        printf("Hello");

Case 2:

for (i = 0; i < 10; i++)
    for (j = 0; j < 100; j++)
        printf("Hello");
Upvotes: 1
Views: 1983
Reputation: 213842
Generally speaking, neither form is better or faster. The compiler might even optimize both versions into code which only uses one loop, in which case both versions will yield identical machine code.
EDIT
I compiled both versions with gcc -O3, and they produced identical (though cryptic) machine code (x86-64):
0x00402CF0 push %rsi
0x00402CF1 push %rbx
0x00402CF2 sub $0x28,%rsp
0x00402CF6 mov $0xa,%esi
0x00402CFB callq 0x4022f0 <__main>
0x00402D00 mov $0x64,%ebx
0x00402D05 lea 0x12f4(%rip),%rcx # 0x404000
0x00402D0C callq 0x402ba8 <printf>
0x00402D11 sub $0x1,%ebx
0x00402D14 jne 0x402d05 <main+21>
0x00402D16 sub $0x1,%esi
0x00402D19 jne 0x402d00 <main+16>
0x00402D1B xor %eax,%eax
0x00402D1D add $0x28,%rsp
0x00402D21 pop %rbx
0x00402D22 pop %rsi
0x00402D23 retq
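For reference, a listing in this format can be produced from the compiled binary with a disassembler, for example:

gcc -O3 test.c -o test
objdump -d test

(objdump prints AT&T syntax by default, which matches the listing above; GDB's disassemble command gives similar output.)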
Code used for benchmarking, compiled with gcc -std=c11 -pedantic-errors -Wall -Wextra -O3:
#include <stdio.h>

#define I 100 // only change these 2 constants between builds
#define J 10

int main (void)
{
    for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++)
            printf("Hello");
    return 0;
}
Efficiency problems only arise when you do things like this:

// BAD, enforces poor data cache utilization:
// the inner loop jumps across rows, so consecutive writes
// are far apart in memory
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        array[j][i] = something;

// BAD, same problem: the inner loop varies the row index
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        array[i][j] = something;

// GOOD, optimized for the data cache:
// the inner loop walks consecutive addresses within one row
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        array[i][j] = something;
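If you want to see this effect for yourself, you can time the two traversal orders. A minimal sketch (the size N, the fill helper, and the final printf that keeps the stores observable are my additions; exact numbers vary by machine, and some compilers can interchange such loops at high optimization levels):

#include <stdio.h>
#include <time.h>

#define N 4096

static int array[N][N];

// Fill the matrix either row by row (consecutive addresses,
// cache-friendly) or column by column (large stride, cache-hostile)
// and return the elapsed CPU time in seconds.
static double fill (int by_rows)
{
    clock_t start = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            if (by_rows)
                array[i][j] = i + j;
            else
                array[j][i] = i + j;
        }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main (void)
{
    printf("row-major:    %f s\n", fill(1));
    printf("column-major: %f s\n", fill(0));
    printf("%d\n", array[1][2]); // read back so the stores can't be removed
    return 0;
}

On a typical desktop machine the column-major version comes out noticeably slower, because nearly every write touches a different cache line.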
Upvotes: 2
Reputation: 244752
Assuming we're talking about the case where the code is compiled with optimization enabled, since it is meaningless to talk about "efficiency" or "performance" when optimizations are disabled…
These should compile to identical object code. All the loop bounds are compile-time constants, so the compiler can determine exactly how many times the body of the loop will be executed, fold everything into a single loop, and emit that code. If it wanted to (and it wouldn't, because that would waste an enormous amount of code space for no significant gain in speed), it could even emit 10,000 sequential calls to the printf function. That is just loop unrolling taken to its extreme, and almost any optimizing compiler does some amount of unrolling nowadays.
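To illustrate, the folded single-loop form a compiler could emit is equivalent to this (my sketch, not actual compiler output):

for (int k = 0; k < 1000; k++) // 100 * 10 == 10 * 100 == 1000 iterations
    printf("Hello");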
In the real world, compilers don't perform magic (and they aren't generally tuned to optimize dumb code or recognize its patterns), so these snippets will actually compile to slightly different object code.
Looking at GCC's output, it applies the standard peephole optimizations to the loops, but it doesn't merge them; it compiles the loops exactly the way you wrote them. You can see that the code for Test1 and Test2 is basically identical, except that Test1 goes around 100 times in the outer loop and 10 times in the inner loop, while Test2 does exactly the opposite. This is just a matter of moving different constants into registers.
MSVC follows the same strategy when generating code. Its basic pattern optimizations for the loop structures are slightly different than GCC's, but the code is morally equivalent. The only difference between Test1 and Test2 is whether the outer loop spins from 0 to 100 or from 0 to 10.
What about performance? Well, the only right way to answer this question would be to compile both samples and measure; in fact, that's the only way you ever reach an objective answer to a performance question. But you'll immediately run into a problem if you try: the printf call in the loop will massively dominate the time taken by everything else, making your benchmark results noisy and meaningless. You'd need to put something else inside the loop, something that doesn't swamp the effect you're trying to measure, yet has side effects that prevent the compiler from trivially optimizing it away. This is why such microbenchmarks are exceedingly difficult to do correctly. They are also not particularly interesting; what you should be benchmarking is real code, and this is not real code. So I won't even bother trying to get meaningful benchmark numbers out of it.
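(If you did want to try, one common trick is to replace the call with a cheap side effect the optimizer is not allowed to delete, such as a write to a volatile object. A hypothetical harness, with the names test1, test2, and sink of my choosing, might look like:

#include <stdio.h>

volatile unsigned long sink; // volatile: every increment is an observable
                             // side effect, so neither loop can be removed

void test1 (void)
{
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 10; j++)
            sink++;
}

void test2 (void)
{
    for (int i = 0; i < 10; i++)
        for (int j = 0; j < 100; j++)
            sink++;
}

int main (void)
{
    test1();
    test2();
    printf("%lu\n", sink); // 2000 total increments
    return 0;
}

You would then time test1 and test2 separately, over many repetitions.)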
The only thing I'll allow myself to do is noodle a bit about the conceptual performance implications of the code generated for the two functions. I would guess that making the larger loop the inner loop (i.e., Test2) would be slightly faster. Why? Well, once the code gets loaded into the instruction cache, it is executed 100 times in rapid sequence, with the branch predictor successfully predicting the target of the branch in nearly all cases. This is as efficient as it gets for a tight loop. In the other case, you'd only get to do 10 iterations under these optimal conditions before falling out of the inner loop and having to start over, which risks evicting code from the instruction cache. You'd have to test, or really study the specifics of the generated code, to see whether that's actually a possibility; it depends on the exact size of the code and how much cache your processor has available, but it's a theoretical concern.
Switching gears, let's see what Clang generates. Interesting! The code looks quite different for the two test functions. For Test1, Clang has unrolled the inner loop completely and emitted 10 back-to-back calls to the printf function, wrapped inside a loop that spins 100 times. Again, that's consistent with the C code you originally wrote, but since the inner loop had so few iterations, Clang's optimizer decided that unrolling it was probably a performance win. And it's probably right. What happened with Test2? Well, sort of the same thing; Clang just unrolled it differently because you wrote the original code differently. It has unrolled the outer loop, giving 10 back-to-back code sequences that each loop from 0 to 100.
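In C terms, the two shapes Clang produced correspond roughly to the following (my paraphrase of the generated code, not a literal decompilation):

// Test1: inner loop fully unrolled, outer loop kept
for (int i = 0; i < 100; i++)
{
    printf("Hello"); printf("Hello"); printf("Hello"); printf("Hello");
    printf("Hello"); printf("Hello"); printf("Hello"); printf("Hello");
    printf("Hello"); printf("Hello");
}

// Test2: outer loop fully unrolled, leaving 10 copies of the inner loop
for (int j = 0; j < 100; j++) printf("Hello");
for (int j = 0; j < 100; j++) printf("Hello");
// ... 8 more identical copies ...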
Continuing with our theme of breaking the cardinal rule of performance analysis, we'll skip benchmarking the output and just think about it conceptually. The first thing that jumps out is that Test2 requires a lot more code: it takes more than twice as many bytes to encode those instructions (321 vs. 141 bytes). Smaller code isn't always faster, of course, but here, where there isn't otherwise a clear winner, I'd be inclined to err in the direction of smaller code. The only thing that might change that analysis is if the body of the unrolled loop in Test1 were too large to fit in the cache. The loop bodies in Test2 are much smaller, even though the overall code is larger, so they're virtually guaranteed to stay hot in the cache, and having the code in the cache is far better for performance. Hmm, I guess we won't be able to tell without benchmarking after all.
Upvotes: 2