Reputation: 71
Consider the following two cases.
In the first case I am just running two nested loops, both starting at 0 and running up to 100000:
int i, j, k = 100000;
for (i = 0; i < k; i++)
    for (j = 0; j < k; j++) {
        // Do nothing
    }
Time on my system: 22.6 seconds.
In the second case I am doing the same thing, just incrementing a counter cnt inside the inner loop:
int i, j, k = 100000, cnt = 0;
for (i = 0; i < k; i++)
    for (j = 0; j < k; j++) {
        cnt++;
    }
Time on my system: 19.6 seconds.
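For completeness, here is a self-contained version of the second case with an in-program timer, in case anyone wants to reproduce it. The clock() call and the printf are only added so the snippet can be compiled and run as-is; the timings above were taken separately.

#include <stdio.h>
#include <time.h>

int main(void)
{
    int i, j, k = 100000, cnt = 0;
    clock_t start, end;

    start = clock();                      /* start CPU-time measurement */
    for (i = 0; i < k; i++)
        for (j = 0; j < k; j++) {
            cnt++;                        /* note: 10^10 increments overflow a 32-bit int */
        }
    end = clock();

    printf("cnt = %d, elapsed = %.2f s\n",
           cnt, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}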
How come? Why does case 2 take less time than case 1?
Upvotes: 3
Views: 193
Reputation: 5796
I just reproduced the results and asked myself the same question as the OP. Here is the code:
>>>> test1.c
int
main ()
{
    long long int i;
    long long int j;
    long long int k = 100000;
    for (i = 0; i < k; i++)
        for (j = 0; j < k; j++)
        {
            // Do nothing
        }
    return 0;
}
>>>> test2.c
int
main ()
{
    long long int i;
    long long int j;
    long long int c = 0;
    long long int k = 100000;
    for (i = 0; i < k; i++)
        for (j = 0; j < k; j++)
        {
            c++;
        }
    return 0;
}
Both were compiled with gcc -o testx testx.c -g on an amd64 Gentoo Linux machine.
When run, I get the following times:
test1: 0m32.000s
test2: 0m28.307s
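These are the real (wall-clock) times as reported by the shell's time builtin, i.e. an invocation along these lines (the exact command line is an assumption):

time ./test1
time ./test2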
I have tested this multiple times, and the deviation is surprisingly small.
To understand what happens here, we have to look at the disassembly.
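The dumps below are gdb output; something along these lines reproduces them (objdump -d gives equivalent output):

gdb ./test1
(gdb) disassemble main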
>>>> test1
Dump of assembler code for function main:
0x00000000004004fc <+0>: push %rbp
0x00000000004004fd <+1>: mov %rsp,%rbp
0x0000000000400500 <+4>: movq $0x186a0,-0x18(%rbp)
0x0000000000400508 <+12>: movq $0x0,-0x8(%rbp)
0x0000000000400510 <+20>: jmp 0x400530 <main+52>
0x0000000000400512 <+22>: movq $0x0,-0x10(%rbp)
0x000000000040051a <+30>: jmp 0x400521 <main+37>
0x000000000040051c <+32>: addq $0x1,-0x10(%rbp)
0x0000000000400521 <+37>: mov -0x10(%rbp),%rax
0x0000000000400525 <+41>: cmp -0x18(%rbp),%rax
0x0000000000400529 <+45>: jl 0x40051c <main+32>
0x000000000040052b <+47>: addq $0x1,-0x8(%rbp)
0x0000000000400530 <+52>: mov -0x8(%rbp),%rax
0x0000000000400534 <+56>: cmp -0x18(%rbp),%rax
0x0000000000400538 <+60>: jl 0x400512 <main+22>
0x000000000040053a <+62>: mov $0x0,%eax
0x000000000040053f <+67>: pop %rbp
0x0000000000400540 <+68>: retq
End of assembler dump.
>>>> test2:
Dump of assembler code for function main:
0x00000000004004fc <+0>: push %rbp
0x00000000004004fd <+1>: mov %rsp,%rbp
0x0000000000400500 <+4>: movq $0x0,-0x18(%rbp)
0x0000000000400508 <+12>: movq $0x186a0,-0x20(%rbp)
0x0000000000400510 <+20>: movq $0x0,-0x8(%rbp)
0x0000000000400518 <+28>: jmp 0x40053d <main+65>
0x000000000040051a <+30>: movq $0x0,-0x10(%rbp)
0x0000000000400522 <+38>: jmp 0x40052e <main+50>
0x0000000000400524 <+40>: addq $0x1,-0x18(%rbp)
0x0000000000400529 <+45>: addq $0x1,-0x10(%rbp)
0x000000000040052e <+50>: mov -0x10(%rbp),%rax
0x0000000000400532 <+54>: cmp -0x20(%rbp),%rax
0x0000000000400536 <+58>: jl 0x400524 <main+40>
0x0000000000400538 <+60>: addq $0x1,-0x8(%rbp)
0x000000000040053d <+65>: mov -0x8(%rbp),%rax
0x0000000000400541 <+69>: cmp -0x20(%rbp),%rax
0x0000000000400545 <+73>: jl 0x40051a <main+30>
0x0000000000400547 <+75>: mov $0x0,%eax
0x000000000040054c <+80>: pop %rbp
0x000000000040054d <+81>: retq
End of assembler dump.
As expected, this looks very similar to test1.
I have highlighted what the code does in a commented version of test2 below. The indentation of the assembly lines represents the level of the loop they are in, or that they implement.
>>>> test2:
Dump of assembler code for function main:
// setup the stackframe
0x00000000004004fc <+0>: push %rbp
0x00000000004004fd <+1>: mov %rsp,%rbp
// initialize variable c
0x0000000000400500 <+4>: movq $0x0,-0x18(%rbp)
// initialize variable k
0x0000000000400508 <+12>: movq $0x186a0,-0x20(%rbp)
// initialize variable i
0x0000000000400510 <+20>: movq $0x0,-0x8(%rbp)
// enter the outer loop
0x0000000000400518 <+28>: jmp 0x40053d <main+65>
    // initialize variable j
    0x000000000040051a <+30>: movq $0x0,-0x10(%rbp)
    // enter the inner loop
    0x0000000000400522 <+38>: jmp 0x40052e <main+50>
        // increment variable c
        0x0000000000400524 <+40>: addq $0x1,-0x18(%rbp)
    // increment variable j
    0x0000000000400529 <+45>: addq $0x1,-0x10(%rbp)
    // check if the inner loop condition still holds
    0x000000000040052e <+50>: mov -0x10(%rbp),%rax
    0x0000000000400532 <+54>: cmp -0x20(%rbp),%rax
    // jump to the start of the inner loop, if true, else continue
    0x0000000000400536 <+58>: jl 0x400524 <main+40>
// increment variable i
0x0000000000400538 <+60>: addq $0x1,-0x8(%rbp)
// check if the outer loop condition still holds
0x000000000040053d <+65>: mov -0x8(%rbp),%rax
0x0000000000400541 <+69>: cmp -0x20(%rbp),%rax
// jump to the start of the outer loop, if true, else continue
0x0000000000400545 <+73>: jl 0x40051a <main+30>
// set the return value, tear down the stack frame and return
0x0000000000400547 <+75>: mov $0x0,%eax
0x000000000040054c <+80>: pop %rbp
0x000000000040054d <+81>: retq
End of assembler dump.
As you can see, the code structure closely mirrors the actual C code, and the differences between the assembly of test1 and test2 are very small.
The reason why test2 performs marginally faster is probably buried deep in your hardware. My guess is that modern processors optimize instruction caching and pipelining for simple loops that do some work, because those are so common in real programs, and that this optimization does not apply to empty loops, which are (1) very rare in actual programs and (2) not worth optimizing at runtime anyway, since an empty loop is usually there for (busy) waiting.
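If you want to poke at that hypothesis, hardware performance counters are the place to look. On Linux, something like the following (a suggestion, not something I ran for this answer) shows cycle, instruction and branch behaviour for both binaries:

perf stat -e cycles,instructions,branches,branch-misses ./test1
perf stat -e cycles,instructions,branches,branch-misses ./test2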
Whatever the reason, as academically interesting as it might be, the impact on real software would probably be nonexistent :)
I just found this document on CPU microarchitecture by Agner Fog, which should be an interesting read if you are interested in the details: http://www.agner.org/optimize/microarchitecture.pdf
Upvotes: 5
Reputation: 60275
When CPU designers consider performance these days, they try to get a good idea of the code their processor is going to run, and then they design the chip to run as fast as possible overall on that workload. In the league they're playing in, that means making tradeoffs. So more common code runs faster, and that improves overall performance.
Upvotes: 3
Reputation: 32681
Probably because of the optimizations made by your compiler.
Modern compilers are quite good at optimizing loops.
Actually I am surprised that the first case took so long: most optimizers will see that the loops do nothing and simply set i and j to their final value k, without the burden of emitting empty loops full of jump (JNZ in assembler) instructions. The second case will probably just be reduced to the assignment cnt = k*k (i.e. 10^10, which overflows a 32-bit int), since the compiler is smart enough to know that this is the end result of the loop, that the loop triggers no other side effect, and that it can therefore be skipped.
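In other words, with optimization on, the second program might effectively collapse to something like this (a sketch of what the optimizer could produce, not actual compiler output):

long long cnt = 100000LL * 100000LL;  /* 10^10; would overflow if cnt were a 32-bit int */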
Try running your test again with -O0 to turn off compiler optimizations. Actually, you didn't mention which compiler and which options you used for the test, so any conclusion is premature so far.
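If the goal is to benchmark the loops themselves even with optimization enabled, a common trick is to make the counter volatile and/or to use its value afterwards, so the optimizer cannot remove the work. A minimal sketch (the variable names are chosen here, not taken from the question):

#include <stdio.h>

int main(void)
{
    volatile long long cnt = 0;   /* volatile forces the stores, so the loop survives -O2 */
    long long i, j, k = 100000;

    for (i = 0; i < k; i++)
        for (j = 0; j < k; j++)
            cnt++;

    printf("%lld\n", cnt);        /* using the result also keeps it from being discarded */
    return 0;
}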
Upvotes: 0
Reputation: 3417
If, as @JesseJ comments above, repeated trials yield the same results, it's more than likely a difference in how the code is optimized by the compiler. If you compile with optimization off, you would likely get your expected results (case2 > case1). With the optimizer on, all bets are off.
Upvotes: 0