Reputation: 531
Let test_speed.c be the following C code:
#include <stdio.h>
int main(){
    int i;
    for(i=0; i < 1000000000; i++) {}
    printf("%d", i);
}
I compile it in the terminal with:
gcc -o test_speed test_speed.c
and then run:
time ./test_speed
I get:
Now I compile with:
gcc -O3 -o test_speed test_speed.c
and then run:
time ./test_speed
I get:
How can the second run be this fast? Is the result already computed during compilation?
Upvotes: 2
Views: 666
Reputation: 140307
That's because the aggressive -O3 optimization determines that
for(i=0; i < 1000000000; i++) {}
has no side effects (other than the final value of i) and removes the loop completely, setting i directly to 1000000000.
Disassembly (x86):
00000000 <_main>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 e4 f0 and $0xfffffff0,%esp
6: 83 ec 10 sub $0x10,%esp
9: e8 00 00 00 00 call e <_main+0xe>
e: c7 44 24 04 00 ca 9a movl $0x3b9aca00,0x4(%esp) <== 1000000000 in hex, no loop
15: 3b
16: c7 04 24 00 00 00 00 movl $0x0,(%esp)
1d: e8 00 00 00 00 call 22 <_main+0x22>
22: 31 c0 xor %eax,%eax
24: c9 leave
25: c3 ret
As you can see, that optimization level is not suitable for calibrated busy-wait loops. (The result is the same with -O2, but the loop remains unoptimized with just -O.)
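If you do want a calibrated busy-wait loop that survives -O2/-O3, one common approach is to make the loop counter volatile. The sketch below uses a hypothetical helper named busy_loop. Because every read and write of a volatile object counts as observable behavior, the compiler has to perform all n iterations; how long each iteration takes is still compiler- and CPU-dependent, so any calibration has to be measured on the target machine.

```c
/* Hypothetical helper: spin for n iterations without the loop being removed.
   Declaring the counter volatile makes each read and write of it an access
   the compiler must actually perform, so even -O3 keeps every iteration. */
static unsigned long busy_loop(unsigned long n)
{
    volatile unsigned long i;
    for (i = 0; i < n; i++) {
        /* empty body: the volatile loads and stores of i are the work */
    }
    return i; /* equals n once the loop finishes */
}
```

Calling busy_loop(1000000000UL) from main and timing it with time should then show roughly the same cost at -O0 and -O3, apart from the extra memory traffic of the volatile counter.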
Upvotes: 7
Reputation: 225637
The compiler recognizes that the loop does nothing, and that removing it would not change the output of the program, so the loop was optimized away entirely.
Here's the assembly with -O0:
.L3:
.loc 1 4 0 is_stmt 0 discriminator 3
addl $1, -4(%rbp)
.L2:
.loc 1 4 0 discriminator 1
cmpl $999999999, -4(%rbp) # loop
jle .L3
.loc 1 5 0 is_stmt 1
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
.loc 1 6 0
leave
.cfi_def_cfa 7, 8
ret
And with -O3:
main:
.LFB23:
.file 1 "x1.c"
.loc 1 2 0
.cfi_startproc
.LVL0:
subq $8, %rsp
.cfi_def_cfa_offset 16
.LBB4:
.LBB5:
.file 2 "/usr/include/x86_64-linux-gnu/bits/stdio2.h"
.loc 2 104 0
movl $1000000000, %edx # stored value, no loop
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
call __printf_chk
.LVL1:
.LBE5:
.LBE4:
.loc 1 6 0
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
You can see that in the -O3 case the loop is removed entirely, and the final value of i, 1000000000, is passed directly to __printf_chk as a constant.
Upvotes: 2
Reputation:
A compiler only has to preserve the observable behavior of a program. Incrementing a counter with no I/O, no volatile access, and no use of its intermediate values isn't observable, so since your loop doesn't do anything else, the optimizer throws it away completely and assigns the final value directly.
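The same rule lets an optimizer replace a loop with its closed-form result even when the loop body is not empty. A minimal sketch, with hypothetical function names: sum_loop adds 0 + 1 + ... + (n - 1), and sum_closed_form is the kind of loop-free replacement the as-if rule would permit. Whether a particular compiler actually performs this rewrite depends on the compiler and optimization level.

```c
/* Hypothetical example: a loop whose only effect is its returned result.
   No observable behavior depends on the intermediate values of sum or i,
   so an optimizer may replace the whole loop with a closed form. */
static unsigned long long sum_loop(unsigned long long n)
{
    unsigned long long sum = 0;
    for (unsigned long long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* A loop-free equivalent: n * (n - 1) / 2. Halving whichever factor is even
   before multiplying keeps the result identical to the loop's, including
   unsigned wrap-around for very large n. */
static unsigned long long sum_closed_form(unsigned long long n)
{
    return (n % 2 == 0) ? (n / 2) * (n - 1) : n * ((n - 1) / 2);
}
```

For n = 1000 both functions return 499500; the closed form just gets there without iterating, which is exactly what happens to the question's loop at -O3.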
Upvotes: 2
Reputation: 368599
gcc "knows" that the loop has no body and that nothing depends on any of its results, temporary or final, so it removes the loop.
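Conversely, giving the loop a dependency the compiler cannot see through keeps it in place. This sketch relies on the GCC/Clang inline-asm extension rather than standard C, and opaque_loop is a hypothetical name: the empty asm statement tells the compiler that it may read and modify i, so gcc can no longer compute the final value ahead of time, while the asm template itself emits no instructions.

```c
/* GCC/Clang-specific sketch using the GNU inline-asm extension.
   The empty asm with a "+r" (read-write register) operand makes i opaque:
   the compiler must assume the asm might change i on every iteration, so it
   cannot drop the loop or precompute its result. */
static unsigned long opaque_loop(unsigned long n)
{
    unsigned long i;
    for (i = 0; i < n; i++) {
        __asm__ __volatile__("" : "+r"(i));
    }
    return i; /* the asm never really changes i, so this is n */
}
```

Unlike a volatile counter, this keeps i in a register, so each iteration costs only the increment, compare, and branch.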
A good tool for this kind of analysis is godbolt.org, which shows you the generated assembly. The difference between no optimization at all and -O3 optimization is stark.
Upvotes: 3