Reputation: 81
Below is a C function to evaluate a polynomial:
/* Calculate a0 + a1*x + a2*x^2 + ... + an*x^n */
/* from CSAPP Ex.5.5, modified to integer version */
int poly(int a[], int x, int degree) {
long int i;
int result = a[0];
int xpwr = x;
for (i = 1; i <= degree; ++i) {
result += a[i]*xpwr;
xpwr *= x;
}
return result;
}
And a main function:
#define TIMES 100000ll
int main(void) {
long long int i;
unsigned long long int result = 0;
for (i = 0; i < TIMES; ++i) {
/* g_a is an int[10000] global variable with all elements equals to 1 */
/* x = 2, i.e. evaluate 1 + 2 + 2^2 + ... + 2^9999 */
result += poly(g_a, 2, 9999);
}
printf("%lld\n", result);
return 0;
}
When I compile the program with GCC and options -O1 and -O2 separately, I found that -O1 is FASTER than -O2 a lot.
Platform details:
Result:
It seems that -O1 is approximately 10000 times faster than -O2.
When I test it on Mac (clang-600.0.56), the result is even more weird: -O1 takes no more than 0.02s even when TIMES = 1000000000000000000ll
I have tested the following changes:
And the results are the same.
I tried to look at the assembly code, it seems that -O1 is calling the poly function while -O2 does inline optimization. But inline should make the performance better, isn't it?
What makes these huge differences? Why -O1 on clang can make the program so fast? Is -O1 doing something wrong? (I cannot check the result as it is too slow without optimization)
Upvotes: 2
Views: 810
Reputation: 81
Here is the assembly code of main
for -O1
: (you may get it by adding -S
option to gcc)
main:
.LFB12:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $9999, %edx
movl $2, %esi
movl $g_a, %edi
call poly
movslq %eax, %rdx
movl $100000, %eax
.L6:
subq $1, %rax
jne .L6
imulq $100000, %rdx, %rsi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
And for -O2
:
main:
.LFB12:
.cfi_startproc
movl g_a(%rip), %r9d
movl $100000, %r8d
xorl %esi, %esi
.p2align 4,,10
.p2align 3
.L8:
movl $g_a+4, %eax
movl %r9d, %ecx
movl $2, %edx
.p2align 4,,10
.p2align 3
.L7:
movl (%rax), %edi
addq $4, %rax
imull %edx, %edi
addl %edx, %edx
addl %edi, %ecx
cmpq $g_a+40000, %rax
jne .L7
movslq %ecx, %rcx
addq %rcx, %rsi
subq $1, %r8
jne .L8
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC1, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
Although I don't know much about assembly, it is obvious that -O1
is just calling poly
once, and multiply the result by 100000 (imulq $100000, %rdx, %rsi
). This is the reason that it is so fast.
It seems that gcc can detect that poly
is a pure function with no side effect. (It will be interesting if we have another thread modifying g_a
while poly
is running...)
On the other hand, -O2
has inlined the poly
function, so it has no chance to check poly
as a pure function.
I have further done some research:
I cannot find the actual flag used by -O1
which do the pure function checking.
I have tried all the flags listed by gcc -Q -O1 --help=optimizers
individually, but none of them have the effect.
Maybe it needs a combination of the flags together to get the effect, but it is very hard to try all the combinations.
But I have found the flag used by -O2
which makes the effect disappear, which is the -finline-small-functions
flag. The name of the flag explains itself.
Upvotes: 5
Reputation: 23448
One thing that jumps out at me is that you're overflowing signed integers. The behaviour of this is undefined in C. Specifically, int result
won't be able to hold pow(2,9999). I don't see what the point is of benchmarking code with undefined behaviour?
Upvotes: 0