Below is a C function to evaluate a polynomial: /* Calculate a0 + a1*x + a2*x^2 + ... + an*x^n */ /* from CSAPP Ex.5.5, modified to integer version */ int poly(int a[], int x, int degree) { long int i; int result = a[0]; int xpwr = x; for (i = 1; i <= degree; ++i) { result += a[i]*xpwr; xpwr *= x; } return result; } And a main function: #define TIMES 100000ll int main(void) { long long int i; unsigned long long int result = 0; for (i = 0; i < TIMES; ++i) { /* g_a is an int[10000] global variable with all elements equals to 1 */ /* x = 2, i.e. evaluate 1 + 2 + 2^2 + ... + 2^9999 */ result += poly(g_a, 2, 9999); } printf("%lld\n", result); return 0; } When I compile the program with GCC and options -O1 and -O2 separately, I found that -O1 is FASTER than -O2 a lot. Platform details: i5-4600 Arch Linux x86_64 with kernel 3.18 GCC 4.9.2 gcc -O1 -o /tmp/a.out test.c gcc -O2 -o /tmp/a.out test.c Result: When TIMES = 100000ll, -O1 prints the result instantly, while -O2 needs 0.36s When TIMES = 1000000000ll, -O1 prints the result in 0.28s, -O2 takes so long that I didn't finish the test It seems that -O1 is approximately 10000 times faster than -O2. When I test it on Mac (clang-600.0.56), the result is even more weird: -O1 takes no more than 0.02s even when TIMES = 1000000000000000000ll I have tested the following changes: makes g_a random (elements are from 1 to 10) x = 19234 (or some other number) use int instead of long long int And the results are the same. I tried to look at the assembly code, it seems that -O1 is calling the poly function while -O2 does inline optimization. But inline should make the performance better, isn't it? What makes these huge differences? Why -O1 on clang can make the program so fast? Is -O1 doing something wrong? (I cannot check the result as it is too slow without optimization)

Reputation: 81

Why -O1 is faster than -O2 for 10000 times?

Below is a C function to evaluate a polynomial:

/* Calculate a0 + a1*x + a2*x^2 + ... + an*x^n */
/* from CSAPP Ex.5.5, modified to integer version */
int poly(int a[], int x, int degree) {
  long int i;
  int result = a[0];
  int xpwr = x;
  for (i = 1; i <= degree; ++i) {
    result += a[i]*xpwr;
    xpwr *= x;
  }
  return result;
}

And a main function:

#define TIMES 100000ll
int main(void) {
  long long int i;
  unsigned long long int result = 0;
  for (i = 0; i < TIMES; ++i) {
    /* g_a is an int[10000] global variable with all elements equals to 1 */
    /* x = 2, i.e. evaluate 1 + 2 + 2^2 + ... + 2^9999 */
    result += poly(g_a, 2, 9999);
  }
  printf("%lld\n", result);
  return 0;
}

When I compile the program with GCC and options -O1 and -O2 separately, I found that -O1 is FASTER than -O2 a lot.

Platform details:

i5-4600
Arch Linux x86_64 with kernel 3.18
GCC 4.9.2
gcc -O1 -o /tmp/a.out test.c
gcc -O2 -o /tmp/a.out test.c

Result:

When TIMES = 100000ll, -O1 prints the result instantly, while -O2 needs 0.36s
When TIMES = 1000000000ll, -O1 prints the result in 0.28s, -O2 takes so long that I didn't finish the test

It seems that -O1 is approximately 10000 times faster than -O2.

When I test it on Mac (clang-600.0.56), the result is even more weird: -O1 takes no more than 0.02s even when TIMES = 1000000000000000000ll

I have tested the following changes:

makes g_a random (elements are from 1 to 10)
x = 19234 (or some other number)
use int instead of long long int

And the results are the same.

I tried to look at the assembly code, it seems that -O1 is calling the poly function while -O2 does inline optimization. But inline should make the performance better, isn't it?

What makes these huge differences? Why -O1 on clang can make the program so fast? Is -O1 doing something wrong? (I cannot check the result as it is too slow without optimization)

Upvotes: 2

Answers (2)

nnkken

Reputation: 81

Here is the assembly code of main for -O1: (you may get it by adding -S option to gcc)

main:
.LFB12:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movl    $9999, %edx
    movl    $2, %esi
    movl    $g_a, %edi
    call    poly
    movslq  %eax, %rdx
    movl    $100000, %eax
.L6:
    subq    $1, %rax
    jne .L6
    imulq   $100000, %rdx, %rsi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    printf
    movl    $0, %eax
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

And for -O2:

main:
.LFB12:
    .cfi_startproc
    movl    g_a(%rip), %r9d
    movl    $100000, %r8d
    xorl    %esi, %esi
    .p2align 4,,10
    .p2align 3
.L8:
    movl    $g_a+4, %eax
    movl    %r9d, %ecx
    movl    $2, %edx
    .p2align 4,,10
    .p2align 3
.L7:
    movl    (%rax), %edi
    addq    $4, %rax
    imull   %edx, %edi
    addl    %edx, %edx
    addl    %edi, %ecx
    cmpq    $g_a+40000, %rax
    jne .L7
    movslq  %ecx, %rcx
    addq    %rcx, %rsi
    subq    $1, %r8
    jne .L8
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movl    $.LC1, %edi
    xorl    %eax, %eax
    call    printf
    xorl    %eax, %eax
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

Although I don't know much about assembly, it is obvious that -O1 is just calling poly once, and multiply the result by 100000 (imulq $100000, %rdx, %rsi). This is the reason that it is so fast.

It seems that gcc can detect that poly is a pure function with no side effect. (It will be interesting if we have another thread modifying g_a while poly is running...)

On the other hand, -O2 has inlined the poly function, so it has no chance to check poly as a pure function.

I have further done some research:

I cannot find the actual flag used by -O1 which do the pure function checking.

I have tried all the flags listed by gcc -Q -O1 --help=optimizers individually, but none of them have the effect.

Maybe it needs a combination of the flags together to get the effect, but it is very hard to try all the combinations.

But I have found the flag used by -O2 which makes the effect disappear, which is the -finline-small-functions flag. The name of the flag explains itself.

Upvotes: 5

pmdj

Reputation: 23448

One thing that jumps out at me is that you're overflowing signed integers. The behaviour of this is undefined in C. Specifically, int result won't be able to hold pow(2,9999). I don't see what the point is of benchmarking code with undefined behaviour?

Upvotes: 0

Why -O1 is faster than -O2 for 10000 times?

Answers (2)

Related Questions