DudeMan2000

Reputation: 93

Preprocessor Definitions VS Local Variables, Speed Difference

I just compiled the following C code to test out the gcc optimizer (using the -O3 flag), expecting that both functions would end up generating the same set of assembly instructions:

int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
        return x*x+x*y+y;
#undef x
#undef y
}

int test2(int a, int b)
{
        int x = a*a*a+b;
        int y = a*b*a+3*b;
        return x*x+x*y+y;
}

But I was surprised to find that they generated slightly different assembly, and that the execution time for test1 (the code using the preprocessor instead of local variables) was a bit faster.

I've heard people say that the compiler can optimize better than humans can, and that you should tell it exactly what you want it to do; man I guess they weren't kidding. I thought the compiler was supposed to kind of guess at the programmer's intended use of local variables and replace their use if necessary... is that a false assumption?

When writing code for performance, are you better off naming sub-expressions with preprocessor definitions rather than local variables? I know it looks ugly as hell, but apparently it actually makes a difference, unless I'm missing something.

Here's the assembly I got, using "gcc test.c -O3 -S". My gcc version is 4.8.2; it looks like the two functions produce the same assembly on most versions of gcc, but not on the 4.7 or 4.8 versions for some reason.

test1
        movl    %edi, %eax
        movl    %edi, %edx
        leal    (%rsi,%rsi,2), %ecx
        imull   %edi, %eax
        imull   %esi, %edx
        imull   %edi, %eax
        imull   %edi, %edx
        addl    %esi, %eax
        addl    %ecx, %edx
        leal    (%rax,%rdx), %ecx
        imull   %ecx, %eax
        addl    %edx, %eax
        ret

test2
        movl    %edi, %eax
        leal    (%rsi,%rsi,2), %edx
        imull   %edi, %eax
        imull   %edi, %eax
        leal    (%rax,%rsi), %ecx
        movl    %edi, %eax
        imull   %esi, %eax
        imull   %edi, %eax
        addl    %eax, %edx
        leal    (%rcx,%rdx), %eax
        imull   %ecx, %eax
        addl    %edx, %eax
        ret

Upvotes: 4

Views: 198

Answers (2)

Dirk

Reputation: 277

The answer is twofold:

  1. Your expectation of identical results rests on a misconception.
  2. I cannot reproduce your result that test1 is faster than test2.

Preprocessor misconception

The results need not be identical. The preprocessor transforms the source text before the compiler proper sees it, whatever options you pass.

You can inspect the result of the preprocessor by running gcc -E main.c for example, assuming you are using a GNU compiler and your sources above are stored in a file main.c. The relevant parts become:

int test1(int a, int b)
{
  return (a*a*a+b)*(a*a*a+b)+(a*a*a+b)*(a*b*a+3*b)+(a*b*a+3*b);
}

int test2(int a, int b)
{
  int x = a*a*a+b;
  int y = a*b*a+3*b;
  return x*x+x*y+y;
}

Obviously, the expanded first version contains roughly twice as many arithmetic operations as the second one. Then the compiler and its optimiser come into play …

(NB: Ideally you would count the CPU cycles that the generated assembly needs. Use e.g. gcc -S main.c and look at main.s; you probably know that. Without optimisation, version 2 should "win" in that case.)
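Since the macros here have no side effects, the expanded form and the local-variable form compute the same value, which is what gives the optimiser room to merge the repeated sub-expressions. A minimal sketch of a sanity check, assuming small inputs so that no int overflow occurs (the helper name check_same_results is made up for illustration):

```c
#include <assert.h>

/* test1 and test2 as in the question, repeated so this compiles on its own. */
int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
    return x*x+x*y+y;
#undef x
#undef y
}

int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}

/* Hypothetical helper: assert that both forms agree for every a and b in
 * [lo, hi] and return the number of pairs compared. Keep the range small,
 * because signed overflow would make the comparison meaningless. */
int check_same_results(int lo, int hi)
{
    int checked = 0;
    for (int a = lo; a <= hi; ++a) {
        for (int b = lo; b <= hi; ++b) {
            assert(test1(a, b) == test2(a, b));
            ++checked;
        }
    }
    return checked;
}
```

Calling check_same_results(-10, 10) from main compares all 441 pairs in that range. It says nothing about speed, only that the two spellings mean the same thing for these inputs.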

Runtime testing and optimising

In order to compare our results, you should post your test code. When testing you need to average out short-term fluctuations and the limited granularity of the timer. Hence you probably run the same code in a loop:

int i=100000000;
while (--i>0) {
    int r;
    r = test1(3, 4);
}

Without the optimiser, test1 runs about 20% slower than test2.

However, the optimiser also analyses the calling code and can optimise away repeated calls with identical arguments, or calls whose result is never used (r in this case).

Therefore you must fool the compiler into actually making the calls, like this:

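int i = 100000000;
int r = 0;
while (--i > 0) {
    /* note: for large i this int arithmetic overflows, which is undefined behaviour in C */
    r += test1(3, i);
}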

When I tried that, I got runtimes that are identical to within about a percent: when I repeat the comparison several times, sometimes test1 is faster and sometimes test2 is.
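A minimal sketch of such a timing helper, for anyone who wants to repeat the comparison. This is not the code I actually ran: time_calls and its parameters are made up for illustration, the second argument is masked to stay small so the int arithmetic cannot overflow, and at high optimisation levels the compiler may still inline or restructure the loop.

```c
#include <assert.h>
#include <time.h>

/* test1 and test2 as in the question, repeated so this compiles on its own. */
int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
    return x*x+x*y+y;
#undef x
#undef y
}

int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}

/*
 * Hypothetical helper: call fn `iterations` times and return the elapsed
 * CPU time in seconds as reported by clock(). The second argument varies
 * with the loop counter so the calls are not all identical, and the results
 * are summed into *sink so that they count as used.
 */
double time_calls(int (*fn)(int, int), long iterations, unsigned long long *sink)
{
    assert(fn != NULL && sink != NULL);

    unsigned long long sum = 0;
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        /* i & 15 keeps b in 0..15, so every intermediate value fits in int */
        sum += (unsigned long long)fn(3, (int)(i & 15));
    }
    clock_t stop = clock();

    *sink = sum;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

From main you would call time_calls(test1, 100000000L, &s1) and time_calls(test2, 100000000L, &s2) a few times each, print the times and the sums, and compare. Equal sums confirm that both loops did the same work.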

You should look into the optimiser documentation to understand which optimisations you need to outsmart in your tests.

And I confirm what @Ville Krumlinde states: I get identical assembly output for both functions, even with just -O (gcc 4.4.7 on my desktop). The code contains only 9 assembler operations, which makes me believe that the optimiser "knows" enough algebra to simplify your formulas.

So you may just have been fooled by an optimiser artefact in your test harness after all.

Upvotes: 1

Ville Krumlinde

Reputation: 7131

Trying your code at godbolt, I get identical assembly for both functions with GCC, even with just the -O setting. Only by omitting the -O flag do I get different results. This really is expected, because the code is trivial to optimize.

Here is the assembly generated by gcc 4.4.7 with the -O flag. As you can see, the two functions are identical.

test1(int, int):
    movl    %edi, %eax
    imull   %edi, %eax
    imull   %eax, %edi
    addl    $3, %eax
    imull   %esi, %eax
    addl    %esi, %edi
    leal    (%rax,%rdi), %edx
    imull   %edi, %edx
    leal    (%rdx,%rax), %eax
    ret
test2(int, int):
    movl    %edi, %eax
    imull   %edi, %eax
    imull   %eax, %edi
    addl    $3, %eax
    imull   %esi, %eax
    addl    %esi, %edi
    leal    (%rax,%rdi), %edx
    imull   %edi, %edx
    leal    (%rdx,%rax), %eax
    ret
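Reading the listing above back into C gives roughly the following. This is a hand translation, not compiler output: the name test_factored, the temporaries and the self_check helper are made up for illustration. It shows that the compiler computes a*a once and rewrites a*b*a+3*b as (a*a+3)*b.

```c
#include <assert.h>

/* test2 from the question, used as the reference. */
int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}

/* Hand translation of the listing; the comments name the matching instructions. */
int test_factored(int a, int b)
{
    int aa = a * a;          /* movl %edi,%eax ; imull %edi,%eax */
    int x  = aa * a + b;     /* imull %eax,%edi ; addl %esi,%edi */
    int y  = (aa + 3) * b;   /* addl $3,%eax ; imull %esi,%eax   */
    return (y + x) * x + y;  /* leal ; imull %edi,%edx ; leal    */
}

/* Hypothetical spot check on a few small inputs (no int overflow). */
void self_check(void)
{
    assert(test_factored(0, 0) == test2(0, 0));
    assert(test_factored(3, 4) == test2(3, 4));
    assert(test_factored(-2, 5) == test2(-2, 5));
    assert(test_factored(7, -3) == test2(7, -3));
}
```

That is three multiplications for x and y plus one for the final product, compared with the many more written out in the preprocessed test1.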

Upvotes: 4
