Reputation: 93
I just compiled the following C code to test out the gcc optimizer (using the -O3 flag), expecting that both functions would end up generating the same set of assembly instructions:
int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
    return x*x+x*y+y;
#undef x
#undef y
}
int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}
But I was surprised to find that they generated slightly different assembly, and that the execution time for test1 (the code using the preprocessor instead of local variables) was a bit faster.
I've heard people say that the compiler can optimize better than humans can, and that you should tell it exactly what you want it to do; man I guess they weren't kidding. I thought the compiler was supposed to kind of guess at the programmer's intended use of local variables and replace their use if necessary... is that a false assumption?
When writing code for performance, are you better off using preprocessor definitions for the sake of readability rather than local variables? I know it looks ugly as hell, but apparently it actually makes a difference, unless I'm missing something.
Here's the assembly I got, using "gcc test.c -O3 -S". My gcc version is 4.8.2; it looks like the assembly output is the same for most versions of gcc, but not for the 4.7 or 4.8 versions, for some reason.
test1:
movl %edi, %eax
movl %edi, %edx
leal (%rsi,%rsi,2), %ecx
imull %edi, %eax
imull %esi, %edx
imull %edi, %eax
imull %edi, %edx
addl %esi, %eax
addl %ecx, %edx
leal (%rax,%rdx), %ecx
imull %ecx, %eax
addl %edx, %eax
ret
test2:
movl %edi, %eax
leal (%rsi,%rsi,2), %edx
imull %edi, %eax
imull %edi, %eax
leal (%rax,%rsi), %ecx
movl %edi, %eax
imull %esi, %eax
imull %edi, %eax
addl %eax, %edx
leal (%rcx,%rdx), %eax
imull %ecx, %eax
addl %edx, %eax
ret
Upvotes: 4
Views: 198
Reputation: 277
Regarding "test1 faster than test2": the results should not be identical. The preprocessor transforms the source before the compiler proper ever sees it, whatever the options.
You can inspect the preprocessor's output by running, for example, gcc -E main.c (assuming you are using a GNU compiler and the sources above are stored in a file main.c). The relevant parts become:
int test1(int a, int b)
{
    return (a*a*a+b)*(a*a*a+b)+(a*a*a+b)*(a*b*a+3*b)+(a*b*a+3*b);
}
int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}
As written, the first version contains roughly twice as many arithmetic operations as the second one. Then the compiler and its optimiser come into play …
(NB: Ideally you would count the CPU cycles needed by the generated assembly. Use e.g. gcc -S main.c and look at main.s; you probably know that. Version 2 should "win" in that case.)
In order to compare our results, you should post your test code. When testing, you need to average out short-term fluctuations and the limited time granularity of your CPU, so you will probably run the same code in a loop:
int i=100000000;
while (--i>0) {
    int r;
    r = test1(3, 4);
}
Without the optimiser, test1 runs about 20% slower than test2.
However, the optimiser also analyses the calling code, and it can optimise away repeated calls with identical arguments or calls whose result is never used (r in this case).
Therefore you must trick the compiler into actually making the calls, like this:
int r = 0;
while (--i>0) {
    r += test1(3, i);
}
When I tried that, I got identical runtimes to within about a percent, i.e. sometimes time1 is faster and sometimes time2 is faster when I repeat the comparison several times.
You should look into the optimiser documentation to understand which optimising options you need to outsmart in your tests.
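A minimal sketch of such a timing harness follows. Several parts are assumptions for illustration and do not come from the posts above: the helper name time_calls, the volatile sink, and the i & 15 mask on the second argument. The mask keeps b small, because with b running up to 100000000 the products overflow a 32-bit int, which is undefined behaviour in C. The compiler may still inline the calls when everything sits in one file, so moving test1 and test2 into a separate translation unit is probably the more robust setup.

```c
#include <stdio.h>
#include <time.h>

/* The two functions from the question, unchanged in behaviour. */
int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
    return x*x+x*y+y;
#undef x
#undef y
}

int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}

/* Results are stored here so the optimiser cannot treat them as unused. */
static volatile unsigned sink;

/* Hypothetical helper: run `iterations` calls of f and return the
   elapsed CPU time in seconds, as measured by clock(). */
double time_calls(int (*f)(int, int), int iterations)
{
    unsigned acc = 0;
    clock_t start;
    int i;

    start = clock();
    for (i = 0; i < iterations; ++i) {
        /* The argument changes every iteration, so the call cannot be
           hoisted out of the loop; i & 15 keeps the arithmetic in range. */
        acc += (unsigned)f(3, i & 15);
    }
    sink = acc;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(test1, 100000000) and time_calls(test2, 100000000) a few times each from main, and printing the returned seconds with printf, gives the kind of repeated comparison described above.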
And I confirm what @Ville Krumlinde states: I get identical code in the assembly output, even with -O level optimisation (gcc 4.4.7 on my desktop). The code contains only 9 assembler operations, which makes me believe that the optimiser "knows" enough about algebraic optimisation to simplify your formulas.
So you may simply have been misled by an optimiser artefact in your test harness after all.
Upvotes: 1
Reputation: 7131
Trying your code at godbolt, I get identical assembly for both functions with GCC, even with just the -O setting. Only by omitting the -O flag do I get different results. This really is expected, because the code is trivial to optimize.
Here is the generated assembly using gcc 4.4.7 with the -O flag. As you can see, the two are identical.
test1(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret
test2(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret
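Read back into C, the nine instructions before each ret in the listing above correspond roughly to the factoring below. This is a hand-written sketch for illustration (test_factored and a2 are made-up names), not something gcc emits as source, and it only agrees with test2 as long as the int arithmetic does not overflow.

```c
/* test2 from the question, kept for comparison. */
int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}

/* Hypothetical hand-factored form that mirrors the listing. */
int test_factored(int a, int b)
{
    int a2 = a * a;          /* movl, imull:    a*a                          */
    int x  = a2 * a + b;     /* imull, addl:    a*a*a + b                    */
    int y  = (a2 + 3) * b;   /* addl $3, imull: a*b*a + 3*b == (a*a + 3)*b   */
    return (x + y) * x + y;  /* leal, imull, leal: x*x + x*y + y             */
}
```

For example, both test2(3, 4) and test_factored(3, 4) evaluate to 2497.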
Upvotes: 4