Reputation: 48687
I am learning x86 assembler in order to write a compiler. In particular, I'm taking a variety of simple recursive functions and feeding them through different compilers (OCaml, GCC etc.) in order to get a better understanding of the kinds of assembler generated by different compilers.
I've got a trivial recursive integer Fibonacci function:
int fib(int x) { return (x < 2 ? x : fib(x-1)+fib(x-2)); }
My hand-coded assembly looks like this:
fib:
cmp eax, 2
jl fin
push eax
dec eax
call fib
push eax
mov eax, [esp+4]
add eax, -2
call fib
add eax, [esp]
add esp, 8
fin:
ret
Compiling that function to Intel-syntax assembler using gcc -O2
produces this enigmatic code:
_fib:
push edi
push esi
push ebx
sub esp, 16
mov edi, DWORD PTR [esp+32]
cmp edi, 1
jle L4
mov ebx, edi
xor esi, esi
L3:
lea eax, [ebx-1]
mov DWORD PTR [esp], eax
call _fib
sub ebx, 2
add esi, eax
cmp ebx, 1
jg L3
and edi, 1
L2:
lea eax, [esi+edi]
add esp, 16
pop ebx
pop esi
pop edi
ret
L4:
xor esi, esi
jmp L2
So I guess the calling convention is argument at [esp+4]
and return value in eax
. It starts by pushing edi
, esi
and ebx
. Then it claims another 16 bytes for a stack frame, enough for 4 temporary ints. Then edi
is read from [esp+32]
, which is the argument. If the argument is <=1
then it jumps to L4
which zeroes out (?) esi
before jumping back to L2
which sets eax=esi+edi
which is just the argument edi
. If the argument was >1
then the argument is copied into ebx
and zeroes esi
before falling through into L3
. In L3
, it sets eax=ebx-1
and stores the result (n-1) at esp
in the stack frame before recursing to calculate fib(n-1)
. The result is added to esi
, ebx
is set to n-2
and it loops back to L3
if ebx>1
otherwise it extracts the lower bit of edi
before falling through to L2
.
Why is this code so convoluted (e.g. is there a name for an optimization that has been done that I'm not seeing?)?
The recursive calls fib(n-2)
seem to have been replaced with a loop accumulating in esi
but that call wasn't in tail position so how was this done?
What is the purpose of the and edi, 1
?
What is the purpose of the mov DWORD PTR [esp], eax
?
Why is the stack frame so large?
Can you disassemble this algorithm back into C to make it clearer what is going on?
My preliminary impression is that GCC generates pretty poor x86 assembler. In this case, over 2× more code for equal performance (3.25s for fib(40) on this 1.6GHz Atom for both programs). Is that fair?
Upvotes: 5
Views: 2280
Reputation: 24447
This first part is ensuring registers that should be preserved according to the calling convention are not trashed. I would guess the calling convention used here is cdecl
.
_fib:
push edi
push esi
push ebx
sub esp, 16
DWORD PTR[esp+32]
is your x
:
mov edi, DWORD PTR [esp+32]
cmp edi, 1
jle L4
If x
is less than or equal to 1 (this corresponds to your x < 2
), then go to L4
which is:
L4:
xor esi, esi
jmp L2
This zeroes out esi
and branches to L2
:
L2:
lea eax, [esi+edi]
add esp, 16
pop ebx
pop esi
pop edi
ret
This sets eax
(the return value) with esi+edi
. Since esi
is 0 already, edi
is just loaded in the case of 0 and 1. This corresponds to x < 2 ? x
.
Now we look at the case when x
is not < 2
:
mov ebx, edi
xor esi, esi
L3:
lea eax, [ebx-1]
mov DWORD PTR [esp], eax
call _fib
First, x
is copied to ebx
, then esi
is zeroed.
Next, eax
is set to x - 1
. This value is moved to the top of the stack and _fib
called. This corresponds to fib(x-1)
.
sub ebx, 2
add esi, eax
This subtracts 2 from ebx
(x
). Then eax
(return value from _fib
call is added to esi
, which was set to 0 before). Hence esi
now holds the result of fib(x-1)
.
cmp ebx, 1
jg L3
and edi, 1
ebx
is compared to 1. If it is greater than 1, then we loop back to L3
. Otherwise (the case where it is 0 or 1), we perform and edi, 1
and fall through to L2
(we analysed what this does earlier already). The and edi, 1
is equivalent to performing a %2
on x
.
From a high level, this is what the code does:
x < 2
, then return x
.fib(x-...)
until x
is smaller than 2. In this case, fall through to the x < 2
case.The optimization is that GCC unwinds the cases where x >= 2
by doing them in a loop instead of recursively.
Upvotes: 2
Reputation: 55402
Quite simply, fib(x-2)
is equal to fib(x-3) + fib(x-4)
, fib(x-4)
is equal to fib(x-5) + fib(x-6)
etc. so fib(x)
is being calculated as fib(x-1) + fib(x-3) + fib(x-5) + ... + fib(x&1)
(fib(x&1)
equals x&1
) i.e. gcc has inlined the call to fib(x-2)
, which is quite a clever thing to do to a recursive function.
Upvotes: 3
Reputation: 488213
In addition to the comments above, note that the recursion has been unwound into a tail call by replacing:
return x < 2 ? x : fib(x - 2) + fib(x - 1);
with:
if ((xprime = x) < 2) {
acc = 0;
} else {
/* at this point we know x >= 2 */
acc = 0; /* start with 0 */
while (x > 1) {
acc += fib(x - 1); /* add fib(x-1) */
x -= 2; /* now we'll add fib(x-2) */
}
/* so at this point we know either x==1 or x==0 */
xprime = x == 1 ? 1 : 0; /* ie, x & 1 */
}
return xprime + acc;
I suspect this rather tricky loop arose from multiple optimization steps, not that I have fiddled with gcc optimization since about gcc 2.3 (it's all very different inside now!).
Upvotes: 7