Reputation: 5746
I have two files:
#include <stdio.h>
static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }
int main()
{
unsigned int input;
scanf("%u", &input);
switch (input)
{
case 0: print0(); break;
case 1: print1(); break;
case 2: print2(); break;
case 3: print3(); break;
case 4: print4(); break;
}
return 0;
}
and
#include <stdio.h>
static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }
int main()
{
unsigned int input;
scanf("%u", &input);
static void (*jt[])() = { print0, print1, print2, print3, print4 };
jt[input]();
return 0;
}
I expected them to be compiled to almost identical assembly code. In both cases jump tables are generated, but the calls in the first file are represented by jmp
, while the calls in the second one by call
. Why doesn't compiler optimise call
s? Is is possible to hint gcc that I would like to see jmp
s instead of call
s?
Compiled with gcc -Wall -Winline -O3 -S -masm=intel
, GCC version 4.6.2. GCC 4.8.0 produces slightly less code, but the problem still persists.
UPD: Defining jt
as const void (* const jt[])() = { print0, print1, print2, print3, print4 };
and making the functions static const inline
didn't help: http://ideone.com/97SU0
Upvotes: 10
Views: 854
Reputation: 18227
The first case (through the switch()
) creates the following for me (Linux x86_64 / gcc 4.4):
400570: ff 24 c5 b8 06 40 00 jmpq *0x4006b8(,%rax,8)
[ ... ]
400580: 31 c0 xor %eax,%eax
400582: e8 e1 fe ff ff callq 400468 <printf@plt>
400587: 31 c0 xor %eax,%eax
400589: 48 83 c4 08 add $0x8,%rsp
40058d: c3 retq
40058e: bf a4 06 40 00 mov $0x4006a4,%edi
400593: eb eb jmp 400580 <main+0x30>
400595: bf a9 06 40 00 mov $0x4006a9,%edi
40059a: eb e4 jmp 400580 <main+0x30>
40059c: bf ad 06 40 00 mov $0x4006ad,%edi
4005a1: eb dd jmp 400580 <main+0x30>
4005a3: bf b1 06 40 00 mov $0x4006b1,%edi
4005a8: eb d6 jmp 400580 <main+0x30>
[ ... ]
Contents of section .rodata:
[ ... ]
4006b8 8e054000 p ... ]
Note the .rodata
contents @4006b8
are printed network byte order (for whatever reason ...), the value is 40058e
which is within main
above - where the arg-initializer/jmp
block starts. All the mov
/jmp
pairs in there use eight bytes, hence the (,%rax,8)
indirection. In this case, the sequence is therefore:
jmp <to location that sets arg for printf()>
...
jmp <back to common location for the printf() invocation>
...
call <printf>
...
retq
This means the compiler has actually optimized out the static
call sites - and instead merged them all into a single, inlined printf()
call. The table use here is the jmp ...(,%rax,8)
instruction, and the table contained within the program code.
The second one (with the explicitly-created table) does the following for me:
0000000000400550 <print0>:
[ ... ]
0000000000400560 <print1>:
[ ... ]
0000000000400570 <print2>:
[ ... ]
0000000000400580 <print3>:
[ ... ]
0000000000400590 <print4>:
[ ... ]
00000000004005a0 <main>:
4005a0: 48 83 ec 08 sub $0x8,%rsp
4005a4: bf d4 06 40 00 mov $0x4006d4,%edi
4005a9: 31 c0 xor %eax,%eax
4005ab: 48 8d 74 24 04 lea 0x4(%rsp),%rsi
4005b0: e8 c3 fe ff ff callq 400478 <scanf@plt>
4005b5: 8b 54 24 04 mov 0x4(%rsp),%edx
4005b9: 31 c0 xor %eax,%eax
4005bb: ff 14 d5 60 0a 50 00 callq *0x500a60(,%rdx,8)
4005c2: 31 c0 xor %eax,%eax
4005c4: 48 83 c4 08 add $0x8,%rsp
4005c8: c3 retq
[ ... ]
500a60 50054000 00000000 60054000 00000000 P.@.....`.@.....
500a70 70054000 00000000 80054000 00000000 p.@.......@.....
500a80 90054000 00000000 ..@.....
Again, note the inverted byte order as objdump prints the data section - if you turn these around you get the function adresses for print[0-4]()
.
The compiler is invoking the target through an indirect call
- i.e. the table usage is directly in the call
instruction, and the table has _explicitly been created as data.
Edit:
If you change the source like this:
#include <stdio.h>
static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }
void main(int argc, char **argv)
{
static void (*jt[])() = { print0, print1, print2, print3, print4 };
return jt[argc]();
}
the created assembly for main()
becomes:
0000000000400550 <main>:
400550: 48 63 ff movslq %edi,%rdi
400553: 31 c0 xor %eax,%eax
400555: 4c 8b 1c fd e0 09 50 mov 0x5009e0(,%rdi,8),%r11
40055c: 00
40055d: 41 ff e3 jmpq *%r11d
which looks more like what you wanted ?
The reason for this is that you need "stackless" funcs to be able to do this - tail-recursion (returning from a function via jmp
instead of ret
) is only possible if you either have done all stack cleanup already, or don't have to do any because you have nothing to clean up on the stack. The compiler can (but needs not) choose to clean up before the last function call (in which case the last call can be made by jmp
), but that's only possible if you return either the value you got from that function, or if you "return void
". And, as said, if you actually use stack (like your example does for the input
variable) there's nothing that can make the compiler force to undo this in such a way that tail-recursion results.
Edit2:
The disassembly for the first example, with the same changes (argc
instead of input
and forcing void main
- no standard-conformance comments please this is a demo), results in the following assembly:
0000000000400500 <main>:
400500: 83 ff 04 cmp $0x4,%edi
400503: 77 0b ja 400510 <main+0x10>
400505: 89 f8 mov %edi,%eax
400507: ff 24 c5 58 06 40 00 jmpq *0x400658(,%rax,8)
40050e: 66 data16
40050f: 90 nop
400510: f3 c3 repz retq
400512: bf 3c 06 40 00 mov $0x40063c,%edi
400517: 31 c0 xor %eax,%eax
400519: e9 0a ff ff ff jmpq 400428 <printf@plt>
40051e: bf 41 06 40 00 mov $0x400641,%edi
400523: 31 c0 xor %eax,%eax
400525: e9 fe fe ff ff jmpq 400428 <printf@plt>
40052a: bf 46 06 40 00 mov $0x400646,%edi
40052f: 31 c0 xor %eax,%eax
400531: e9 f2 fe ff ff jmpq 400428 <printf@plt>
400536: bf 4a 06 40 00 mov $0x40064a,%edi
40053b: 31 c0 xor %eax,%eax
40053d: e9 e6 fe ff ff jmpq 400428 <printf@plt>
400542: bf 4e 06 40 00 mov $0x40064e,%edi
400547: 31 c0 xor %eax,%eax
400549: e9 da fe ff ff jmpq 400428 <printf@plt>
40054e: 90 nop
40054f: 90 nop
This is worse in one way (does two jmp
instead of one) but better in another (because it eliminates the static
functions and inlines the code). Optimization-wise, the compiler has pretty much done the same thing.
Upvotes: 2
Reputation: 81237
Does the code for the latter function do nothing between the indirect call
and the succeeding ret
? I would not be surprised if the address computation for the indirect call makes use of a register whose value which the latter function is required to preserve (meaning it must save the value before the computation, and restore it some time after). While it might be possible to move the register-restore code before the indirect call, a compiler can only perform such code motion in those cases which it has been been programmed to recognize as legitimate opportunities.
Also, while I don't think it matters, I'd suggest that the routines shouldn't be inline
since the compiler won't be able to execute them that way.
Upvotes: 1
Reputation: 340326
Have you profiled the different code? I think an argument might be made that the indirect call is optimized. The following analysis was done with GCC 4.6.1 targeting an x64 platform (MinGW).
If you look at what happens when jt[input]()
is used, a call results in the following sequence of code being executed:
printX()
functionsprintX()
function sets up the argument for printf()
, thenprintf()
printf()
call will return directly to the site of the `indirect call.For a total of 3 branches.
When you use the switch statement what happens is:
printX()
calls)printf()
callprintf()
printf()
call will return to the 'case handler' whichFor a total of 4 branches (in the general case).
In both situations you have:
- an indirect branch (for one it's a call, in the other a jump)
- a branch to the printf()
(for one it's a jump, in the other a call)
- a branch back to the call site
However, when the switch
statement is used there's an additional branch to get to the 'end' of the switch (in most cases).
Now, it's possible that if you actually profiled things, the processor might handle an indirect jump faster than an indirect call, but I'd guess that even if that's the case the additional branch used in the switch-based code would still push the scales in favor of the call through the function pointer.
For those interested, here's the assembler generated using jk[input]();
(both examples compiled with GCC MinGW 4.6.1 targeting x64, options used were -Wall -Winline -O3 -S -masm=intel
):
print0:
.seh_endprologue
lea rcx, .LC4[rip]
jmp printf
.seh_endproc
// similar code is generated for each printX() function
// ...
main:
sub rsp, 56
.seh_stackalloc 56
.seh_endprologue
call __main
lea rdx, 44[rsp]
lea rcx, .LC5[rip]
call scanf
mov edx, DWORD PTR 44[rsp]
lea rax, jt.2423[rip]
call [QWORD PTR [rax+rdx*8]]
xor eax, eax
add rsp, 56
ret
And here is the code generated for the switch-based implementation:
main:
sub rsp, 56
.seh_stackalloc 56
.seh_endprologue
call __main
lea rdx, 44[rsp]
lea rcx, .LC0[rip]
call scanf
cmp DWORD PTR 44[rsp], 4
ja .L2
mov edx, DWORD PTR 44[rsp]
lea rax, .L8[rip]
movsx rdx, DWORD PTR [rax+rdx*4]
add rax, rdx
jmp rax
.section .rdata,"dr"
.align 4
.L8:
.long .L3-.L8
.long .L4-.L8
.long .L5-.L8
.long .L6-.L8
.long .L7-.L8
.section .text.startup,"x"
.L7:
lea rcx, .LC5[rip]
call printf
.p2align 4,,10
.L2:
xor eax, eax
add rsp, 56
ret
.L6:
lea rcx, .LC4[rip]
call printf
jmp .L2
// all the other cases are essentially the same as the one above (.L6)
// where they jump to .L2 to exit instead of simply falling through to it
// like .L7 does
Upvotes: 1
Reputation: 726849
My guess is that this optimization has to do with the fact that you have a return
statement immediately after your switch
: optimizer realizes that it could piggyback on the returns embedded into your print0
..print4
functions, and reduces call
to jmp
; the ret
the CPU hits inside the selected printN
serves as the return from the main
.
Try inserting some code after the switch to see if the compiler would replace jmp
with call
.
#include <stdio.h>
static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }
int main()
{
unsigned int input;
scanf("%u", &input);
switch (input)
{
case 0: print0(); break;
case 1: print1(); break;
case 2: print2(); break;
case 3: print3(); break;
case 4: print4(); break;
}
/* Inserting this line should force the compiler to use call */
printf("\nDone");
return 0;
}
EDIT :
Your code on ideone has a jmp
for a different reason: it's an equivalent of this:
static const char* LC0 ="Zero";
static const char* LC1 ="One";
static const char* LC2 ="Two";
static const char* LC3 ="Three";
static const char* LC4 ="Four";
int main()
{
unsigned int input;
scanf("%u", &input);
switch (input)
{
case 0: printf(LC0); break;
case 1: printf(LC1); break;
case 2: printf(LC2); break;
case 3: printf(LC3); break;
case 4: printf(LC4); break;
}
printf("\nDone");
return 0;
}
Upvotes: 2
Reputation: 92321
Compiler writers have a lot of work to do. Obviously they prioritize the work that has the biggest and fastest payoff.
Switch statements are common in all kinds of code, so any optimizations performed on them will have an effect on lots of programs.
This code
jt[input]();
is a lot less common, and therefore a lot longer down on the compiler designers' TODO-list. Perhaps they haven't (yet) found it worth the effort to try to optimize it? Will that win them any known benchmarks? Or improve some widely used codebase?
Upvotes: 8
Reputation: 118600
Because the array of function pointers is mutable. The compiler has decided it can't assume the pointers won't be changed. You might find the assembly different for C++, and/or make jt const.
Upvotes: 4