skink
skink

Reputation: 5746

Why calls when jmps would suffice?

I have two files:

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

int main()
{
    unsigned int input;
    scanf("%u", &input);

    switch (input)
    {
        case 0: print0(); break;
        case 1: print1(); break;
        case 2: print2(); break;
        case 3: print3(); break;
        case 4: print4(); break;
    }
    return 0;
}

and

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

int main()
{
    unsigned int input;
    scanf("%u", &input);

    static void (*jt[])() = { print0, print1, print2, print3, print4 };
    jt[input]();
    return 0;
}

I expected them to be compiled to almost identical assembly code. In both cases jump tables are generated, but the calls in the first file are represented by jmp, while the calls in the second one by call. Why doesn't compiler optimise calls? Is is possible to hint gcc that I would like to see jmps instead of calls?

Compiled with gcc -Wall -Winline -O3 -S -masm=intel, GCC version 4.6.2. GCC 4.8.0 produces slightly less code, but the problem still persists.

UPD: Defining jt as const void (* const jt[])() = { print0, print1, print2, print3, print4 }; and making the functions static const inline didn't help: http://ideone.com/97SU0

Upvotes: 10

Views: 854

Answers (6)

FrankH.
FrankH.

Reputation: 18227

The first case (through the switch()) creates the following for me (Linux x86_64 / gcc 4.4):

  400570:       ff 24 c5 b8 06 40 00    jmpq   *0x4006b8(,%rax,8)
[ ... ]
  400580:       31 c0                   xor    %eax,%eax
  400582:       e8 e1 fe ff ff          callq  400468 <printf@plt>
  400587:       31 c0                   xor    %eax,%eax
  400589:       48 83 c4 08             add    $0x8,%rsp
  40058d:       c3                      retq
  40058e:       bf a4 06 40 00          mov    $0x4006a4,%edi
  400593:       eb eb                   jmp    400580 <main+0x30>
  400595:       bf a9 06 40 00          mov    $0x4006a9,%edi
  40059a:       eb e4                   jmp    400580 <main+0x30>
  40059c:       bf ad 06 40 00          mov    $0x4006ad,%edi
  4005a1:       eb dd                   jmp    400580 <main+0x30>
  4005a3:       bf b1 06 40 00          mov    $0x4006b1,%edi
  4005a8:       eb d6                   jmp    400580 <main+0x30>
[ ... ]
Contents of section .rodata:
[ ... ]
 4006b8 8e054000 p ... ]

Note the .rodata contents @4006b8 are printed network byte order (for whatever reason ...), the value is 40058e which is within main above - where the arg-initializer/jmp block starts. All the mov/jmp pairs in there use eight bytes, hence the (,%rax,8) indirection. In this case, the sequence is therefore:

jmp <to location that sets arg for printf()>
...
jmp <back to common location for the printf() invocation>
...
call <printf>
...
retq

This means the compiler has actually optimized out the static call sites - and instead merged them all into a single, inlined printf() call. The table use here is the jmp ...(,%rax,8) instruction, and the table contained within the program code.

The second one (with the explicitly-created table) does the following for me:

0000000000400550 <print0>:
[ ... ]
0000000000400560 <print1>:
[ ... ]
0000000000400570 <print2>:
[ ... ]
0000000000400580 <print3>:
[ ... ]
0000000000400590 <print4>:
[ ... ]
00000000004005a0 <main>:
  4005a0:       48 83 ec 08             sub    $0x8,%rsp
  4005a4:       bf d4 06 40 00          mov    $0x4006d4,%edi
  4005a9:       31 c0                   xor    %eax,%eax
  4005ab:       48 8d 74 24 04          lea    0x4(%rsp),%rsi
  4005b0:       e8 c3 fe ff ff          callq  400478 <scanf@plt>
  4005b5:       8b 54 24 04             mov    0x4(%rsp),%edx
  4005b9:       31 c0                   xor    %eax,%eax
  4005bb:       ff 14 d5 60 0a 50 00    callq  *0x500a60(,%rdx,8)
  4005c2:       31 c0                   xor    %eax,%eax
  4005c4:       48 83 c4 08             add    $0x8,%rsp
  4005c8:       c3                      retq
[ ... ]
 500a60 50054000 00000000 60054000 00000000  P.@.....`.@.....
 500a70 70054000 00000000 80054000 00000000  p.@.......@.....
 500a80 90054000 00000000                    ..@.....

Again, note the inverted byte order as objdump prints the data section - if you turn these around you get the function adresses for print[0-4]().

The compiler is invoking the target through an indirect call - i.e. the table usage is directly in the call instruction, and the table has _explicitly been created as data.

Edit:
If you change the source like this:

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

void main(int argc, char **argv)
{
    static void (*jt[])() = { print0, print1, print2, print3, print4 };
    return jt[argc]();
}

the created assembly for main() becomes:

0000000000400550 <main>:
  400550:       48 63 ff                movslq %edi,%rdi
  400553:       31 c0                   xor    %eax,%eax
  400555:       4c 8b 1c fd e0 09 50    mov    0x5009e0(,%rdi,8),%r11
  40055c:       00
  40055d:       41 ff e3                jmpq   *%r11d

which looks more like what you wanted ?

The reason for this is that you need "stackless" funcs to be able to do this - tail-recursion (returning from a function via jmp instead of ret) is only possible if you either have done all stack cleanup already, or don't have to do any because you have nothing to clean up on the stack. The compiler can (but needs not) choose to clean up before the last function call (in which case the last call can be made by jmp), but that's only possible if you return either the value you got from that function, or if you "return void". And, as said, if you actually use stack (like your example does for the input variable) there's nothing that can make the compiler force to undo this in such a way that tail-recursion results.

Edit2:

The disassembly for the first example, with the same changes (argc instead of input and forcing void main - no standard-conformance comments please this is a demo), results in the following assembly:

0000000000400500 <main>:
  400500:       83 ff 04                cmp    $0x4,%edi
  400503:       77 0b                   ja     400510 <main+0x10>
  400505:       89 f8                   mov    %edi,%eax
  400507:       ff 24 c5 58 06 40 00    jmpq   *0x400658(,%rax,8)
  40050e:       66                      data16
  40050f:       90                      nop
  400510:       f3 c3                   repz retq
  400512:       bf 3c 06 40 00          mov    $0x40063c,%edi
  400517:       31 c0                   xor    %eax,%eax
  400519:       e9 0a ff ff ff          jmpq   400428 <printf@plt>
  40051e:       bf 41 06 40 00          mov    $0x400641,%edi
  400523:       31 c0                   xor    %eax,%eax
  400525:       e9 fe fe ff ff          jmpq   400428 <printf@plt>
  40052a:       bf 46 06 40 00          mov    $0x400646,%edi
  40052f:       31 c0                   xor    %eax,%eax
  400531:       e9 f2 fe ff ff          jmpq   400428 <printf@plt>
  400536:       bf 4a 06 40 00          mov    $0x40064a,%edi
  40053b:       31 c0                   xor    %eax,%eax
  40053d:       e9 e6 fe ff ff          jmpq   400428 <printf@plt>
  400542:       bf 4e 06 40 00          mov    $0x40064e,%edi
  400547:       31 c0                   xor    %eax,%eax
  400549:       e9 da fe ff ff          jmpq   400428 <printf@plt>
  40054e:       90                      nop
  40054f:       90                      nop

This is worse in one way (does two jmp instead of one) but better in another (because it eliminates the static functions and inlines the code). Optimization-wise, the compiler has pretty much done the same thing.

Upvotes: 2

supercat
supercat

Reputation: 81237

Does the code for the latter function do nothing between the indirect call and the succeeding ret? I would not be surprised if the address computation for the indirect call makes use of a register whose value which the latter function is required to preserve (meaning it must save the value before the computation, and restore it some time after). While it might be possible to move the register-restore code before the indirect call, a compiler can only perform such code motion in those cases which it has been been programmed to recognize as legitimate opportunities.

Also, while I don't think it matters, I'd suggest that the routines shouldn't be inline since the compiler won't be able to execute them that way.

Upvotes: 1

Michael Burr
Michael Burr

Reputation: 340326

Have you profiled the different code? I think an argument might be made that the indirect call is optimized. The following analysis was done with GCC 4.6.1 targeting an x64 platform (MinGW).

If you look at what happens when jt[input]() is used, a call results in the following sequence of code being executed:

  • the indirect call to one of the printX() functions
  • the printX() function sets up the argument for printf(), then
  • jumps to printf()
  • the printf() call will return directly to the site of the `indirect call.

For a total of 3 branches.

When you use the switch statement what happens is:

  • an indirect jump to a bit of custom code for each case (inlined printX() calls)
  • the 'case handler' loads the appropriate argument for the printf() call
  • calls printf()
  • the printf() call will return to the 'case handler' which
  • jumps to the exit point of the switch (except for one case handler where the exit code is inlined - the other cases jump there)

For a total of 4 branches (in the general case).

In both situations you have: - an indirect branch (for one it's a call, in the other a jump) - a branch to the printf() (for one it's a jump, in the other a call) - a branch back to the call site

However, when the switch statement is used there's an additional branch to get to the 'end' of the switch (in most cases).

Now, it's possible that if you actually profiled things, the processor might handle an indirect jump faster than an indirect call, but I'd guess that even if that's the case the additional branch used in the switch-based code would still push the scales in favor of the call through the function pointer.


For those interested, here's the assembler generated using jk[input](); (both examples compiled with GCC MinGW 4.6.1 targeting x64, options used were -Wall -Winline -O3 -S -masm=intel):

print0:
    .seh_endprologue
    lea rcx, .LC4[rip]
    jmp printf
    .seh_endproc

// similar code is generated for each printX() function
// ...

main:
    sub rsp, 56
    .seh_stackalloc 56
    .seh_endprologue
    call    __main
    lea rdx, 44[rsp]
    lea rcx, .LC5[rip]
    call    scanf
    mov edx, DWORD PTR 44[rsp]
    lea rax, jt.2423[rip]
    call    [QWORD PTR [rax+rdx*8]]
    xor eax, eax
    add rsp, 56
    ret

And here is the code generated for the switch-based implementation:

main:
    sub rsp, 56
    .seh_stackalloc 56
    .seh_endprologue
    call    __main
    lea rdx, 44[rsp]
    lea rcx, .LC0[rip]
    call    scanf
    cmp DWORD PTR 44[rsp], 4
    ja  .L2
    mov edx, DWORD PTR 44[rsp]
    lea rax, .L8[rip]
    movsx   rdx, DWORD PTR [rax+rdx*4]
    add rax, rdx
    jmp rax
    .section .rdata,"dr"
    .align 4
.L8:
    .long   .L3-.L8
    .long   .L4-.L8
    .long   .L5-.L8
    .long   .L6-.L8
    .long   .L7-.L8
    .section    .text.startup,"x"
.L7:
    lea rcx, .LC5[rip]
    call    printf
    .p2align 4,,10


.L2:
    xor eax, eax
    add rsp, 56
    ret

.L6:
    lea rcx, .LC4[rip]
    call    printf
    jmp .L2

     // all the other cases are essentially the same as the one above (.L6)
     // where they jump to .L2 to exit instead of simply falling through to it
     // like .L7 does

Upvotes: 1

Sergey Kalinichenko
Sergey Kalinichenko

Reputation: 726849

My guess is that this optimization has to do with the fact that you have a return statement immediately after your switch: optimizer realizes that it could piggyback on the returns embedded into your print0..print4 functions, and reduces call to jmp; the ret the CPU hits inside the selected printN serves as the return from the main.

Try inserting some code after the switch to see if the compiler would replace jmp with call.

#include <stdio.h>

static inline void print0() { printf("Zero"); }
static inline void print1() { printf("One"); }
static inline void print2() { printf("Two"); }
static inline void print3() { printf("Three"); }
static inline void print4() { printf("Four"); }

int main()
{
    unsigned int input;
    scanf("%u", &input);

    switch (input)
    {
        case 0: print0(); break;
        case 1: print1(); break;
        case 2: print2(); break;
        case 3: print3(); break;
        case 4: print4(); break;
    }
    /* Inserting this line should force the compiler to use call */
    printf("\nDone");
    return 0;
}

EDIT : Your code on ideone has a jmp for a different reason: it's an equivalent of this:

static const char* LC0 ="Zero";
static const char* LC1 ="One";
static const char* LC2 ="Two";
static const char* LC3 ="Three";
static const char* LC4 ="Four";

int main()
{
    unsigned int input;
    scanf("%u", &input);

    switch (input)
    {
        case 0: printf(LC0); break;
        case 1: printf(LC1); break;
        case 2: printf(LC2); break;
        case 3: printf(LC3); break;
        case 4: printf(LC4); break;
    }
    printf("\nDone");
    return 0;
}

Upvotes: 2

Bo Persson
Bo Persson

Reputation: 92321

Compiler writers have a lot of work to do. Obviously they prioritize the work that has the biggest and fastest payoff.

Switch statements are common in all kinds of code, so any optimizations performed on them will have an effect on lots of programs.

This code

jt[input](); 

is a lot less common, and therefore a lot longer down on the compiler designers' TODO-list. Perhaps they haven't (yet) found it worth the effort to try to optimize it? Will that win them any known benchmarks? Or improve some widely used codebase?

Upvotes: 8

Matt Joiner
Matt Joiner

Reputation: 118600

Because the array of function pointers is mutable. The compiler has decided it can't assume the pointers won't be changed. You might find the assembly different for C++, and/or make jt const.

Upvotes: 4

Related Questions