MetallicPriest

Reputation: 30815

Why is gcc using jmp to call a function in the optimized version

When I disassembled my program, I saw that gcc was using jmp for the second pthread_barrier_wait call when compiled with -O3. Why is that?

What advantage does it get by using jmp instead of call? What trick is the compiler playing here? I guess it's performing tail call optimization.

By the way I'm using static linking here.

__attribute__ ((noinline)) void my_pthread_barrier_wait( 
    volatile int tid, pthread_barrier_t *pbar ) 
{
    pthread_barrier_wait( pbar );
    if ( tid == 0 )
    {
        if ( !rollbacked )
        {
            take_checkpoint_or_rollback( ++iter == 4 );
        }
    }
    //getcontext( &context[tid] );
    SETJMP( tid );
    asm("addr2jmp:"); 
    pthread_barrier_wait( pbar );
    // My suspicion was right, gcc was performing tail call optimization, 
    // which was messing up with my SETJMP/LONGJMP implementation, so here I
    // put a dummy function to avoid that.
    dummy_var = dummy_func();
}

Upvotes: 7

Views: 1765

Answers (5)

Perhaps it was a tail-recursive call; GCC has an optimization pass that handles tail recursion.

But why should you bother? If the called function is an extern function, it is public, and GCC must call it following the ABI's calling conventions.

You should not care if the function was called by a jmp.

It might also be a call to a dynamic library function (i.e. one that goes through the PLT when linking dynamically).

Upvotes: 6

Damon

Reputation: 70186

You will never know, but one of the likely reasons is "cache" (among other reasons such as the already mentioned tail call optimization).

Inlining can make code faster and it can make code slower, because more code means less of it will be in the L1 cache at one time.

A JMP allows the compiler to reuse the same piece of code at little or no cost at all. Modern processors are deeply pipelined, and pipelines go over a JMP without problems (there is no possibility of a misprediction here!). In the average case, it will cost as little as 1-2 cycles, in the best cases zero cycles, because the CPU would have to wait on a previous instruction to retire anyway. This obviously depends totally on the respective, individual code.
The compiler could in principle even do that with several functions that have common parts.

Upvotes: -1

ughoavgfhw

Reputation: 39915

I'm assuming that this is a tail call, meaning either the current function returns the result of the called function unmodified, or (for a function returning void) returns immediately after the call. In either case, a call instruction is not necessary.

The call instruction performs two operations. First, it pushes the address of the instruction after the call onto the stack as the return address. Then it jumps to the destination of the call. ret pops the return address off the stack and jumps to that location.

Since the calling function returns the result of the called function, there is no reason for execution to return to it after the called function finishes. Therefore, whenever possible and if the optimization level permits it, GCC destroys the caller's stack frame before the call, so that the top of the stack holds the return address of the function that called it, and then simply jumps to the called function. The result is that when the called function returns, it returns directly to the original caller instead of to the intermediate function.
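The transformation this describes can be sketched in illustrative x86 assembly (this is a hand-written sketch, not actual gcc output; real output varies by version and flags):

```
; Without tail-call optimization:
; f:
;     ...
;     call  g          ; pushes f's return address, jumps to g
;     ret              ; g came back here; now return to f's caller
;
; With tail-call optimization:
; f:
;     ...
;     jmp   g          ; f's frame is already gone; the return
;                      ; address of f's caller is on top of the
;                      ; stack, so g's own ret goes straight there
```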

Upvotes: 2

glglgl

Reputation: 91119

As you don't show an example, I can only guess: the called function has the same return type as the calling one, and the call looks like

return func2(...)

or both functions have no return type at all (void).

In this case, "we" leave "our" return address on the stack, leaving it to "them" to use it to return to "our" caller.

Upvotes: 12

TJD

Reputation: 11896

jmp has less overhead than call: jmp just jumps, while call pushes a return address onto the stack and then jumps.

Upvotes: 2
