Reputation: 64955

Fastest Linux system call

On an x86-64 Intel system that supports syscall and sysret what's the "fastest" system call from 64-bit user code on a vanilla kernel?

In particular, it must be a system call that exercises the syscall/sysret user <-> kernel transition¹, but does the least amount of work beyond that. It doesn't even need to do the syscall itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.

Such a call could be used to estimate the raw syscall and sysret overhead independent of any work done by the call.

¹ In particular, this excludes things that appear to be system calls but are implemented in the VDSO (e.g., clock_gettime) or are cached by the runtime (e.g., getpid).

Upvotes: 13

Answers (4)

Basile Starynkevitch

Reputation: 1

Some system calls don't even go thru any user->kernel transition, read vdso(7).

I suspect that these VDSO system calls (e.g. time(2), ...) are the fastest. You could claim that there are no "real" system calls.

BTW, you could add a dummy system call to your kernel (e.g. some system call always returning 0, or a hello world system call, see also this) and measure it.

I suspect (without having benchmarked it) that getpid(2) should be a very fast system call, because the only thing it needs to do is fetch some data from the kernel memory. And AFAIK, it is a genuine system call, not using VDSO techniques. And you could use syscall(2) to avoid its caching done by your libc and forcing the genuine system call.

I maintain my position (given in a comment to your initial question): without actual motivation your question does not make any concrete sense. Then I still do think that syscall(2) doing getpid is measuring the typical overhead to make a system call (and I guess you really care about that one). In practice almost all system calls are doing more work that such a getpid (or getppid).

Upvotes: 3

Peter Cordes

Reputation: 364448

Use an invalid system call number so the dispatching code simply returns with
eax = -ENOSYS instead of dispatching to a system-call handling function at all.

Unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.

This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.

Native 64-bit syscall is handled more efficiently inside the kernel. (Update; with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2).

My What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? Q&A gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer is just taking the relevant parts of that.

The links in that answer and this are to Linux 4.12 sources, which doesn't contain the Meltdown mitigation page-table manipulation, so that will be significant extra overhead.

int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.

After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then call do_fast_syscall_32 which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C).

do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then

   ...
    if (likely(nr < IA32_NR_syscalls)) {
        regs->ax = ia32_sys_call_table[nr]( ... arg );
    }

    syscall_return_slowpath(regs);
}

AFAIK, the kernel can use sysexit after this function returns.

So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.

If you want to really test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to put a much simpler entry point without checking for tracing or pushing / popping all the registers.

You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saved on the user-space stack which Linux's current sysenter ABI does; it has to get_user() to read the EIP value it should return to.

Of if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll be getting one extra branch miss from the range-check if branch predictors are hot for that branch based on normal 32-bit system calls.

Upvotes: 5

janneb

Reputation: 37208

In this benchmark by Brendan Gregg (linked from this blog post which is interesting reading on the topic) close(999) (or some other fd not in use) is recommended.

Upvotes: 3

Timothy Baldwin

Reputation: 3675

One that doesn't exist, and therefore returns -ENOSYS quickly.

From arch/x86/entry/entry_64.S:

#if __SYSCALL_MASK == ~0
    cmpq    $__NR_syscall_max, %rax
#else
    andl    $__SYSCALL_MASK, %eax
    cmpl    $__NR_syscall_max, %eax
#endif
    ja  1f              /* return -ENOSYS (already in pt_regs->ax) */
    movq    %r10, %rcx

    /*
     * This call instruction is handled specially in stub_ptregs_64.
     * It might end up jumping to the slow path.  If it jumps, RAX
     * and all argument registers are clobbered.
     */
#ifdef CONFIG_RETPOLINE
    movq    sys_call_table(, %rax, 8), %rax
    call    __x86_indirect_thunk_rax
#else
    call    *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:

    movq    %rax, RAX(%rsp)
1:

Upvotes: 11

Fastest Linux system call

Answers (4)

Related Questions