Naftuli Kay

Reputation: 91830

When a syscall is called by a userspace program, how does execution transfer back to kernelspace?

I've been studying a lot about the ABI for x86-64, writing Assembly, and studying how the stack and heap work.

Given the following code:

#include <linux/seccomp.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    // execute the seccomp syscall (could be any syscall)
    seccomp(...);

    return 0;
}

In Assembly for x86-64, this would do the following:

  1. Align the stack pointer (as it's off by 8 bytes by default).
  2. Set up registers and the stack for any arguments to the call to seccomp.
  3. Execute the Assembly instruction call seccomp.
  4. When seccomp returns, it's likely that the C runtime will call exit(0), as far as I know.

I'd like to talk about what happens between step three and four above.

I currently have my stack for the currently running process with its own data in registers and on the stack. How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?

I believe I heard somewhere that syscalls don't happen immediately but on certain CPU ticks or interrupts. Is this true? How does this happen, for example, on Linux?

Upvotes: 4

Views: 1103

Answers (2)

Peter Cordes

Reputation: 365707

syscalls don't happen immediately but on certain CPU ticks or interrupts

Totally wrong. The CPU doesn't just sit there doing nothing until a timer interrupt. On most architectures, including x86-64, switching to kernel mode takes tens to hundreds of cycles, but not because the CPU is waiting for anything. It's just a slow operation.
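One rough way to see this for yourself is to time a tight loop of cheap system calls. The sketch below assumes Linux with glibc; ns_per_syscall is a made-up helper name, and getpid is used only because it does almost no work in the kernel, so the measured time is mostly the user/kernel transition itself.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  // SYS_getpid
#include <time.h>         // clock_gettime, CLOCK_MONOTONIC
#include <unistd.h>       // syscall()

// Hypothetical helper: average wall-clock nanoseconds per getpid system call.
// Each syscall() below enters the kernel right away; nothing waits for a tick.
double ns_per_syscall(long iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        syscall(SYS_getpid);   // cheap syscall, so the mode switch dominates
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling ns_per_syscall(1000000) and printing the result gives a per-call cost that varies with the CPU and kernel configuration, but it should be far below the period of any timer tick.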


Note that glibc provides function wrappers around nearly every syscall, so if you look at disassembly you'll just see a normal-looking function call.
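For a system call that has no dedicated wrapper, glibc also exposes the generic syscall() function, which takes the call number as its first argument. A minimal sketch, assuming Linux with glibc (raw_getpid and wrapped_getpid are illustrative names, not glibc functions):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   // SYS_getpid
#include <unistd.h>        // getpid(), syscall()

// Illustrative helper: ask the kernel for the PID through the generic
// syscall() entry point, passing the syscall number explicitly.
long raw_getpid(void) {
    return syscall(SYS_getpid);
}

// Illustrative helper: the same request through glibc's dedicated wrapper.
long wrapped_getpid(void) {
    return (long)getpid();
}
```

Both functions return the same value; either way, the compiler output contains only an ordinary call instruction.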


What really happens (x86-64 as an example):

See the AMD64 SysV ABI docs, linked from the tag wiki. It specifies which registers to put args in, and that system calls are made with the syscall instruction. Intel's insn ref manual (also linked from the tag wiki) documents in full detail every change that syscall makes to the architectural state of the CPU. If you're interested in the history of how it was designed, I dug up some interesting mailing list posts from the amd64 mailing list between AMD architects and kernel devs. AMD updated the behaviour before the release of the first AMD64 hardware so it was actually usable for Linux (and other kernels).

32bit x86 uses the int 0x80 instruction for syscalls, or sysenter. syscall isn't available in 32bit mode on Intel CPUs, and sysenter isn't available in 64bit mode on AMD CPUs, so neither is portable outside its usual mode. You can run int 0x80 in 64bit code, but you still get the 32bit ABI that treats pointers as 32bit (i.e. don't do it). BTW, perhaps you were confused about syscalls having to wait for interrupts because of int 0x80? Running that instruction fires that interrupt on the spot, jumping right to the interrupt handler. 0x80 is not an interrupt that hardware can trigger, either, so that interrupt handler only ever runs after a software-triggered system call.


AMD64 syscall example:

#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h>    // for __NR_write

const char msg[]="hello world!\n";

ssize_t amd64_write(int fd, const char*msg, size_t len) {
  ssize_t ret;
  asm volatile("syscall"  // volatile because we still need the side-effect of making the syscall even if the result is unused
               : "=a"(ret)                   // outputs
               : [callnum]"a"(__NR_write),   // inputs: syscall number in rax,
                "D" (fd), "S"(msg), "d"(len)    // and args, in same regs as the function calling convention
               : "rcx", "r11",               // clobbers: syscall always destroys rcx/r11, but Linux preserves all other regs
                 "memory"                    // "memory" to make sure any stores into buffers happen in program order relative to the syscall 
              );
  return ret;   // raw kernel return value: byte count, or -errno on failure
}

int main(int argc, char *argv[]) {
    amd64_write(1, msg, sizeof(msg)-1);
    return 0;
}

int glibcwrite(int argc, char**argv) {
    write(1, msg, sizeof(msg)-1);  // don't write the trailing zero byte
    return 0;
}

This compiles to the following asm output on the Godbolt Compiler Explorer:

gcc's -masm=intel output is somewhat MASM-like, in that it uses the OFFSET keyword to get the address of a label.

.rodata
msg:
        .string "hello world!\n"

.text
main:   # using an inline syscall
        mov     eax, 1    # __NR_write
        mov     edx, 13   # string length
        mov     esi, OFFSET FLAT:msg      # string pointer
        mov     edi, eax  # file descriptor = 1 happens to be the same as __NR_write
        syscall
        xor     eax, eax  # zero the return value
        ret

glibcwrite:  # using the normal way that you get from compiler output
        sub     rsp, 8       # keep the stack 16B-aligned for the function call
        mov     edx, 13      # put args in registers
        mov     esi, OFFSET FLAT:msg
        mov     edi, 1
        call    write
        xor     eax, eax
        add     rsp, 8
        ret

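glibc's write wrapper function just puts 1 in eax and runs syscall, then checks the return value and sets errno if the kernel returned an error. The cmp/jne before the syscall looks like glibc's check for a multi-threaded process, which takes a slower path so that write can act as a thread cancellation point. (Restarting after a signal is handled by the kernel when SA_RESTART is set, not by this wrapper.)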

// objdump -R -Mintel -d /lib/x86_64-linux-gnu/libc.so.6
...
00000000000f7480 <__write>:
   f7480:       83 3d f9 27 2d 00 00    cmp    DWORD PTR [rip+0x2d27f9],0x0        # 3c9c80 <argp_program_version_hook+0x1f8>
   f7487:       75 10                   jne    f7499 <__write+0x19>
   f7489:       b8 01 00 00 00          mov    eax,0x1
   f748e:       0f 05                   syscall
   f7490:       48 3d 01 f0 ff ff       cmp    rax,0xfffffffffffff001   // -4095: unsigned >= this means rax holds -errno (-4095..-1)
   f7496:       73 31                   jae    f74c9 <__write+0x49>
   f7498:       c3                      ret
   ... more code to handle cases where one of those branches was taken
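The return-value handling in that fast path can be sketched in C. This is only an illustration of the Linux convention that a raw syscall returns a value in the range -4095..-1 to signal -errno; it is not glibc's actual source, and raw_syscall3 and sketch_write are made-up names. It assumes x86-64 Linux and a GNU-compatible compiler.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>   // SYS_write
#include <sys/types.h>     // ssize_t

// Made-up helper: issue a 3-argument system call and return the raw result.
// x86-64 Linux only: number in rax, args in rdi/rsi/rdx, rcx/r11 clobbered.
static long raw_syscall3(long nr, long a1, long a2, long a3) {
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
}

// Made-up wrapper mimicking the libc convention: -1 plus errno on failure.
ssize_t sketch_write(int fd, const void *buf, size_t len) {
    long ret = raw_syscall3(SYS_write, fd, (long)buf, (long)len);
    // Same test as "cmp rax, -4095 / jae": an unsigned compare catches
    // exactly the values -4095..-1, which the kernel uses for -errno.
    if ((unsigned long)ret >= (unsigned long)-4095L) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}
```

For example, sketch_write(-1, "x", 1) returns -1 and leaves errno set to EBADF, because the kernel's raw return value for a bad file descriptor is -EBADF.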

Upvotes: 8

Brian Cain

Reputation: 14619

syscalls don't happen immediately but on certain CPU ticks or interrupts

Certainly the effect of your syscall could depend on many things, including ticks: for example, scheduler granularity or timer resolution could be limited to the tick period. But the call itself should happen "immediately" (inline with execution).

How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?

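It probably varies slightly between architectures, but in general libc places the syscall number and arguments in registers and then executes a trap instruction (syscall on x86-64) that switches the CPU to kernel mode at an entry point the kernel chose. On Linux the kernel does not keep using the user stack: its entry code switches to a separate per-thread kernel stack and restores the user stack pointer before returning.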

For additional details, see: "How system calls work on x86 linux"

Upvotes: 5
