Reputation: 91830
I've been studying a lot about the ABI for x86-64, writing Assembly, and studying how the stack and heap work.
Given the following code:
#include <linux/seccomp.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
// execute the seccomp syscall (could be any syscall)
seccomp(...);
return 0;
}
In Assembly for x86-64, this would do the following:
seccomp
call seccomp
When seccomp returns, it's likely that C will call exit(0), as far as I know.
I'd like to talk about what happens between steps three and four above.
I currently have my stack for the currently running process with its own data in registers and on the stack. How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?
I believe I heard somewhere that syscalls don't happen immediately but on certain CPU ticks or interrupts. Is this true? How does this happen, for example, on Linux?
Upvotes: 4
Views: 1103
Reputation: 365707
syscalls don't happen immediately but on certain CPU ticks or interrupts
Totally wrong. The CPU doesn't just sit there doing nothing until a timer interrupt. On most architectures, including x86-64, switching to kernel mode takes tens to hundreds of cycles, but not because the CPU is waiting for anything. It's just a slow operation.
Note that glibc provides function wrappers around nearly every syscall, so if you look at disassembly you'll just see a normal-looking function call.
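As a rough illustration of that point, here is a minimal sketch that makes the same system call through glibc's generic syscall() function, which takes the syscall number as its first argument. The name raw_getpid is made up for this example; SYS_getpid comes from <sys/syscall.h>.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   // SYS_getpid
#include <unistd.h>        // syscall(), getpid()

// Hypothetical helper: ask the kernel for our PID through glibc's
// generic syscall() entry point instead of the named getpid() wrapper.
long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() should give the same value as getpid(). In a disassembly, both show up as ordinary calls into libc rather than as an inline syscall instruction.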
See the AMD64 SysV ABI docs, linked from the x86 tag wiki. They specify which registers to put args in, and that system calls are made with the syscall instruction. Intel's insn ref manual (also linked from the tag wiki) documents in full detail every change that syscall makes to the architectural state of the CPU. If you're interested in the history of how it was designed, I dug up some interesting mailing list posts from the amd64 mailing list between AMD architects and kernel devs. AMD updated the behaviour before the release of the first AMD64 hardware so it was actually usable for Linux (and other kernels).
32bit x86 uses the int 0x80 instruction for syscalls, or sysenter. syscall isn't available in 32bit mode on Intel CPUs, and sysenter isn't available in 64bit mode on AMD CPUs, so neither is portable outside its usual mode. You can run int 0x80 in 64bit code, but you still get the 32bit API that treats pointers as 32bit (i.e. don't do it).
BTW, perhaps you were confused about syscalls having to wait for interrupts because of int 0x80? Running that instruction fires that interrupt on the spot, jumping right to the interrupt handler. 0x80 is not an interrupt that hardware can trigger, either, so that interrupt handler only ever runs after a software-triggered system call.
#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h>   // for __NR_write

const char msg[] = "hello world!\n";

ssize_t amd64_write(int fd, const char *msg, size_t len) {
    ssize_t ret;
    asm volatile("syscall"   // volatile because we still need the side-effect of making the syscall even if the result is unused
        : "=a"(ret)                     // outputs
        : [callnum]"a"(__NR_write),     // inputs: syscall number in rax,
          "D"(fd), "S"(msg), "d"(len)   // and args, in same regs as the function calling convention
        : "rcx", "r11",                 // clobbers: syscall always destroys rcx/r11, but Linux preserves all other regs
          "memory"                      // "memory" to make sure any stores into buffers happen in program order relative to the syscall
    );
    return ret;   // bytes written, or a negative errno value on failure
}

int main(int argc, char *argv[]) {
    amd64_write(1, msg, sizeof(msg) - 1);
    return 0;
}

int glibcwrite(int argc, char **argv) {
    write(1, msg, sizeof(msg) - 1);   // don't write the trailing zero byte
    return 0;
}
This compiles to the following asm output on the Godbolt Compiler Explorer. gcc's -masm=intel output is somewhat MASM-like, in that it uses the OFFSET keyword to get the address of a label.
    .section .rodata
msg:
    .string "hello world!\n"

    .text
main:                               # using an in-line syscall
    mov   eax, 1                    # __NR_write
    mov   edx, 13                   # string length
    mov   esi, OFFSET FLAT:msg      # string pointer
    mov   edi, eax                  # file descriptor = 1 happens to be the same as __NR_write
    syscall
    xor   eax, eax                  # zero the return value
    ret

glibcwrite:                         # using the normal way that you get from compiler output
    sub   rsp, 8                    # keep the stack 16B-aligned for the function call
    mov   edx, 13                   # put args in registers
    mov   esi, OFFSET FLAT:msg
    mov   edi, 1
    call  write
    xor   eax, eax
    add   rsp, 8
    ret
glibc's write wrapper function just puts 1 in eax and runs syscall, then checks the return value and sets errno on failure. The extra branches appear to handle the multi-threaded case, where write has to act as a pthread cancellation point.
// objdump -R -Mintel -d /lib/x86_64-linux-gnu/libc.so.6
...
00000000000f7480 <__write>:
f7480: 83 3d f9 27 2d 00 00 cmp DWORD PTR [rip+0x2d27f9],0x0 # 3c9c80 <argp_program_version_hook+0x1f8>
f7487: 75 10 jne f7499 <__write+0x19>
f7489: b8 01 00 00 00 mov eax,0x1
f748e: 0f 05 syscall
f7490: 48 3d 01 f0 ff ff cmp rax,0xfffffffffffff001 // -4095: return values in [-4095, -1] are -errno error codes
f7496: 73 31 jae f74c9 <__write+0x49>
f7498: c3 ret
... more code to handle cases where one of those branches was taken
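The cmp rax,0xfffffffffffff001 / jae pair in the listing above is an unsigned range check on the raw return value. Here is a minimal C sketch of that step, assuming the Linux convention that the kernel reports failure as a negative errno value in the range [-4095, -1]. The names finish_syscall and MAX_ERRNO_SKETCH are made up for this example and are not glibc identifiers.

```c
#include <errno.h>

// Assumption: 4095 is the largest errno value the kernel will return.
#define MAX_ERRNO_SKETCH 4095

// Hypothetical helper: turn a raw kernel return value into the
// "-1 and errno" convention that libc wrappers expose to C code.
long finish_syscall(long raw) {
    // Same unsigned comparison as cmp rax,0xfffffffffffff001 / jae:
    // true exactly when raw is in [-4095, -1].
    if ((unsigned long)raw >= (unsigned long)-MAX_ERRNO_SKETCH) {
        errno = (int)-raw;
        return -1;
    }
    return raw;   // success: pass the value through unchanged
}
```

For example, finish_syscall(-EBADF) returns -1 with errno set to EBADF, while finish_syscall(13) returns 13 and leaves errno alone.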
Upvotes: 8
Reputation: 14619
syscalls don't happen immediately but on certain CPU ticks or interrupts
Certainly the effect of your syscall could depend on many things, including ticks. For example, scheduler granularity or timer resolution could be limited to the tick period. But the call itself should happen "immediately" (inline with execution).
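One way to check this for yourself is to time a cheap system call in a loop. The sketch below is only a rough measurement, assuming a Linux system with clock_gettime and CLOCK_MONOTONIC available. The name ns_per_getpid is made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   // SYS_getpid
#include <time.h>          // clock_gettime(), CLOCK_MONOTONIC
#include <unistd.h>        // syscall()

// Hypothetical helper: average wall-clock nanoseconds per getpid system call.
double ns_per_getpid(long iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        syscall(SYS_getpid);   // generic entry point, so each iteration enters the kernel
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

If each call had to wait for a timer tick, the average would be on the order of milliseconds. The exact figure depends on the CPU, the kernel version, and any security mitigations, but it should come out far below a tick period.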
How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?
It probably varies slightly between architectures, but in general the syscall arguments are assembled by the libc and then a trap into the kernel is generated (a software interrupt, or a dedicated instruction such as syscall on x86-64) in order to switch to kernel mode.
For additional details, see: "How system calls work on x86 linux"
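To make "assembling the arguments" concrete for x86-64 Linux, here is a minimal sketch of a generic six-argument raw system call using GCC/Clang inline asm. It assumes the x86-64 Linux convention: the syscall number goes in rax, the arguments go in rdi, rsi, rdx, r10, r8 and r9, and the syscall instruction overwrites rcx and r11. The name raw_syscall6 is made up for this example, and the code will not build on other architectures.

```c
#include <sys/syscall.h>   // SYS_* syscall numbers

// Hypothetical helper, x86-64 Linux only: place the arguments in the
// registers the kernel expects, then execute the syscall instruction.
// Returns the raw kernel result (a negative errno value on failure).
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6) {
    long ret;
    // There are no single-letter constraints for r10, r8 and r9, so pin
    // local variables to those registers explicitly.
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    __asm__ volatile("syscall"
                     : "=a"(ret)                        // result comes back in rax
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                       "r"(r10), "r"(r8), "r"(r9)
                     : "rcx", "r11", "memory");         // clobbered by syscall
    return ret;
}
```

For instance, raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0) should return the process ID. Note that the fourth argument goes in r10, not rcx as in the function calling convention, because syscall uses rcx to save the return address.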
Upvotes: 5