mmk

Reputation: 605

Why can't kernel code use a Red Zone

When building a 64-bit kernel for the x86_64 platform, it is highly recommended to instruct the compiler not to use the 128-byte red zone that the user-space ABI provides. (For GCC, the compiler flag is -mno-red-zone.)

The kernel would not be interrupt-safe if the red zone were enabled.

But why is that?

Upvotes: 29

Views: 7533

Answers (4)

08822407d

Reputation: 108

Here is an example that illustrates this quote from Wikipedia:

The red zone is well-known to cause problems for x86-64 kernel developers, as the CPU itself doesn't respect the red zone when calling interrupt handlers. This leads to a subtle kernel breakage as the ABI contradicts the CPU behavior.

In my kernel, I use the Linux memcpy() C function:

void *memcpy(void *dest, const void *src,
                size_t count)
{
    char *tmp = dest;
    const char *s = src;

    while (count--)
        *tmp++ = *s++;
    return dest;
}

And the disassembly is:

0000000000000000 <memcpy>:
   0:   f3 0f 1e fa             endbr64 
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   48 8d 05 f9 ff ff ff    lea    -0x7(%rip),%rax        # 8 <memcpy+0x8>
   f:   49 bb 00 00 00 00 00    movabs $0x0,%r11
  16:   00 00 00 
  19:   4c 01 d8                add    %r11,%rax
  1c:   48 89 7d e8             mov    %rdi,-0x18(%rbp)
  20:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
  24:   48 89 55 d8             mov    %rdx,-0x28(%rbp)
  28:   48 8b 45 e8             mov    -0x18(%rbp),%rax
  2c:   48 89 45 f8             mov    %rax,-0x8(%rbp)
  30:   48 8b 45 e0             mov    -0x20(%rbp),%rax
  34:   48 89 45 f0             mov    %rax,-0x10(%rbp)
  38:   eb 1d                   jmp    57 <memcpy+0x57>
  3a:   48 8b 55 f0             mov    -0x10(%rbp),%rdx
  3e:   48 8d 42 01             lea    0x1(%rdx),%rax
  42:   48 89 45 f0             mov    %rax,-0x10(%rbp)
  46:   48 8b 45 f8             mov    -0x8(%rbp),%rax
  4a:   48 8d 48 01             lea    0x1(%rax),%rcx
  4e:   48 89 4d f8             mov    %rcx,-0x8(%rbp)
  52:   0f b6 12                movzbl (%rdx),%edx
  55:   88 10                   mov    %dl,(%rax)
  57:   48 8b 45 d8             mov    -0x28(%rbp),%rax
  5b:   48 8d 50 ff             lea    -0x1(%rax),%rdx
  5f:   48 89 55 d8             mov    %rdx,-0x28(%rbp)
  63:   48 85 c0                test   %rax,%rax
  66:   75 d2                   jne    3a <memcpy+0x3a>
  68:   48 8b 45 e8             mov    -0x18(%rbp),%rax
  6c:   5d                      pop    %rbp
  6d:   c3                      retq

Note the instructions at 1c to 24: the three arguments are stored on the stack with "mov" rather than "push". Likewise, the instructions at 2c and 34 store the two local variables.

Now for the problem. I compiled my x86_64 kernel on Ubuntu with GCC's default x86-64 ABI (System V AMD64, which implies a red zone). When execution reaches this function, called from exec, it is certain to trigger copy-on-write, which means a page-fault exception is raised first. The variable addresses and %RSP look like this: [screenshot of debug session 1]

You can see that %RSP sits immediately ABOVE the stored arguments and local variables. So guess what happens when an exception is raised on an x86_64 machine: the CPU automatically saves at least 5 registers on the stack, and they overwrite the arguments and local variables.
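To make "at least 5 registers" concrete, here is a minimal sketch of the frame a CPU pushes in 64-bit mode, plus a check of which offsets below %rbp it covers. The struct and function names are made up for illustration, and the sketch assumes the fault arrives while %rsp equals %rbp, as in the red-zone build of memcpy() above.

```c
#include <assert.h>
#include <stdint.h>

/* What the CPU pushes on exception entry in 64-bit mode, lowest address
 * first. Names are illustrative, not taken from any kernel. */
struct fault_frame {
    uint64_t error_code; /* only pushed by some exceptions, e.g. page fault */
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

static_assert(sizeof(struct fault_frame) == 48, "six quadwords");

/* Lowest address written when a page fault arrives with the stack pointer
 * at `rsp`. The CPU aligns the stack down to 16 bytes before pushing. */
static uint64_t frame_bottom(uint64_t rsp)
{
    return (rsp & ~(uint64_t)0xF) - sizeof(struct fault_frame);
}

/* Nonzero if the quadword stored at rbp - offset is overwritten by the
 * frame, assuming the fault hits while rsp == rbp (red zone in use). */
static int local_is_clobbered(uint64_t rbp, uint64_t offset)
{
    uint64_t local = rbp - offset;
    return local >= frame_bottom(rbp) && local < (rbp & ~(uint64_t)0xF);
}
```

With a 16-byte-aligned %rbp, every slot the disassembly uses (-0x8 through -0x28) falls inside the pushed frame.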

Then I compiled it with the -mno-red-zone option. The beginning of the disassembly:

0000000000000000 <memchr>:
   0:   f3 0f 1e fa             endbr64 
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   48 83 ec 28             sub    $0x28,%rsp
   c:   48 8d 05 f9 ff ff ff    lea    -0x7(%rip),%rax        # c <memchr+0xc>

Note the difference from the former? It reserves stack space for the arguments and local variables with

8:   48 83 ec 28             sub    $0x28,%rsp

And the result when running: [screenshot of debug session 2] Now %RSP is BELOW the arguments and local variables.

So the core reason is this: a leaf function normally has no need to move %RSP to the top of the stack, so with the red-zone mechanism %RSP is not adjusted. But in a kernel, kernel code and exception/interrupt code share the kernel stack (unless you prepare a separate stack for exceptions/interrupts, which on x86_64 CPUs is the IST), so when a leaf function is interrupted, its arguments and local variables are overwritten.
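The effect can be reproduced in plain user-space C with a byte array standing in for the kernel stack. This is only a toy model under stated assumptions: the indices play the role of %rbp and %rsp, the 40-byte frame is SS, RSP, RFLAGS, CS and RIP without an error code, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES  256
#define FRAME_BYTES  40   /* SS, RSP, RFLAGS, CS, RIP */
#define LOCALS_BYTES 40   /* 0x28, as in the disassembly above */

/* Simulate a leaf function that keeps its locals just below the frame
 * base and then takes an interrupt on the same stack. `reserve` is how
 * far the function moved the stack pointer down first (0 means the red
 * zone is in use). Returns 1 if the locals survive, 0 if overwritten. */
static int locals_survive(size_t reserve)
{
    uint8_t stack[STACK_BYTES];
    size_t base = 128;           /* index playing the role of %rbp */
    size_t sp = base - reserve;  /* index playing the role of %rsp */

    memset(stack, 0, sizeof stack);
    memset(&stack[base - LOCALS_BYTES], 0xAA, LOCALS_BYTES); /* locals */

    /* The "CPU" aligns the stack pointer and pushes its frame below it. */
    sp &= ~(size_t)0xF;
    memset(&stack[sp - FRAME_BYTES], 0xFF, FRAME_BYTES);

    for (size_t i = base - LOCALS_BYTES; i < base; i++)
        if (stack[i] != 0xAA)
            return 0;
    return 1;
}
```

Calling locals_survive(0) models the red-zone build, and locals_survive(0x28) models the build with "sub $0x28,%rsp".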

Upvotes: 1

mevets

Reputation: 10445

It is possible to use a red zone in kernel-type contexts. The IDT entry can specify a stack index (IST) of 0..7, where 0 is a bit special. The TSS contains a table of these stacks. Stacks 1..7 are loaded and used for the initial registers saved by the exception/interrupt, and do not nest. If you partition the various exception entries by priority (e.g. NMI is the highest and can happen at any time) and treat these stacks as trampolines, you can safely handle red zones in kernel-type contexts. That is, you can subtract 128 from the saved stack pointer to get a usable kernel stack before enabling interrupts or running code which can cause exceptions.

The zero-index stack behaves in a more conventional manner, pushing the stack pointer, flags, PC, and error code onto the existing stack when there is no privilege transition.

The code in the trampoline has to be careful (duh, it is a kernel) not to generate other exceptions while it sanitizes the machine state, but provides a nice, safe spot to detect pathological kernel nesting, stack corruption, etc... [ sorry to respond so late, noticed this while searching for something else].
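For reference, the IST index lives in the low three bits of one byte of the 64-bit IDT gate descriptor. The following is a rough sketch of filling in such a gate. The struct, field and function names are illustrative rather than from any particular kernel, and the TSS setup that supplies the actual stack addresses is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit IDT gate descriptor (16 bytes). Field names are illustrative. */
struct idt_gate {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* kernel code segment selector */
    uint8_t  ist;          /* bits 0..2: IST index, 0 = stay on current stack */
    uint8_t  type_attr;    /* present bit, DPL, gate type */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;
};

static_assert(sizeof(struct idt_gate) == 16, "gate must be 16 bytes");

/* Build an interrupt gate that asks the CPU to switch to IST stack `ist`
 * (1..7) before pushing the interrupt frame, or to keep the current
 * stack when `ist` is 0. */
static struct idt_gate make_gate(uint64_t handler, uint16_t selector,
                                 uint8_t ist)
{
    struct idt_gate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.ist         = ist & 0x7;
    g.type_attr   = 0x8E;  /* present, DPL 0, 64-bit interrupt gate */
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    g.reserved    = 0;
    return g;
}
```

With a non-zero IST index the CPU's own pushes land on the trampoline stack, so the interrupted code's red zone is untouched until the handler decides where to continue.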

Upvotes: 21

Peter Cordes

Reputation: 364287

In kernel-space, you're using the same stack that interrupts use. When an interrupt happens, the CPU pushes a return address and RFLAGS. This clobbers 16 bytes below rsp. Even if you wanted to write an interrupt-handler that assumed the full 128 bytes of the red-zone were valuable, it would be impossible.


You could maybe have a kernel-internal ABI that had a small red-zone from rsp-16 to rsp-48 or something. (Small because kernel stack is valuable, and most functions don't need very much red-zone anyway.)

Interrupt handlers would have to sub rsp, 32 before pushing any registers (and restore it before iret).

This idea won't work if an interrupt handler can itself be interrupted before it runs sub rsp, 32, or after it restores rsp before an iret. There would be a window of vulnerability where valuable data is at rsp .. rsp-16.


Another practical problem with this scheme is that AFAIK gcc doesn't have configurable red-zone parameters. It's either on or off. So you'd have to add support for a kernel flavour of red-zone to gcc / clang if you wanted to take advantage of it.

Even if it was safe from nested interrupts, the benefits are pretty small. The difficulty of proving it's safe in a kernel might make it not worth it. (And as I said, I'm not at all sure it can be implemented safely, because I think nested interrupts are possible.)


(BTW, see the tag wiki for links to the ABI documenting the red-zone, and other stuff.)

Upvotes: 16

qdot

Reputation: 6335

Quoting from the AMD64 ABI:

The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.

Essentially, it's an optimization - the userland compiler knows exactly how much of the Red Zone is used at any given time (in the simplest implementation, the entire size of local variables) and can adjust the %rsp accordingly before calling a sub-function.

Especially in leaf functions, this can yield some performance benefits of not having to adjust %rsp as we can be certain no unfamiliar code would run while in the function. (POSIX Signal Handlers might be seen as a form of a co-routine, but you can instruct the compiler to adjust the registers before using stack variables in a signal handler).
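As a small illustration, here is the kind of leaf function the ABI text has in mind. Whether a given compiler really places the locals below %rsp depends on the compiler version and optimization level, so treat this as something to compile to assembly yourself, with and without -mno-red-zone, rather than a guaranteed outcome. The function name is arbitrary.

```c
#include <assert.h>

/* A leaf function: it calls nothing, so under the System V AMD64 ABI the
 * compiler is allowed to keep `partial` and `total` in the 128 bytes
 * below %rsp instead of adjusting %rsp in the prologue and epilogue. */
long sum_of_squares(long a, long b, long c)
{
    long partial = a * a + b * b;
    long total = partial + c * c;
    return total;
}
```

In an unoptimized build the locals typically show up as negative offsets from %rbp, much like the memcpy() listing in another answer here.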

In the kernel space, once you start thinking about interrupts, if those interrupts make any assumptions about %rsp, they will likely be incorrect - there is no certainty with regards to the utilization of the Red Zone. So, you either assume all of it is dirty, and needlessly waste stack space (effectively running with a 128-byte guaranteed local variable in every function), or, you guarantee that the interrupts make no assumptions about %rsp - which is tricky.
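The "assume all of it is dirty" option boils down to one subtraction. Here is a minimal sketch with a hypothetical helper name. Note that it only helps code that gets to choose its own stack pointer: the CPU's automatic pushes happen before any handler code runs, so those would have to land on a separate stack for this to be enough.

```c
#include <assert.h>
#include <stdint.h>

#define RED_ZONE_BYTES 128

/* Highest stack pointer value a handler may use without touching the
 * interrupted code's red zone, rounded down to 16-byte alignment. */
static uint64_t first_safe_sp(uint64_t interrupted_rsp)
{
    return (interrupted_rsp - RED_ZONE_BYTES) & ~(uint64_t)0xF;
}
```

This is the wasted space the paragraph above refers to: 128 bytes skipped on every entry, whether or not the interrupted function used any of it.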

In user space, context switches + 128-byte overallocation of stack handle it for you.

Upvotes: 18
