Reputation: 11
I am reading Michael Kerrisk's "Namespaces in operation" series (as I want to implement a container by myself in Linux), and I found myself wondering about something:
In one of Michael's examples of PID namespace usage, he wrote the following program: https://lwn.net/Articles/532745/
Here, the child's stack is a statically allocated buffer, which Michael's own comments in the code describe as:
/* Space for child's stack */
/* Since each child gets a copy of virtual memory, this
buffer can be reused as each child creates its child */
As I understand it, this is an attempt to take advantage of the copy-on-write (COW) mechanism in UNIX systems.
Now, it appears that there's a bug in this code, and the bug is (by Michael's own explanation):
namespaces/multi_pidns.c
Allocate stacks for the child processes on the heap rather
than in static memory. Marcos Paulo de Souza pointed out
that the children were being killed by SIGSEGV after they
had completed the sleep() calls. (Some further investigation
showed that all children except the *last* are killed with
SIGSEGV.) It appears that they are killed after the child
start function returns. The problem goes away if the
children are allocated stacks in separate memory areas by
calling malloc() (which is the change made in this patch)
or in separate statically allocated buffers.
The reason that the children were killed is based on (my
misunderstanding of) the subtleties of the magic done
in the glibc clone() wrapper function. (See, for example,
the x86-64 implementation in the glibc source file
sysdeps/unix/sysv/linux/x86_64/clone.S.) The
previous code was relying on the fact that the parent's
memory was duplicated in the child during the clone() system
call, and the assumption that that duplicated memory could be
used in the child. However, before executing the clone()
system call, the clone() wrapper function saves some
information (that will be used by the child) onto the stack.
This happens in the address space of the parent, before the
memory is duplicated in the system call. Since the previous
code was making use of the same statically allocated buffer
(i.e., the same address as was used for the parent's stack)
for the child stack, the consequence was that the steps in
the clone() wrapper function were corrupting the stack of the
*parent* process, which ultimately resulted in (all but the
last of) the child processes crashing.
The fixed program is here: https://man7.org/tlpi/code/online/dist/namespaces/multi_pidns.c.html
I feel like I am way off in my understanding of how the stack corruption happens. Why would it happen if each child process has its own copy of the allocated stack, which it can write to without corrupting its parent's stack?
I looked for a more thorough explanation by Michael or elsewhere on the internet, asked ChatGPT, and read about the clone() system call and about stack allocation with respect to COW and reusing stacks, but couldn't find a valid answer.
Upvotes: 1
Views: 34
Reputation: 782785
I haven't looked at the glibc code, but I think this is what it's describing:
The clone() function takes a pointer to the memory that should be used as the child process's stack. It assumes that this memory will only be used for that purpose, and it will only be treated as a stack in the child process after it's spawned by the clone() system call. So it uses it to hold some temporary data while it's working.
What's happening is that when one of the child processes creates a new child, its own stack is at the same memory location as this static buffer. So when clone() stores temporary data there, it's overwriting some fields in the process's own stack.
What's confusing is the use of "parent" in that explanation. It's not a problem for the original parent process, but the program creates a hierarchy of processes recursively, and the problem happens in these nested processes because the static buffer overlaps their stacks.
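One way to see the wrapper's write from the calling side is the rough sketch below. It assumes Linux with glibc, where (as the quoted changelog describes for x86-64) the clone() wrapper stores data for the child on the new stack before making the system call. The names child_fn and parent_bytes_touched_by_clone are made up for this illustration. The caller fills a heap buffer with a sentinel byte, passes it as the child stack without CLONE_VM, and afterwards counts how many bytes of its own copy changed. Anything the child writes lands in the child's copy-on-write copy, so any change visible to the caller must have been made in the caller's address space before the system call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)
#define SENTINEL 0xAA

/* Keep the top of the buffer 16-byte aligned (malloc's result is aligned too). */
static_assert(STACK_SIZE % 16 == 0, "stack size must be a multiple of 16");

/* Hypothetical child start function: does nothing and exits with status 0. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/*
 * Hypothetical helper: returns how many bytes of the CALLER's copy of the
 * child stack buffer no longer hold the sentinel after clone(), or -1 on error.
 */
long parent_bytes_touched_by_clone(void)
{
    unsigned char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;
    memset(stack, SENTINEL, STACK_SIZE);

    int dummy = 0;

    /* No CLONE_VM: the child gets its own copy-on-write copy of memory.
       The stack grows downward, so pass the address just past the buffer. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, &dummy);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, NULL, 0) == -1) {
        free(stack);
        return -1;
    }

    /* Count the bytes that changed in this process's copy of the buffer. */
    long touched = 0;
    for (size_t i = 0; i < STACK_SIZE; i++) {
        if (stack[i] != SENTINEL)
            touched++;
    }
    free(stack);
    return touched;
}
```

Calling parent_bytes_touched_by_clone() from main() and printing the result should give a small nonzero count on such a system, all of it near the top of the buffer. In the original program those same bytes sit inside the live stack of a nested child that is itself running on the static buffer, which is why reusing one buffer at every level is unsafe and separate buffers per level are not.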
Upvotes: 1