About reusing parent process' stack in child processes

Question

I am reading Michael Kerrisk's "Namespaces in operation" series (I want to implement a container myself on Linux), and I found myself wondering about something:

In one of Michael's examples of PID namespace usage, he wrote the following program: https://lwn.net/Articles/532745/

Here, the child's stack is a statically allocated buffer, and Michael's own comment in the code says:

                /* Space for child's stack */
                /* Since each child gets a copy of virtual memory, this
                   buffer can be reused as each child creates its child */

As I understand it, this is an attempt to take advantage of the copy-on-write (COW) mechanism in UNIX systems.

It turns out that there is a bug in this code, which Michael himself explains as follows:

namespaces/multi_pidns.c

                Allocate stacks for the child processes on the heap rather
                than in static memory. Marcos Paulo de Souza pointed out
                that the children were being killed by SIGSEGV after they
                had completed the sleep() calls. (Some further investigation
                showed that all children except the *last* are killed with
                SIGSEGV.) It appears that they are killed after the child
                start function returns. The problem goes away if the
                children are allocated stacks in separate memory areas by
                calling malloc() (which is the change made in this patch)
                or in separate statically allocated buffers.

                The reason that the children were killed is based on (my
                misunderstanding of) the subtleties of the magic done
                in the glibc clone() wrapper function. (See, for example,
                the x86-64 implementation in the glibc source file
                sysdeps/unix/sysv/linux/x86_64/clone.S.)  The
                previous code was relying on the fact that the parent's
                memory was duplicated in the child during the clone() system
                call, and the assumption that that duplicated memory could be
                used in the child. However, before executing the clone()
                system call, the clone() wrapper function saves some
                information (that will be used by the child) onto the stack.
                This happens in the address space of the parent, before the
                memory is duplicated in the system call. Since the previous
                code was making use of the same statically allocated buffer
                (i.e., the same address as was used for the parent's stack)
                for the child stack, the consequence was that the steps in
                the clone() wrapper function were corrupting the stack of the
                *parent* process, which ultimately resulted in (all but the
                last of) the child processes crashing.

The fixed program is here: https://man7.org/tlpi/code/online/dist/namespaces/multi_pidns.c.html
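To make the fix concrete, here is a minimal sketch of the approach the explanation above describes: the child gets its own malloc()ed stack instead of reusing a buffer the parent is running on. This is not Michael's code. The helper name `run_child_on_heap_stack`, the `STACK_SIZE` value, and the use of plain `SIGCHLD` without any namespace flags (so that it runs unprivileged) are my own assumptions for illustration.

```c
#define _GNU_SOURCE              /* needed for the clone() declaration */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024) /* illustrative size, an assumption */

/* Child start function. It simply returns; this return is the point at
   which the children in the original program were killed by SIGSEGV. */
static int child_fn(void *arg)
{
    return *(int *) arg;         /* becomes the child's exit status */
}

/* Hypothetical helper: run one child on a separate heap-allocated stack.
   Returns the child's exit status, or -1 on error or abnormal exit. */
int run_child_on_heap_stack(int exit_code)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Stacks grow downward on x86-64, so pass the top of the buffer.
       Whatever the glibc wrapper stores for the child goes into this
       buffer, which is not the memory the caller is running on. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, &exit_code);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);                 /* without CLONE_VM the child has its own copy */
    if (waited == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_child_on_heap_stack(7)` from `main()` should return 7 if the child ran its start function and returned from it cleanly. In the nested case from the article, each level would allocate a fresh buffer in the same way before calling clone() for the next child.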

I feel like I am way off in my understanding of how the stack corruption happens. Why would it happen if each child process has its own copy of the allocated stack, which it can write to without corrupting its parent's stack?

I looked for a more thorough explanation from Michael or elsewhere on the internet, asked ChatGPT, and read about the clone() system call, about stack allocation with respect to COW, and about reusing stacks, but I could not find a satisfactory answer.
