Reputation: 323
I read this article: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/

To set some context: the article is about the zombie process problem in containers, and it tries to convince us that it is a real problem.

Generally, I have mixed feelings. Why does it matter? After all, even if there are zombies in a container, the host OS is able to release/kill them. We know that a process in a container is, from the host OS's point of view, a normal process (in general, a process in a container is just a normal process with some namespaces and cgroups).

Moreover, we can also find advice that in order to avoid the zombie problem we should use `bash -c ...`. Why? Maybe the better option is to use `--init`?

Can someone explain these things, please?
Upvotes: 20
Views: 21815
Reputation: 4900
There are two or three related problems that potentially occur if the container does not have a proper `init` as its root process (and which may be averted by using `docker run --init` to inject a basic `init` program).
Zombie proliferation
In Linux, each process has a parent (except the root process, which is normally `init`). When a process exits, a zombie entry is retained in the kernel task table, waiting for the parent process to reap the child's exit status. (This mechanism lets the child signal whether it completed successfully or encountered an error.)
If the application uses processes dynamically (regardless of whether they are attributed to the same or to separate process groups), the circumstance can arise where a child process occasionally gets terminated while a grandchild process is still running. The kernel responds by reparenting the orphaned grandchild to the root process. So when that adopted process exits, it remains as a zombie until the root process explicitly reaps it. The problem is that if the root process is not designed to replace `init`, it will never check for adopted zombie processes to reap, so the zombies gradually accumulate. Initially this may cause distracting clutter for process monitoring tools. However, as zombies count towards the Linux system process limit, a long-running container application may eventually start failing to create new processes, resulting in unexpected (and probably untested) behaviour. (The system limit is often around 16,000 processes, but when using cgroups the host may optionally configure a much lower maximum number of process IDs per container. If the host is not well configured then the effect will be system-wide, causing other containers or services on the same host to also fail unpredictably. There might also be poorer multitasking responsiveness.)
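Docker exposes that cgroup control directly. As a sketch (the image name and the cap value here are arbitrary):

```
# Cap the pids cgroup for a single container; zombies count against the
# cap, so once (live + zombie) tasks reach it, fork() starts failing.
docker run --pids-limit 64 my-image
```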
Even if the application's high-level code does not explicitly use multiprocessing, the application may still be vulnerable if it is built on libraries that implement multiprocessing internally, say for non-blocking IO (e.g., for servers handling web requests) or for fast CPU-intensive number-crunching (e.g., for data processing).
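To see this concretely, here is a minimal sketch (image choice and timings are arbitrary):

```
# The inner shell backgrounds a child and exits immediately, so the child
# is orphaned and reparented to PID 1. `exec` makes PID 1 become `sleep 300`,
# which never calls wait(), so the dead child lingers as a zombie.
docker run -d --name zombie-demo alpine sh -c 'sh -c "sleep 2 &"; exec sleep 300'
sleep 5
docker top zombie-demo    # the never-reaped sleep shows up as defunct
docker rm -f zombie-demo
```

Running the same command with `docker run --init` leaves no zombie behind: the orphan is reparented to the injected init, which reaps it as soon as it exits.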
Ungraceful shutdown
There are a couple of signalling issues that interfere with graceful shutdown of the container, potentially causing errors and data corruption.
The request for a container to safely shut itself down is made by sending the SIGTERM signal to the root process. Once the root process exits, or if the root process has still not exited after a reasonable grace period (e.g., after half a minute in Kubernetes), all processes remaining in the container will be forcibly killed (via the SIGKILL signal which gets handled directly by the kernel). It is preferable to avoid needing to resort to SIGKILL, because it deprives processes of the opportunity to clean up (e.g., to close connections and flush memory buffers to storage volumes) and so is prone to result in data loss or corruption and communication errors.
In Linux, each process holds a disposition for each kind of signal. The disposition specifies whether the signal will be ignored, or invoke a handler, or trigger a default action (which for most signals is to terminate the process and leave an exit code specific to the signal). However, Linux disables default actions for PID#1. This means that by default when a normal process is sent SIGINT or SIGTERM it will terminate, but if the same program runs as the root process then it will ignore those signals and keep running, unless the program sets up a custom handler for those signals. (Note SIGINT is usually triggered by CTRL-C in an interactive session.) Thus, even a single-process container may exhibit misbehaviour such as silently refusing to shut itself down gracefully, and being non-responsive to user interrupts.
For a multiprocess application to shutdown gracefully, the parent process will likely install a handler to forward any SIGTERM on to the child processes that the parent created. However, if it is running as the root process then it additionally needs to check for adopted child processes and forward SIGTERM to these as well. For example, if there are service daemons spawned within the container and the root process was not designed to function as `init`, then those services may never get passed the signal to gracefully shut down.
Furthermore, even if the SIGTERM does propagate to all processes, the container root process needs to wait until there are no child processes remaining before it exits. Otherwise, remaining child processes will get killed before they can finish cleaning up. (Note waiting is nonessential for other parent processes because, after SIGTERM has been forwarded, the child process normally can still finish running after it is orphaned. This issue only concerns premature exits by a container root process.)
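To illustrate, a shell entrypoint designed with this in mind might look something like the following sketch (the service binary name is made up; note this still doesn't reap adopted orphans, which is exactly the gap an init fills):

```
#!/bin/sh
# Start the (hypothetical) service in the background and remember its PID.
/usr/local/bin/my-service &
child=$!

# On SIGTERM/SIGINT: forward the signal, then wait so we don't exit (and
# trigger the kill-everything teardown) while the child is still cleaning up.
shutdown() {
  kill -TERM "$child" 2>/dev/null
  wait "$child"
}
trap shutdown TERM INT

wait "$child"
```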
Thus, the special handling of PID#1 means that if the main process was not designed to run as `init` then it can easily cause some or all of the container processes to end up getting unceremoniously killed (by neglecting to signal them and/or fouling the container shutdown flow).
Upvotes: 1
Reputation: 74851
For a brief but useful explanation of what an init process gives you, look at tini, which is what Docker uses when you specify `--init`.
Using Tini has several benefits:
- It protects you from software that accidentally creates zombie processes, which can (over time!) starve your entire system for PIDs (and make it unusable).
- It ensures that the default signal handlers work for the software you run in your Docker image. For example, with Tini, SIGTERM properly terminates your process even if you didn't explicitly install a signal handler for it.
Both these issues affect containers. A process in a container is still a process on the host, so it takes up a PID on the host. Whatever you run in a container is PID 1 inside the container's namespace, which means it has to install a signal handler for a signal like SIGTERM to receive it at all.
Bash happens to have a process reaper included, so running a command under `bash -c` can protect against zombies. Bash won't handle signals by default as PID 1 unless you `trap` them.
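For example (a sketch; any image containing bash will do):

```
# As PID 1, bash reaps adopted zombies but ignores SIGTERM unless trapped.
# Without the trap, `docker stop` stalls for the grace period, then SIGKILLs;
# with it, bash exits promptly and the container shuts down cleanly.
docker run --rm debian bash -c 'trap "exit 0" TERM INT; sleep 300 & wait'
```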
The first thing to understand is that an `init` process doesn't magically remove zombies. A (normal) `init` is designed to reap zombies when the parent process that failed to wait on them exits and the zombies hang around; the `init` process then becomes the zombies' parent, and they can be cleaned up.
Next, a container is a cgroup of processes running in their own PID namespace. This cgroup is cleaned up when the container is stopped, so any zombies that are in a container are removed on stop. They don't reach the host's `init`.
Third is the different ways containers are used. Most run one main process and nothing else. If another process is spawned, it is usually a child of that main process. So until the parent exits, the zombie will exist; then see point 2 (the zombies will be cleared on container exit). Consider:

- Running a Node.js, Go or Java app server in a container tends not to rely heavily on forking or spawning of processes.
- Running something like a Jenkins worker that spawns large numbers of ad hoc jobs involving shells can be a lot worse, but the container is ephemeral, so it exits regularly and cleans up.
- Running a Jenkins master that also spawns jobs is the type of workload that could present a problem without a zombie reaper: the container may hang around for a long time and accumulate a number of zombie processes.
The other role an init process can provide is to install signal handlers, so signals sent from the host can be passed on to the container process. PID 1 is a bit special: the process must explicitly listen for a signal for that signal to have any effect. If you can install a `SIGINT` and `SIGTERM` signal handler in your PID 1 process, then an init process doesn't add much here.
Multiple processes should be run under an init process. The init manages how they are launched when the container starts, what is required for the container to actually count as "running" for the service it represents, and how a stop should be passed on to each process when the container shuts down. You may want a more traditional init system though; s6 via s6-overlay provides a number of useful container features for multi-process management.
This matters especially when processes are children of children or beyond. The CI worker (like Jenkins) example is the first that comes to mind, where Java spawns commands or shells that spawn other commands.
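As a deliberately naive sketch (the service names are made up), running several processes under a plain shell shows both what `--init` buys you and what it doesn't:

```
# tini (via --init) reaps any zombies, but the plain shell does not forward
# SIGTERM to the services, so they are still killed ungracefully on stop.
# That remaining gap is what a real supervisor such as s6-overlay fills.
docker run --init my-image sh -c 'service-a & service-b & wait'
```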
`sleep` is a simple example of the signal problem: `docker run busybox sleep 60` can't be interrupted with ctrl-c or stopped; it will be killed after the default 10 second `docker stop` timeout. `docker run --init busybox sleep 60` works as expected.
`tini` is pretty minimal overhead and widely used, so why not use `--init` most of the time?
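And if you control the image rather than every `docker run` invocation, you can bake tini in instead; a minimal sketch (the Alpine package installs it at /sbin/tini):

```
FROM alpine:3.19
RUN apk add --no-cache tini
# tini runs as PID 1 and executes the real command as its child
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["sleep", "60"]
```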
For more details see this github comment which answers the "why?" question from the creator of tini.
Upvotes: 36
Reputation: 1327784
I referenced that article in "Use of Supervisor in docker"
Since Sept. 2016 and docker 1.12, `docker run --init` helps fight zombie processes by adding an `init` process.
That typically solves the following issue:

We can't use `docker start` as we need to pass things like port mappings and env vars, so we use `docker run`. But when upstart sends `SIGINT` to the `docker run` client process, the container doesn't die, just the client does. Then when upstart goes to start it back up, it's already running, and the port mapping fails.
Or this issue:
Docker seems to hang when spawning child processes inside executed scripts.
Basically, you want the docker container to kill all sub-processes, in order to clean up resources (ports, file handles, ...) used by said sub-processes.
Upvotes: 4