beetree

Reputation: 941

Can't kill processes (originating in a docker container)

I run a Docker cluster with a few thousand containers. A few times per day, seemingly at random, a process gets "stuck" and blocks its container from stopping. Below is an example container with its corresponding process and everything I have tried to kill the container / process.

The container:

# docker ps | grep 950677e2317f
950677e2317f        7e553d1d9f6f                  "/bin/sh -c /minecraf"   2 days ago          Up 2 days           0.0.0.0:22661->22661/tcp, 0.0.0.0:22661->22661/udp, 0.0.0.0:37681->37681/tcp, 0.0.0.0:37681->37681/udp                                                                                                                                                                                       gloomy_jennings

Trying to stop the container via the Docker daemon (it just hangs; the ^C below is me giving up):

# time docker stop --time=1 950677e2317f
^C
real    0m13.508s
user    0m0.036s
sys     0m0.008s

Daemon log while trying to stop:

# journalctl -fu docker.service
-- Logs begin at Fri 2015-12-11 15:40:55 CET. --
Dec 31 23:30:33 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:33.164731953+01:00" level=info msg="POST /v1.21/containers/950677e2317f/stop?t=1"
Dec 31 23:30:34 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:34.165531990+01:00" level=info msg="Container 950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942 failed to exit within 1 seconds of SIGTERM - using the force"
Dec 31 23:30:44 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:44.165954266+01:00" level=info msg="Container 950677e2317f failed to exit within 10 seconds of kill - trying direct SIGKILL"

Looking at the processes running on the host reveals the stuck process (PID 11991 on the host machine). Note that it is in the R (running) state rather than Z (zombie), and it is burning most of a CPU core:

# ps aux | grep [1]1991
root     11991 84.3  0.0   5836   132 ?        R    Dec30 1300:19 bash -c (echo stop > /tmp/minecraft &)
# top -b | grep [1]1991
11991 root      20   0    5836    132     20 R  89.5  0.0   1300:29 bash
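For completeness, the scheduler state can be double-checked straight from /proc (a quick sketch using standard procfs files; /proc/PID/wchan is empty or 0 for a runnable task):

# grep State /proc/11991/status    # expect "R (running)" rather than "Z (zombie)"
# cat /proc/11991/wchan; echo      # kernel symbol the task is sleeping in, if any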

And it is indeed a process running inside our container (note the container ID in its mount info):

# cat /proc/11991/mountinfo
...
/var/lib/docker/containers/950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered
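Another way to confirm which container a host PID belongs to, without digging through mountinfo, is the process's cgroup file, or asking the daemon directly (a sketch; the container ID is the one from above):

# grep docker /proc/11991/cgroup    # cgroup paths contain the full container ID
# docker top 950677e2317f           # should list PID 11991 from the daemon's point of view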

Trying to kill the process yields nothing:

# kill -9 11991
# ps aux | grep [1]1991
root     11991 84.3  0.0   5836   132 ?        R    Dec30 1303:58 bash -c (echo stop > /tmp/minecraft &)
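Since SIGKILL cannot be blocked or ignored, the only way it can have no visible effect is if the task never reaches a point where the kernel delivers signals. That can also be checked from /proc (a sketch; /proc/PID/stack needs root and a kernel built with stack tracing):

# grep -E 'SigPnd|ShdPnd' /proc/11991/status    # a queued but undelivered SIGKILL shows up in these masks
# cat /proc/11991/stack                         # kernel-side stack: where the task is looping
# cat /proc/11991/syscall                       # syscall and arguments, or "running" if purely spinning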

Some overview data:

# docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64

# docker info
Containers: 189
Images: 322
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 700
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-19-generic
Operating System: Ubuntu 15.10
CPUs: 24
Total Memory: 125.8 GiB
Name: m3561.contabo.host
ID: ZM2Q:RA6Q:E4NM:5Q2Q:R7E4:BFPQ:EEVK:7MEO:YRH6:SVS6:RIHA:3I2K

# uname -a
Linux m3561.contabo.host 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Even if I stop the Docker daemon, the process still lives. The only way to get rid of it is to restart the host machine. As this happens fairly frequently (every node needs a restart every 3-7 days), it has a serious impact on the uptime of the overall cluster.
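When it happens again, it may be worth capturing kernel-side diagnostics before rebooting. The magic SysRq interface can dump backtraces of whatever is running on each CPU (a sketch; it assumes kernel.sysrq is enabled and that you read the result from dmesg):

# echo 1 > /proc/sys/kernel/sysrq    # enable SysRq if it is restricted
# echo l > /proc/sysrq-trigger       # backtrace of active CPUs - useful for a task spinning in R state
# echo w > /proc/sysrq-trigger       # list tasks stuck in uninterruptible (D) state
# dmesg | tail -n 100                # the output lands in the kernel log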

Any ideas on what to do here?

Upvotes: 4

Views: 6641

Answers (2)

Lari Hotari

Reputation: 5310

I had a similar problem, and switching to the overlay2 storage driver made it go away. Note that changing the storage driver loses all Docker state (images & containers). It seems the aufs storage driver has some issues that might be a source of lock-ups.
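For reference, on Docker versions that support it (overlay2 needs Docker 1.12+ and a sufficiently new kernel), the driver can be selected in /etc/docker/daemon.json. This is only a sketch, and as noted above, existing images and containers have to be rebuilt or re-pulled afterwards:

# mkdir -p /etc/docker
# cat > /etc/docker/daemon.json <<'EOF'
{
    "storage-driver": "overlay2"
}
EOF
# systemctl restart docker
# docker info | grep 'Storage Driver'    # should now report overlay2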

Upvotes: 0

beetree

Reputation: 941

Okay, I think I found the root cause of this. The folks over at Docker helped me out; check out this thread on GitHub.

It turns out this is most likely a bug in the 4.x Linux kernel (4.2.0-19 in my case). I'll be rolling back to an older 3.x kernel until it is fixed.

UPDATE: I've been running 3.* kernels only in my cluster for several days now without any issues. This was almost certainly a kernel bug.
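If anyone needs to do the same rollback on Ubuntu, the rough procedure is to install a 3.x kernel image and point GRUB at it. This is only a sketch: the exact package name and menu entry depend on what is available for your release, so the 3.19.0-47 version below is purely illustrative:

# apt-get install linux-image-3.19.0-47-generic linux-headers-3.19.0-47-generic    # illustrative version
# grep -i 'menuentry ' /boot/grub/grub.cfg | cut -d"'" -f2                          # find the exact GRUB entry names
# vi /etc/default/grub    # set GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 3.19.0-47-generic"
# update-grub && reboot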

Upvotes: 3
