Reputation: 941
I run a docker cluster with a few thousand containers and a few times per day randomly I have a process that gets "stuck" blocking a container from stopping. Below is an example container with its corresponding process and all things I have tried to kill the container / process.
The container:
# docker ps | grep 950677e2317f
950677e2317f 7e553d1d9f6f "/bin/sh -c /minecraf" 2 days ago Up 2 days 0.0.0.0:22661->22661/tcp, 0.0.0.0:22661->22661/udp, 0.0.0.0:37681->37681/tcp, 0.0.0.0:37681->37681/udp gloomy_jennings
Try to stop container using docker daemon (it tries forever without result):
# time docker stop --time=1 950677e2317f
^C
real 0m13.508s
user 0m0.036s
sys 0m0.008s
Daemon log while trying to stop:
# journalctl -fu docker.service
-- Logs begin at Fri 2015-12-11 15:40:55 CET. --
Dec 31 23:30:33 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:33.164731953+01:00" level=info msg="POST /v1.21/containers/950677e2317f/stop?t=1"
Dec 31 23:30:34 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:34.165531990+01:00" level=info msg="Container 950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942 failed to exit within 1 seconds of SIGTERM - using the force"
Dec 31 23:30:44 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:44.165954266+01:00" level=info msg="Container 950677e2317f failed to exit within 10 seconds of kill - trying direct SIGKILL"
Looking into the processes running on the machine reveals the zombie process (pid 11991 on host machine):
# ps aux | grep [1]1991
root 11991 84.3 0.0 5836 132 ? R Dec30 1300:19 bash -c (echo stop > /tmp/minecraft &)
# top -b | grep [1]1991
11991 root 20 0 5836 132 20 R 89.5 0.0 1300:29 bash
And it is indeed a process running within our container (check container id):
# cat /proc/11991/mountinfo
...
/var/lib/docker/containers/950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered
Trying to kill the process yields nothing:
# kill -9 11991
# ps aux | grep [1]1991
root 11991 84.3 0.0 5836 132 ? R Dec30 1303:58 bash -c (echo stop > /tmp/minecraft &)
Some overview data:
# docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
# docker info
Containers: 189
Images: 322
Server Version: 1.9.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 700
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-19-generic
Operating System: Ubuntu 15.10
CPUs: 24
Total Memory: 125.8 GiB
Name: m3561.contabo.host
ID: ZM2Q:RA6Q:E4NM:5Q2Q:R7E4:BFPQ:EEVK:7MEO:YRH6:SVS6:RIHA:3I2K
# uname -a
Linux m3561.contabo.host 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
If stopping the docker daemon the process still lives. The only way to get rid of the process is to restart the host machine. As this happens fairly frequently (requires every node to restart every 3-7 days) it has a serious impact on the uptime of the overall cluster.
Any ideas on what to do here?
Upvotes: 4
Views: 6641
Reputation: 5310
I had a similar problem and switching to use overlay2 storage driver made the problem go away. Changing the storage driver will loose all docker state (images & containers). It seems that the aufs storage driver has some problems that might be a source of lock ups.
Upvotes: 0
Reputation: 941
Okay, I think I found the root cause of this. The folks over at Docker helped me out, check out this thread on GitHub.
It turns out this most likely is a bug in the Linux Kernel 4.19+. I'll be rolling back to an older version until it is fixed.
UPDATE: I've been running 3.* only in my cluster for several days now without any issues. This was most certainly a kernel bug.
Upvotes: 3