Reputation: 3154
I'm following the Docker tutorials here https://docs.docker.com/get-started/part3/
When I execute the command docker swarm leave --force
near the end of the linked tutorial, I keep getting an Error response from daemon: context deadline exceeded.
Every subsequent time I run the docker swarm leave --force
command, the terminal appears to just hang: it no longer shows the error message, but it doesn't return to the prompt for me to enter any commands unless I press CTRL+C.
The docker swarm init
command at the beginning of the linked tutorial is also unresponsive when it's in this state.
The only time the docker swarm commands work again is if I close out my VM instance and restart it. But when I follow the steps from the link again, I get the same error on the docker swarm leave --force
command.
Any ideas why it's doing this?
I'm running Ubuntu 18.04.1 LTS in VirtualBox, with Docker version 18.09.0-rc1, build 6e632f7.
I saw this other link, Cannot leave swarm mode, about the same issue. It is 2 years old, and the answers there appear to be workarounds or removing Docker completely and reinstalling to get it working. I'm hoping there is another way to fix this.
Upvotes: 4
Views: 8060
Reputation: 2678
What works for me with failing managers is not restarting the whole node, but stopping the docker service, removing the /var/lib/docker/swarm
directory, restarting the docker service, and then re-adding the manager:
On manager-failing (the failing manager):
sudo systemctl stop docker
sudo rm -r /var/lib/docker/swarm
sudo systemctl start docker
On manager-working (other, functioning manager):
docker node demote manager-failing
docker node rm manager-failing
ssh manager-failing $(docker swarm join-token manager | tail -2)
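To verify the recovery afterwards (my own sanity check, not strictly part of the steps above), list the nodes from a working manager; manager-failing should show up again with a MANAGER STATUS of Reachable or Leader:
docker node ls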
Upvotes: 4
Reputation: 2463
Well, I have some good and bad news for you.
I faced the same issue in 2016-2017 while building a large experimental Docker swarm environment. We were building a multi-region swarm cluster with DNS load balancing; it was a 50+ node cluster.
At one point our Ceph storage cluster crashed and took a lot of the swarm nodes down with it. When all the nodes came back online, I experienced the same issues you describe.
The good news:
What worked for me was stopping the docker service, rebooting, and restarting docker. All the services running on the cluster magically reappeared as if nothing had happened.
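Roughly, that sequence is just the following (a minimal sketch assuming a systemd-based host like Ubuntu, not copied from my old notes):
sudo systemctl stop docker
sudo reboot
# once the node is back up:
sudo systemctl start docker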
The bad news:
This worked on most of the nodes. Some swarm managers never recovered; those nodes I simply destroyed and spun up new ones to add to the swarm.
EDIT: I have dug out some old scripts that I used for swarm recovery.
To restore a failed swarm manager, you should first make a backup of the configuration and spin up a new instance.
mkdir /root/Backup
cp -rf /var/lib/docker/swarm /root/Backup
cd /root/Backup
tar -czvf swarm.tar.gz swarm/
scp swarm.tar.gz user@new_host:/tmp
On the new host, restore the config:
cp /tmp/swarm.tar.gz /var/lib/docker
cd /var/lib/docker
tar -xzvf swarm.tar.gz
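Depending on how the restored state looks, the new host may also need to re-initialize the swarm from that data before it accepts commands again. I don't have that step in my old script, but it would be something along the lines of:
docker swarm init --force-new-cluster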
Drain your worker nodes:
docker node update --availability drain [node]
Update all your running services:
docker service update --force [service]
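If you have a lot of services, something like this loop saves typing (a one-liner from memory, so adjust as needed):
for service in $(docker service ls -q); do docker service update --force $service; done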
Upvotes: 1