Reputation: 6206
For some background on my environment:
I have Docker Swarm running on three Ubuntu 14.04 Vagrant boxes. The swarm master runs on one machine (with Consul), and the other two machines run swarm workers that are joined to the master. I set up the environment following the documentation page https://docs.docker.com/swarm/install-manual/. It is working correctly: any docker -H :4000 <some_docker_command> run from my master machine works fine. Service discovery is active, as I am running the gliderlabs/registrator container on both of my workers.
The issue:
Any change to my cluster, such as a node or container failure, and the resulting rescheduling of containers by swarm (the containers are created with -e "reschedule:on-node-failure") takes about 30 - 45 seconds to show up. By comparison, when I was running fleet and etcd on CoreOS systems, container rescheduling and notification of node failures usually occurred within about 5 seconds. Is there any way to change some of the settings in Consul and Docker Swarm to speed everything up to a level similar to what I experienced with fleet and etcd on CoreOS? If so, what would I need to do?
tl;dr: I am running swarm with Consul. Container rescheduling and changes to the output of docker -H :4000 ps don't occur until about 30 - 45 seconds after a node goes down. How can I reduce this time period?
Upvotes: 2
Views: 301
Reputation: 6173
You could try to set the TTL and retries to lower values to get the swarm manager to act faster on failures.
For example:
docker run swarm manage --engine-failure-retry=1 consul://x.y.z.a:8500
Full documentation
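The example above only lowers the retry count. A fuller sketch that also shortens the refresh interval, heartbeat and TTL might look like the following. The flag names are as I recall them from the standalone swarm image and the values are illustrative guesses, not tested recommendations, so check them against docker run --rm swarm manage --help and docker run --rm swarm join --help first. <manager_ip>, <node_ip> and <consul_ip> are placeholders for your own addresses.

```shell
# Manager: refresh engine state more often and give up on a node after one failed attempt.
# (Assumed defaults: refresh interval 30s-60s, 3 retries.)
docker run -d -p 4000:4000 swarm manage -H :4000 \
  --advertise <manager_ip>:4000 \
  --engine-refresh-min-interval=5s \
  --engine-refresh-max-interval=10s \
  --engine-failure-retry=1 \
  consul://<consul_ip>:8500

# Each worker: register with the discovery backend more often and let the entry expire sooner.
# (Assumed defaults: heartbeat 60s, TTL 180s. Keep the TTL larger than the heartbeat.)
docker run -d swarm join \
  --advertise=<node_ip>:2375 \
  --heartbeat=5s \
  --ttl=15s \
  consul://<consul_ip>:8500
```

Shorter intervals mean more traffic to Consul and to each engine, and a retry count of 1 makes the manager more likely to mark a node unhealthy after a brief network blip, so it is worth lowering the values gradually.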
Upvotes: 0