Alex Cohen

Reputation: 6206

How to prevent zombie services with consul and gliderlabs/registrator?

I am using consul with the gliderlabs/registrator container to show my active containers in consul. When I delete containers too quickly, the service is not removed from consul, leaving "zombie" services that no longer exist. I have heard that registrator supports extra options to prevent this, such as -cleanup, but I have not been able to run any registrators successfully with that option. This is my current docker run command for my registrators:

docker run -d -h $(hostname -i) --name registrator1 \
-v /var/run/docker.sock:/tmp/docker.sock gliderlabs/registrator \
consul://$(hostname -i):8500

What do I have to add to this run command to have registrator remove any containers from consul that no longer exist or have gone down?
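For reference, a minimal sketch of what the -cleanup variant of the command might look like, based on the flags documented in the gliderlabs/registrator README (-cleanup removes dangling services, but only during a resync pass, so -resync must also be set; the 30-second interval here is an arbitrary example). Note that registrator's flags must come before the registry URI:

```shell
# Sketch: run registrator with periodic resync and cleanup of
# dangling services. Flags go before the consul:// URI.
docker run -d -h $(hostname -i) --name registrator1 \
  -v /var/run/docker.sock:/tmp/docker.sock gliderlabs/registrator \
  -cleanup -resync 30 \
  consul://$(hostname -i):8500
```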

Update: I have found the problem

So I am running swarm using my consul cluster along with registrators. To provide failover for swarm, I placed a load balancer in front of my consul cluster and pointed my swarm and registrator containers at the load balancer's IP address. This allowed any consul node to go down without losing swarm.

However, swarm does not register itself as a service. It registers each node as a key/value entry, which is not bound to any single node in the consul cluster. Containers registered into consul by registrator, on the other hand, are created as services and bound to a single consul server.
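The difference can be seen through consul's HTTP API. A hypothetical sketch, assuming the load balancer answers on port 8500 and swarm was pointed at a consul:// discovery URL with the prefix "swarm" (both assumptions; adjust to your setup):

```shell
# Swarm nodes live in the key/value store, which is replicated
# across the whole consul cluster:
curl http://lb:8500/v1/kv/swarm?recurse

# Registrator-created entries show up in the service catalog instead,
# each tied to the consul node that registered it:
curl http://lb:8500/v1/catalog/services
```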

I think what's happening is that when I delete a container, registrator goes to deregister the service from consul, but since my LB is just doing round robin, it only has a 33% chance of hitting the consul server that holds the service registration.

All of my swarm masters, load balancer, consul servers and swarm workers are running on different machines. My registrators are running on my swarm worker machines. Everything is running in containers.

Enabling sticky load balancing is a temporary fix that solves my issue. However, I think the real solution may be to run some kind of consul worker on each of my swarm workers and have registrator bind to the consul agent running on the local host. I believe this might be the "bench-worker" described in consul's GitHub repo: https://github.com/hashicorp/consul/tree/master/bench. I am still fairly new to consul, so I'm still trying to figure it all out.

Upvotes: 1

Views: 1041

Answers (1)

Alex Cohen

Reputation: 6206

The answer was to run consul workers, formally known as consul clients, on all my swarm worker nodes. This can be done by simply removing the -server flag from my progrium/consul run commands. My registrators then report to the consul client running on each machine instead of binding themselves to the consul servers. Because progrium/consul is outdated and no longer maintained, there is still an issue with zombies that appear when a container is stopped ungracefully (i.e. any way other than docker stop) and removed afterwards.
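A sketch of the per-worker setup described above, assuming the progrium/consul image conventions (client mode is simply the absence of -server) and a placeholder <consul-server-ip> that you would replace with one of your server addresses:

```shell
# On each swarm worker: run a consul agent in client mode (no -server
# flag) and join it to the existing cluster.
docker run -d --name consul-client -h $(hostname) \
  -p 8500:8500 \
  progrium/consul -advertise $(hostname -i) -join <consul-server-ip>

# Point registrator at the local agent rather than the load balancer,
# so registration and deregistration always hit the same node.
docker run -d -h $(hostname -i) --name registrator1 \
  -v /var/run/docker.sock:/tmp/docker.sock gliderlabs/registrator \
  consul://$(hostname -i):8500
```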

Upvotes: 0
