Reputation: 3253

Docker swarm: guarantee high availability after restart

I have an issue using Docker swarm.

I have 3 replicas of a Python web service running on Gunicorn.

The issue is that when I restart the swarm service after a software update, an old running service is killed, then a new one is created and started. But in the short period of time when the old service is already killed, and the new one didn't fully start yet, network messages are already routed to the new instance that isn't ready yet, resulting in 502 bad gateway errors (I proxy to the service from nginx).

I use --update-parallelism 1 --update-delay 10s options, but this doesn't eliminate the issue, only slightly reduces chances of getting the 502 error (because there are always at least 2 services running, even if one of them might be still starting up).

Upvotes: 1

Answers (2)

BMitch

Reputation: 263627

There are some known issues (e.g. moby/moby #30321) with rolling upgrades in docker swarm with the current 17.05 and earlier releases (and doesn't look like all the fixes will make 17.06). These issues will result in connection errors during a rolling upgrade like you're seeing.

If you have a true zero downtime deployment requirement and can't solve this with a client side retry, then I'd recommend putting in some kind of blue/green switch in front of your swarm and do the rolling upgrade to the non-active set of containers until docker finds solutions to all of the scenarios.

Upvotes: 0

Robert

Reputation: 36773

So, following what I've proposed in comments:

Use the HEALTHCHECK feature of Dockerfile: Docs. Something like:

HEALTHCHECK --interval=5m --timeout=3s \
  CMD curl -f http://localhost/ || exit 1

Knowing that Docker Swarm does honor this healthcheck during service updates, it's relative easy to have a zero downtime deployment.

But as you mentioned, you have a high-resource consumer health-check, and you need larger healthcheck-intervals.

In that case, I recomend you to customize your healthcheck doing the first run immediately and the successive checks at current_minute % 5 == 0, but the healthcheck itself running /30s:

HEALTHCHECK --interval=30s --timeout=3s \
  CMD /service_healthcheck.sh

healthcheck.sh

#!/bin/bash

CURRENT_MINUTE=$(date +%M)
INTERVAL_MINUTE=5

[ $((a%2)) -eq 0 ]
do_healthcheck() {
  curl -f http://localhost/ || exit 1
}

if [ ! -f /tmp/healthcheck.first.run ]; then
  do_healhcheck
  touch /tmp/healthcheck.first.run
  exit 0
fi

# Run only each minute that is multiple of $INTERVAL_MINUTE
[ $(($CURRENT_MINUTE%$INTERVAL_MINUTE)) -eq 0 ] && do_healhcheck
exit 0

Remember to COPY the healthcheck.sh to /healthcheck.sh (and chmod +x)

Upvotes: 0

Docker swarm: guarantee high availability after restart

Answers (2)

Related Questions