siddharth.nair

Reputation: 149

CRITICAL WORKER TIMEOUT on gunicorn when deployed to AWS

I have a Flask web app served by gunicorn. I use the gevent worker class because it previously stopped my [CRITICAL] WORKER TIMEOUT errors, but since deploying to AWS behind an ELB the errors have returned.

I had also tried the eventlet worker class before; that didn't help, but gevent did work locally.

This is the shell command I use as the entrypoint in my Dockerfile:

gunicorn -b 0.0.0.0:5000 --worker-class=gevent --worker-connections 1000 --timeout 60 --keep-alive 20 dataclone_controller:app
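
For reference, the same options can also be expressed as a gunicorn config file (a sketch using gunicorn's standard setting names), so the entrypoint reduces to gunicorn -c gunicorn.conf.py dataclone_controller:app:

    # gunicorn.conf.py -- same options as the command line above
    bind = "0.0.0.0:5000"
    worker_class = "gevent"
    worker_connections = 1000
    timeout = 60      # seconds before the master kills an unresponsive worker
    keepalive = 20    # seconds to hold idle keep-alive connections open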

When I check the logs on the pods, this is the only information that gets printed out:

    [2019-09-04 11:36:12 +0000] [8] [INFO] Starting gunicorn 19.9.0
    [2019-09-04 11:36:12 +0000] [8] [INFO] Listening at: http://0.0.0.0:5000 (8)
    [2019-09-04 11:36:12 +0000] [8] [INFO] Using worker: gevent
    [2019-09-04 11:36:12 +0000] [11] [INFO] Booting worker with pid: 11
    [2019-09-04 11:38:15 +0000] [8] [CRITICAL] WORKER TIMEOUT (pid:11)

Upvotes: 14

Views: 2686

Answers (1)

Shaheed Haque

Reputation: 723

For our Django application, we eventually tracked this down to memory exhaustion. This is difficult to diagnose because the AWS monitoring does not provide memory statistics (at least by default), and even if it did, it is not clear that a transient spike would be easy to spot.
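
One way to make such spikes visible is to have each worker log its own resident memory periodically. A minimal sketch, assuming the psutil package is installed (it is not part of the original setup); it could be started from a gunicorn post_fork hook:

    import logging
    import os
    import threading
    import time

    import psutil  # extra dependency, assumed installed

    log = logging.getLogger("memwatch")

    def start_memory_logger(interval_seconds=10):
        """Log this process's resident set size every interval_seconds."""
        proc = psutil.Process(os.getpid())

        def _loop():
            while True:
                rss_mib = proc.memory_info().rss / (1024 * 1024)
                log.info("pid=%d rss=%.1f MiB", proc.pid, rss_mib)
                time.sleep(interval_seconds)

        threading.Thread(target=_loop, daemon=True).start()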

Additional symptoms included:

  • We would often lose network connectivity to the VM at this point.
  • /var/log/syslog contained evidence of some processes restarting (in our case, mostly Hashicorp's Consul).
  • There was no evidence of the Linux OOM killer coming into play (a quick check for this is sketched just after this list).
  • We knew the system was busy because the AWS CPU stats would often show a spike (to, say, 60%).
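
To check the OOM point above, you can scan syslog for the kernel's OOM-killer messages. A small sketch (the log path is the Debian/Ubuntu default; adjust for other distributions):

    from pathlib import Path

    SYSLOG = Path("/var/log/syslog")  # Debian/Ubuntu default location

    def oom_events(path=SYSLOG):
        """Return syslog lines that suggest the kernel OOM killer fired."""
        markers = ("Out of memory", "oom-killer", "Killed process")
        with path.open(errors="replace") as f:
            return [line.rstrip() for line in f if any(m in line for m in markers)]

    if __name__ == "__main__":
        for event in oom_events():
            print(event)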

The fix for us lay in judicious conversion of Django queries that looked like this:

    for item in qs:
        do_something(item)

to use .iterator() like this:

    CHUNK_SIZE = 5
    ...
    for item in qs.iterator(chunk_size=CHUNK_SIZE):
        do_something(item)

which effectively trades extra database round-trips for lower memory usage. Note that CHUNK_SIZE = 5 made sense for us because we were fetching objects with large JSONB columns; more typical usage would probably want a value several orders of magnitude larger.
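
To make the pattern concrete, here is a self-contained sketch. The Document model, its payload column, and do_something are hypothetical stand-ins for whatever the real queryset touches (the model would live in an installed Django app, and models.JSONField needs Django 3.1+); the keyword spelling chunk_size= is equivalent to the positional form above:

    from django.db import models

    class Document(models.Model):
        payload = models.JSONField()  # maps to a JSONB column on PostgreSQL

    def do_something(item):
        ...  # placeholder for the real per-row work

    # Before: the queryset caches every row (and its JSONB payload) in memory.
    for item in Document.objects.all():
        do_something(item)

    # After: rows are streamed CHUNK_SIZE at a time, and the queryset's
    # result cache is skipped entirely.
    CHUNK_SIZE = 5
    for item in Document.objects.all().iterator(chunk_size=CHUNK_SIZE):
        do_something(item)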

Upvotes: 4
