Reputation: 5140
I have a Python/Flask web application that I am deploying via Gunicorn in a Docker image on Amazon ECS. Everything is going fine, and then suddenly I see this in the logs (the last successful request included):
[2017-03-29 21:49:42 +0000] [14] [DEBUG] GET /heatmap_column/e4c53623-2758-4863-af06-91bd002e0107/ADA
[2017-03-29 21:49:43 +0000] [1] [INFO] Handling signal: term
[2017-03-29 21:49:43 +0000] [14] [INFO] Worker exiting (pid: 14)
[2017-03-29 21:49:43 +0000] [8] [INFO] Worker exiting (pid: 8)
[2017-03-29 21:49:43 +0000] [12] [INFO] Worker exiting (pid: 12)
[2017-03-29 21:49:43 +0000] [10] [INFO] Worker exiting (pid: 10)
...
[2017-03-29 21:49:43 +0000] [1] [INFO] Shutting down: Master
Then the processes die off and the program exits. ECS restarts the service and the Docker image is run again, but in the meantime the service is interrupted.
What would be causing my program to get a TERM signal? I can't find any references to this happening on the web. Note that this only happens in Docker on ECS, not locally.
Upvotes: 18
Views: 25993
Reputation: 1705
If you have a health check set up, a long-ish request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
In my case, the worker was being killed by the liveness probe in Kubernetes! I have a gunicorn app with a single uvicorn worker, which only handles one request at a time. It worked fine locally, but the worker would sporadically get killed when deployed to Kubernetes, and only during a long-ish call that takes about 25 seconds. Even then it would not happen every time!
It turned out that my liveness check was configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times. That works out to roughly 10 + 1×3 ≈ 13 seconds, so you can see why it would trigger sometimes but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it tolerates the duration of your typical request, or to allow for more threads: anything that makes sure the health check is not blocked long enough to trigger a worker kill.
Adding more workers may also help with (or hide) the problem; see the sketch below.
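If you go the more-workers/more-threads route, a minimal gunicorn configuration sketch might look like this (the file name and the numbers are only illustrative assumptions, tune them to your workload):

```python
# gunicorn.conf.py - illustrative values only.
# Extra workers/threads keep one slow request from blocking the
# health-check endpoint long enough for the platform to kill the worker.
workers = 4    # several processes serving requests
threads = 2    # threads > 1 switches the default worker to gthread
timeout = 60   # longer than your slowest expected request
```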
Upvotes: 1
Reputation: 393
For me, it turned out that the worker was quitting because one of the containers in my Docker Swarm stack was failing repeatedly, which triggered a rollback. The gunicorn process received the 'term' signal when the rollback began.
Upvotes: 0
Reputation: 61
To add to rjurney's comment: in the AWS console for ECS, you can check the status of your application in the Events tab of the service running under your ECS cluster. That's how I found out about the failing health checks and other issues.
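If you prefer the SDK to the console, the same events are available programmatically; here is a minimal boto3 sketch (the cluster and service names are placeholders for your own):

```python
# Minimal sketch: print recent ECS service events with boto3.
import boto3

ecs = boto3.client("ecs")
resp = ecs.describe_services(cluster="my-cluster", services=["my-service"])

# Each service carries an "events" list with messages like failed health checks.
for event in resp["services"][0]["events"][:10]:
    print(event["createdAt"], event["message"])
```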
Upvotes: 3
Reputation: 9971
While not specifically applicable to the problem in the question, this behavior can be caused by external systems such as container orchestrators (e.g. Kubernetes).
For example, in the Kubernetes scenario, one solution might be to adjust the liveness or readiness probe configuration to allow for longer startup times.
Upvotes: 3
Reputation: 5140
It turned out that after adding a login page to the system, requests to / were getting a 302 redirect to /login, which failed the health check, so the container was periodically killed. Amazon support is awesome!
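One way to avoid this is to expose an unauthenticated health endpoint and point the load balancer health check at it instead of /. A minimal Flask sketch, assuming the /health path is an acceptable health-check target in your setup:

```python
# Minimal sketch: a health-check endpoint that sits outside the login flow,
# so the health check gets a plain 200 instead of a 302 redirect to /login.
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    # Intentionally no login requirement on this route.
    return "OK", 200
```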
Upvotes: 24