Reputation: 5140
I have a Python/Flask web application that I am deploying via Gunicorn in a Docker image on Amazon ECS. Everything is going fine, and then suddenly I see this in the logs (the last successful request included):
[2017-03-29 21:49:42 +0000] [14] [DEBUG] GET /heatmap_column/e4c53623-2758-4863-af06-91bd002e0107/ADA
[2017-03-29 21:49:43 +0000] [1] [INFO] Handling signal: term
[2017-03-29 21:49:43 +0000] [14] [INFO] Worker exiting (pid: 14)
[2017-03-29 21:49:43 +0000] [8] [INFO] Worker exiting (pid: 8)
[2017-03-29 21:49:43 +0000] [12] [INFO] Worker exiting (pid: 12)
[2017-03-29 21:49:43 +0000] [10] [INFO] Worker exiting (pid: 10)
...
[2017-03-29 21:49:43 +0000] [1] [INFO] Shutting down: Master
Then the processes die off and the program exits. ECS restarts the service and the Docker image is run again, but in the meantime the service is interrupted.
What would be causing my program to get a TERM signal? I can't find any references to this happening on the web. Note that this only happens in Docker on ECS, not locally.
Upvotes: 18
Views: 25993
Reputation: 1705
If you have a health check set up, a long-ish request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
In my case, the worker was being killed by the liveness probe in Kubernetes! I have a gunicorn app with a single uvicorn worker, which only handles one request at a time. It worked fine locally, but the worker would sporadically get killed when deployed to Kubernetes, and only during a long-ish call that takes about 25 seconds. Even then it would not happen every time!
It turned out that my liveness check was configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times. That works out to roughly 10 + 1×3 ≈ 13 seconds, so you can see why it would trigger sometimes but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it tolerates the duration of your typical request, or to allow for more threads: anything that makes sure the health check is not blocked long enough to trigger a worker kill.
Adding more workers may also help with (or hide) the problem; see the sketch below.
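If you go the more-workers/more-threads route, a minimal gunicorn configuration sketch might look like this (the file name and the numbers are only illustrative assumptions, tune them to your workload):

```python
# gunicorn.conf.py - illustrative values only.
# Extra workers/threads keep one slow request from blocking the
# health-check endpoint long enough for the platform to kill the worker.
workers = 4    # several processes serving requests
threads = 2    # threads > 1 switches the default worker to gthread
timeout = 60   # longer than your slowest expected request
```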
Upvotes: 1
Reputation: 393
For me, it turned out that the worker was quitting because one of the containers in my Docker Swarm stack was failing repeatedly, which triggered a rollback. The gunicorn process received the 'term' signal when the rollback began.
Upvotes: 0
Reputation: 61
To add to rjurney's comment: in the AWS console for ECS, you can check the status of your application in the Events tab of the service running under your ECS cluster. That's how I found out about the failing health checks and other issues.
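If you prefer the SDK to the console, the same events are available programmatically; here is a minimal boto3 sketch (the cluster and service names are placeholders for your own):

```python
# Minimal sketch: print recent ECS service events with boto3.
import boto3

ecs = boto3.client("ecs")
resp = ecs.describe_services(cluster="my-cluster", services=["my-service"])

# Each service carries an "events" list with messages like failed health checks.
for event in resp["services"][0]["events"][:10]:
    print(event["createdAt"], event["message"])
```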
Upvotes: 3
Reputation: 9971
While not specifically applicable to the problem in the question, this behavior can be caused by external systems such as container orchestrators (e.g. Kubernetes).
For example, in the Kubernetes scenario, one solution might be to adjust the liveness or readiness probe configuration to allow for longer startup times.
Upvotes: 3
Reputation: 5140
It turned out that after adding a login page to the system, requests to / were getting a 302 redirect to /login, which failed the health check, so the container was periodically killed. Amazon support is awesome!
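One way to avoid this is to expose an unauthenticated health endpoint and point the load balancer health check at it instead of /. A minimal Flask sketch, assuming the /health path is an acceptable health-check target in your setup:

```python
# Minimal sketch: a health-check endpoint that sits outside the login flow,
# so the health check gets a plain 200 instead of a 302 redirect to /login.
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    # Intentionally no login requirement on this route.
    return "OK", 200
```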
Upvotes: 24