brandonhilkert

Reputation: 4475

Flask app locks up with multiple processes

Gunicorn: 19.9.0

Flask: 1.0.2

Python: 3.6.7

We have a bunch of internal APIs serving data science models at multiple thousands of req./sec. We recently introduced a new one and, for whatever reason, when served with multiple processes (Gunicorn is our default), it'll serve a few hundred requests and then just lock up.

If I run the API as a bare file without Gunicorn, the following works ok:

app.run(ip, port=port, threaded=True)

If I run with multiple processes, it locks up shortly after starting:

app.run(ip, port=port, threaded=False, processes=2)

If I use Gunicorn with workers=1, it locks up too, here's the config:

preload_app = False
bind = "0.0.0.0:{}".format(8889)
workers = 1
debug = False
timeout = 120

I've commented out all code in the endpoint and that's had no effect on it locking up. It feels like some kind of conflict with a dependency, but I'm having trouble pinpointing it.
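For reference, the stripped-down endpoint is roughly the following (a minimal sketch; the route path and response body are assumptions, not the actual API):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["GET"])
def predict():
    # Endpoint body fully commented out; the lockup still occurs,
    # so the model/dependency code itself isn't executing when it hangs.
    return jsonify({"status": "ok"})

# Served with: app.run(ip, port=port, threaded=True)                 # works
#          or: app.run(ip, port=port, threaded=False, processes=2)   # locks up
```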

If I try to attach using strace while it's locked, I get a tight loop with the following output on the master gunicorn process:

strace: Process 4387 attached
select(4, [3], [], [], {tv_sec=0, tv_usec=832486}) = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|001, st_size=0, ...}) = 0
select(4, [3], [], [], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|001, st_size=0, ...}) = 0
select(4, [3], [], [], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|001, st_size=0, ...}) = 0
select(4, [3], [], [], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|001, st_size=0, ...}) = 0

Any suggestions on where to go or what to try at this point?

Upvotes: 1

Views: 864

Answers (1)

brandonhilkert

Reputation: 4475

It appears it was due to the combination of the number of clients and the lack of a reverse proxy (e.g. nginx) in front of it. There weren't enough workers available to queue requests relative to the number of clients, which overwhelmed the workers to the point where they would stop responding. I bumped the workers to 60 and there's much more consistent throughput.
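As a Gunicorn config file, the adjustment looks roughly like this (a sketch; the worker count of 60 is what worked for this load, not a general recommendation — tune it to your concurrency, and put nginx in front for request buffering):

```python
# gunicorn.conf.py — adjusted config (values illustrative)
preload_app = False
bind = "0.0.0.0:{}".format(8889)
workers = 60   # bumped from 1 so workers aren't overwhelmed by concurrent clients
timeout = 120
```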

Upvotes: 2
