Reputation: 3079
I have two server environments running the same app. The first, which I intend to abandon, is a Standard Google App Engine environment that has many limitations. The second one is a Google Kubernetes cluster running my Python app with Gunicorn.
On the first server, I can send multiple requests to the app and it will answer many of them simultaneously. I ran two batches of simultaneous requests against the app in both environments. On Google App Engine, both batches were answered simultaneously and the first didn't block the second.
On Kubernetes, the server only responds to 6 requests simultaneously, and the first batch blocks the second. I've read some posts on how to achieve Gunicorn concurrency with gevent or multiple threads, and all of them say I need more CPU cores, but no matter how much CPU I give it, the limitation remains. I've tried Google nodes from 1 vCPU to 8 vCPUs and it doesn't change much.
Can you guys give me any ideas on what I'm possibly missing? Maybe a limitation of the Google cluster nodes?
As you can see, the second batch only started receiving responses after the first one started to finish.
Upvotes: 6
Views: 5313
Reputation: 1029
What you describe appears to be an indicator that you're running the Gunicorn server with the sync worker class while serving an I/O-bound application. Can you share your Gunicorn configuration?
Is it possible that Google's platform has some kind of autoscaling feature (I'm not really familiar with their service) that's being triggered, while your Kubernetes configuration does not?
Generally speaking, increasing the number of cores on a single instance will only help if you also increase the number of workers spawned to handle incoming requests. Please see Gunicorn's design documentation, with special emphasis on the worker types section (and why sync workers are suboptimal for I/O-bound applications) - it's a good read and provides a more detailed explanation of this problem.
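If it helps as a starting point, here's a minimal sketch of a Gunicorn config file that puts those two knobs (worker count and worker class) together - the file name and every number in it are assumptions to tune for your own workload, not a recommendation:

# gunicorn.conf.py - a minimal sketch; all values here are assumptions.
import multiprocessing

bind = "127.0.0.1:9001"

# Rule of thumb from the Gunicorn docs for the worker count; for an
# I/O-bound app the worker *class* matters far more than this number.
workers = multiprocessing.cpu_count() * 2 + 1

# An async worker lets each worker process juggle many in-flight requests.
worker_class = "gevent"

# Upper bound of simultaneous clients per gevent worker.
worker_connections = 1000

You'd start it with something like: gunicorn -c gunicorn.conf.py app:app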
Just for fun, here's a small exercise to compare the two approaches:
import time

def app(env, start_response):
    time.sleep(1)  # takes 1 second to process the request (simulated I/O wait)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World']
Running Gunicorn with 4 sync workers (with the snippet above saved as app.py inside an app directory, hence the --chdir app): gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class sync --chdir app app:app
Let's trigger 8 requests at the same time: ab -n 8 -c 8 "http://localhost:9001/"
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software: gunicorn/19.8.1
Server Hostname: localhost
Server Port: 9001
Document Path: /
Document Length: 11 bytes
Concurrency Level: 8
Time taken for tests: 2.007 seconds
Complete requests: 8
Failed requests: 0
Total transferred: 1096 bytes
HTML transferred: 88 bytes
Requests per second: 3.99 [#/sec] (mean)
Time per request: 2006.938 [ms] (mean)
Time per request: 250.867 [ms] (mean, across all concurrent requests)
Transfer rate: 0.53 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.2 1 1
Processing: 1003 1504 535.7 2005 2005
Waiting: 1002 1504 535.8 2005 2005
Total: 1003 1505 535.8 2006 2006
Percentage of the requests served within a certain time (ms)
50% 2006
66% 2006
75% 2006
80% 2006
90% 2006
95% 2006
98% 2006
99% 2006
100% 2006 (longest request)
Around 2 seconds to complete the test. That's the behavior you saw in your tests - the first 4 requests kept your workers busy, and the second batch was queued until the first one was processed: 8 requests / 4 workers = 2 sequential batches of ~1 second each, hence ~2 seconds total.
Same test, but let's tell Gunicorn to use an async worker: gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class gevent --chdir app app:app
Same test as above: ab -n 8 -c 8 "http://localhost:9001/"
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software: gunicorn/19.8.1
Server Hostname: localhost
Server Port: 9001
Document Path: /
Document Length: 11 bytes
Concurrency Level: 8
Time taken for tests: 1.005 seconds
Complete requests: 8
Failed requests: 0
Total transferred: 1096 bytes
HTML transferred: 88 bytes
Requests per second: 7.96 [#/sec] (mean)
Time per request: 1005.463 [ms] (mean)
Time per request: 125.683 [ms] (mean, across all concurrent requests)
Transfer rate: 1.06 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.4 1 2
Processing: 1002 1003 0.6 1003 1004
Waiting: 1001 1003 0.9 1003 1004
Total: 1002 1004 0.9 1004 1005
Percentage of the requests served within a certain time (ms)
50% 1004
66% 1005
75% 1005
80% 1005
90% 1005
95% 1005
98% 1005
99% 1005
100% 1005 (longest request)
We actually doubled the application's throughput here - it took only ~1s to reply to all 8 requests. (The gevent worker monkey-patches the standard library, which is why the blocking time.sleep call in the example becomes cooperative and each worker can serve more than one request at a time.)
To understand what happened, Gevent has a great tutorial about its architecture, and this article has a more in-depth explanation about coroutines.
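If you want to see the mechanism in isolation, here's a toy gevent snippet (separate from the Gunicorn benchmark above, and purely illustrative) showing how greenlets overlap their waits on a single thread:

import gevent

def handle(i):
    print("request %d: start" % i)
    gevent.sleep(1)  # cooperative sleep: yields control instead of blocking
    print("request %d: done" % i)

# Spawn 8 fake "requests"; the whole batch finishes in ~1s rather than 8s
# because the sleeps overlap while each greenlet waits.
gevent.joinall([gevent.spawn(handle, i) for i in range(8)])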
I apologize in advance if I was way off on the actual cause of your problem (I do believe some additional information is missing from your initial post for anyone to give a conclusive answer). If not to you, I hope this will be helpful to someone else. :)
Also, do notice that I've oversimplified things a lot (my example was a simple proof of concept); tweaking an HTTP server's configuration is mostly a trial-and-error exercise - it all depends on the type of workload the application has and the hardware it sits on.
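As a side note, since you mentioned threading: Gunicorn also ships a thread-based worker (gthread) that keeps one process per worker but serves several requests per worker on threads, e.g.: gunicorn --bind '127.0.0.1:9001' --workers 4 --threads 8 --worker-class gthread --chdir app app:app. With 4 workers x 8 threads you'd get up to 32 concurrent requests; those numbers are just an illustration, not tuned values.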
Upvotes: 10