Reputation: 3079
I have two server environments running the same app. The first, which I intend to abandon, is a Standard Google App Engine environment that has many limitations. The second one is a Google Kubernetes cluster running my Python app with Gunicorn.
On the first server, I can send multiple requests to the app and it will answer many of them simultaneously. I ran two batches of simultaneous requests against the app in both environments. On Google App Engine, both batches were answered simultaneously and the first didn't block the second.
On Kubernetes, the server only responds to 6 requests simultaneously, and the first batch blocks the second. I've read some posts on how to achieve Gunicorn concurrency with gevent or multiple threads, and all of them say I need more CPU cores, but no matter how much CPU I give it, the limitation remains. I've tried Google nodes from 1 vCPU to 8 vCPUs and it doesn't change much.
Can you guys give me any ideas on what I'm possibly missing? Maybe a limitation of the Google cluster nodes?
As you can see, the second batch only started receiving responses after the first one started to finish.
Upvotes: 6
Views: 5313
Reputation: 1029
What you describe appears to be an indicator that you're running the Gunicorn server with the sync worker class while serving an I/O-bound application. Can you share your Gunicorn configuration?
Is it possible that Google's platform has some kind of autoscaling feature (I'm not really familiar with their service) that's being triggered, while your Kubernetes configuration does not?
Generally speaking, increasing the number of cores on a single instance will only help if you also increase the number of workers spawned to handle incoming requests. Please see Gunicorn's design documentation, with special emphasis on the worker types section (and why sync workers are suboptimal for I/O-bound applications) - it's a good read and provides a more detailed explanation of this problem.
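If it helps as a starting point, here's a minimal sketch of a Gunicorn config file that puts those two knobs (worker count and worker class) together - the file name and every number in it are assumptions to tune for your own workload, not a recommendation:

# gunicorn.conf.py - a minimal sketch; all values here are assumptions.
import multiprocessing

bind = "127.0.0.1:9001"

# Rule of thumb from the Gunicorn docs for the worker count; for an
# I/O-bound app the worker *class* matters far more than this number.
workers = multiprocessing.cpu_count() * 2 + 1

# An async worker lets each worker process juggle many in-flight requests.
worker_class = "gevent"

# Upper bound of simultaneous clients per gevent worker.
worker_connections = 1000

You'd start it with something like: gunicorn -c gunicorn.conf.py app:app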
Just for fun, here's a small exercise to compare the two approaches:
import time

def app(env, start_response):
    time.sleep(1)  # takes 1 second to process the request (simulated I/O wait)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello World']
Running Gunicorn with 4 sync workers (with the snippet above saved as app.py inside an app directory, hence the --chdir app): gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class sync --chdir app app:app
Let's trigger 8 requests at the same time: ab -n 8 -c 8 "http://localhost:9001/"
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software: gunicorn/19.8.1
Server Hostname: localhost
Server Port: 9001
Document Path: /
Document Length: 11 bytes
Concurrency Level: 8
Time taken for tests: 2.007 seconds
Complete requests: 8
Failed requests: 0
Total transferred: 1096 bytes
HTML transferred: 88 bytes
Requests per second: 3.99 [#/sec] (mean)
Time per request: 2006.938 [ms] (mean)
Time per request: 250.867 [ms] (mean, across all concurrent requests)
Transfer rate: 0.53 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.2 1 1
Processing: 1003 1504 535.7 2005 2005
Waiting: 1002 1504 535.8 2005 2005
Total: 1003 1505 535.8 2006 2006
Percentage of the requests served within a certain time (ms)
50% 2006
66% 2006
75% 2006
80% 2006
90% 2006
95% 2006
98% 2006
99% 2006
100% 2006 (longest request)
Around 2 seconds to complete the test. That's the behavior you saw in your tests - the first 4 requests kept your workers busy, and the second batch was queued until the first one was processed: 8 requests / 4 workers = 2 sequential batches of ~1 second each, hence ~2 seconds total.
Same test, but let's tell Gunicorn to use an async worker: gunicorn --bind '127.0.0.1:9001' --workers 4 --worker-class gevent --chdir app app:app
Same test as above: ab -n 8 -c 8 "http://localhost:9001/"
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software: gunicorn/19.8.1
Server Hostname: localhost
Server Port: 9001
Document Path: /
Document Length: 11 bytes
Concurrency Level: 8
Time taken for tests: 1.005 seconds
Complete requests: 8
Failed requests: 0
Total transferred: 1096 bytes
HTML transferred: 88 bytes
Requests per second: 7.96 [#/sec] (mean)
Time per request: 1005.463 [ms] (mean)
Time per request: 125.683 [ms] (mean, across all concurrent requests)
Transfer rate: 1.06 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.4 1 2
Processing: 1002 1003 0.6 1003 1004
Waiting: 1001 1003 0.9 1003 1004
Total: 1002 1004 0.9 1004 1005
Percentage of the requests served within a certain time (ms)
50% 1004
66% 1005
75% 1005
80% 1005
90% 1005
95% 1005
98% 1005
99% 1005
100% 1005 (longest request)
We actually doubled the application's throughput here - it took only ~1s to reply to all 8 requests. (The gevent worker monkey-patches the standard library, which is why the blocking time.sleep call in the example becomes cooperative and each worker can serve more than one request at a time.)
To understand what happened, Gevent has a great tutorial about its architecture, and this article has a more in-depth explanation about coroutines.
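If you want to see the mechanism in isolation, here's a toy gevent snippet (separate from the Gunicorn benchmark above, and purely illustrative) showing how greenlets overlap their waits on a single thread:

import gevent

def handle(i):
    print("request %d: start" % i)
    gevent.sleep(1)  # cooperative sleep: yields control instead of blocking
    print("request %d: done" % i)

# Spawn 8 fake "requests"; the whole batch finishes in ~1s rather than 8s
# because the sleeps overlap while each greenlet waits.
gevent.joinall([gevent.spawn(handle, i) for i in range(8)])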
I apologize in advance if I was way off on the actual cause of your problem (I do believe some additional information is missing from your initial post for anyone to give a conclusive answer). If not to you, I hope this will be helpful to someone else. :)
Also, do notice that I've oversimplified things a lot (my example was a simple proof of concept); tweaking an HTTP server's configuration is mostly a trial-and-error exercise - it all depends on the type of workload the application has and the hardware it sits on.
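As a side note, since you mentioned threading: Gunicorn also ships a thread-based worker (gthread) that keeps one process per worker but serves several requests per worker on threads, e.g.: gunicorn --bind '127.0.0.1:9001' --workers 4 --threads 8 --worker-class gthread --chdir app app:app. With 4 workers x 8 threads you'd get up to 32 concurrent requests; those numbers are just an illustration, not tuned values.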
Upvotes: 10