Reputation: 536
Disclaimer: this is technically related to a school project, but I've talked to my professor and he is also confused by this.
I have an nginx load balancer reverse proxying to several uWSGI + Flask apps. The apps need to handle very high throughput/load. My response times from uWSGI are pretty good, and the nginx server has low CPU usage and load average, but the overall request time is extremely high.
I've looked into this, and every thread I've found says the gap is always caused by the client having a slow connection. However, the requests are made by a script on the same network, and the issue isn't affecting anyone else's setup, so it seems to be a problem with my nginx config. That has me totally stumped, though, because it's almost unheard of for nginx to be the bottleneck like this.
To give an idea of the magnitude of the problem, there are three primary request types: add image, search, and add tweet (it's a twitter clone).
For add image, the overall request time is ~20x the upstream response time on average; for search it's a factor of ~3, and for add tweet ~1.5. My theory for the difference is the amount of data being sent back and forth: it's much larger for add image than for either search or add tweet, and larger for search than for add tweet.
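To quantify that gap per endpoint, the access log produced by the custom log format below can be aggregated with a short script. This is just a sketch: the regex assumes the exact `req_time` format from this post, and the function name `ratios_by_path` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the custom "req_time" log_format defined in the nginx.conf below.
# Lines where $upstream_response_time is "-" (no upstream reached) are skipped.
LINE_RE = re.compile(
    r"REQUEST: \S+ (?P<path>\S+).*"
    r"UPSTREAM_TIME: (?P<upstream>[\d.]+) s.*"
    r"REQ_TIME: (?P<total>[\d.]+) s"
)

def ratios_by_path(lines):
    """Return {path: total_request_time / upstream_response_time} per path."""
    sums = defaultdict(lambda: [0.0, 0.0])  # path -> [sum_total, sum_upstream]
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        s = sums[m.group("path")]
        s[0] += float(m.group("total"))
        s[1] += float(m.group("upstream"))
    return {p: t / u for p, (t, u) in sums.items() if u > 0}

# Typical use: ratios_by_path(open("/var/log/nginx/access.log"))
```

A ratio near 1 means nginx adds almost no overhead; the ~20x ratio on /addmedia is what stands out here.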
My nginx.conf is:
user www-data;
pid /run/nginx.pid;
worker_processes auto;
worker_rlimit_nofile 30000;
events {
    worker_connections 30000;
}
http {
    # Settings.
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    client_body_buffer_size 200K;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # SSL.
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_prefer_server_ciphers on;

    # Logging
    log_format req_time '$remote_addr - $remote_user [$time_local] '
                        'REQUEST: $request '
                        'STATUS: $status '
                        'BACK_END: $upstream_addr '
                        'UPSTREAM_TIME: $upstream_response_time s '
                        'REQ_TIME: $request_time s '
                        'CONNECT_TIME: $upstream_connect_time s';
    error_log /var/log/nginx/error.log;
    access_log /var/log/nginx/access.log req_time;

    # GZIP business
    gzip on;
    gzip_disable "msie6";

    # Routing.
    upstream media {
        server 192.168.1.91:5000;
    }

    upstream search {
        least_conn;
        server 192.168.1.84:5000;
        server 192.168.1.134:5000;
    }

    upstream uwsgi_app {
        least_conn;
        server 192.168.1.85:5000;
        server 192.168.1.89:5000;
        server 192.168.1.82:5000;
        server 192.168.1.125:5000;
        server 192.168.1.86:5000;
        server 192.168.1.122:5000;
        server 192.168.1.128:5000;
        server 192.168.1.131:5000;
        server 192.168.1.137:5000;
    }

    server {
        listen 80;
        server_name localhost;

        location /addmedia {
            include uwsgi_params;
            uwsgi_read_timeout 5s;
            proxy_read_timeout 5s;
            uwsgi_pass media;
        }

        location /media {
            include uwsgi_params;
            uwsgi_read_timeout 5s;
            proxy_read_timeout 5s;
            uwsgi_pass media;
        }

        location /search {
            include uwsgi_params;
            uwsgi_read_timeout 5s;
            proxy_read_timeout 5s;
            uwsgi_pass search;
        }

        location /time-search {
            rewrite /time-search(.*) /times break;
            include uwsgi_params;
            uwsgi_pass search;
        }

        location /item {
            include uwsgi_params;
            uwsgi_read_timeout 5s;
            proxy_read_timeout 5s;
            if ($request_method = DELETE) {
                uwsgi_pass media;
            }
            if ($request_method = GET) {
                uwsgi_pass uwsgi_app;
            }
            if ($request_method = POST) {
                uwsgi_pass uwsgi_app;
            }
        }

        location / {
            include uwsgi_params;
            uwsgi_read_timeout 5s;
            proxy_read_timeout 5s;
            uwsgi_pass uwsgi_app;
        }
    }
}
And my uwsgi ini is:
[uwsgi]
chdir = /home/ubuntu/app/
module = app
callable = app
master = true
processes = 25
socket = 0.0.0.0:5000
socket-timeout = 5
die-on-term = true
home = /home/ubuntu/app/venv/
uid = 1000
buffer-size = 65535
single-interpreter = true
Any insights as to the cause of this problem would be greatly appreciated.
Upvotes: 3
Views: 3418
Reputation: 536
So, I think I figured this out. From reading the nginx docs (https://www.nginx.com/blog/using-nginx-logging-for-application-performance-monitoring/), there are three timing metrics to pay attention to: $upstream_response_time, $request_time, and $upstream_connect_time. I had been focused only on the difference between $upstream_response_time and $request_time.
However, $upstream_response_time only covers the time between the upstream accepting the request and returning a response. It doesn't include $upstream_connect_time, the time it takes to establish a connection to the upstream server. And in the context of uwsgi this matters a lot, because if no worker is available to accept a request, the request sits in a backlog. I think the time a request waits in that backlog counts toward $upstream_connect_time rather than $upstream_response_time in nginx, because uwsgi hasn't read any of the request bytes yet.
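The effect of a saturated worker pool can be shown with a toy simulation (illustrative only: it assumes a fixed 100 ms service time and 25 workers, mirroring the uwsgi config above, with requests arriving faster than the pool can drain them; the function name is made up):

```python
import heapq

def queue_waits(n_requests, n_workers, interarrival, service):
    """Simulate a FIFO worker pool; return the queue wait (ms) per request.

    Each request is served in exactly `service` ms (the 'upstream response
    time'); the wait before a worker is free is the extra time that shows up
    in request_time but not in upstream_response_time.
    """
    free_at = [0.0] * n_workers  # min-heap of times at which workers free up
    heapq.heapify(free_at)
    waits = []
    for i in range(n_requests):
        arrival = i * interarrival
        start = max(arrival, heapq.heappop(free_at))
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service)
    return waits

# 25 workers at 100 ms/request handle 250 req/s; arrivals every 2 ms = 500 req/s,
# so the queue wait grows by 50 ms with every batch of 25 requests,
# while every request is still *served* in exactly 100 ms.
waits = queue_waits(200, 25, 2.0, 100.0)
```

This matches the symptom in the question: the upstream looks fast, but overall request time balloons.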
Unfortunately, I can't be 100% certain, because I never got a "slow" run while I was logging $upstream_connect_time. But the only changes that improved my score were "make uwsgi faster" changes (devoting more cores to searching, increasing the replication factor in the DB to speed up searches)... So yeah, the answer turned out to be simply increasing the apps' throughput.
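For anyone hitting the same wall: the uwsgi listen queue can be observed directly rather than inferred from nginx timings. A sketch of the relevant ini additions (both are standard uwsgi options; the stats address is just an example, and the default backlog size is quite small):

```
[uwsgi]
; ... existing settings from the ini above ...
listen = 1024           ; size of the socket listen backlog (default is 100)
stats = 127.0.0.1:9191  ; expose a JSON stats socket; view live with `uwsgitop 127.0.0.1:9191`
```

If the stats output shows requests piling up in the listen queue during load, the backlog theory is confirmed without needing a "slow" nginx log run. Note that raising `listen` above the kernel's net.core.somaxconn limit requires raising that sysctl too.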
Upvotes: 4