Reputation: 715
We've recently been experiencing unexplained latency issues in our AWS setup, as reflected by the ELB Latency metric.
Our setup consists of 3 EC2 c1.medium machines (each running NGINX, which talks to a uWSGI handler on the same machine) behind an ELB.
Our traffic peaks in the morning and evening, but that doesn't explain what we're seeing, i.e. latency spikes of 10 seconds well into the traffic peak.
Our NGINX logs and uWSGI stats show that we are not queuing any requests and that response times are solidly under 500 ms.
The ELB listens on port 8443 and forwards to port 8080.
NGINX has the following config on each EC2 instance:
worker_processes 2;
pid /var/run/nginx.pid;

events {
    worker_connections 4000;
    multi_accept on;
    use epoll;
}

http {
    server {
        reset_timedout_connection on;
        access_log off;
        listen 8080;

        location / {
            include uwsgi_params;
            uwsgi_pass 127.0.0.1:3031;
        }
    }
}
I was wondering if someone has experienced something similar or could perhaps offer an explanation.
Thank you.
Upvotes: 2
Views: 5295
Reputation: 7906
When used alongside nginx, AWS Classic Load Balancer will return a 504 Gateway Timeout if the client fails to send the entire request headers & body before the ELB Idle Timeout elapses or before the nginx client_body_timeout (default: 60s) elapses, due to a patchy mobile connection, for example. This is especially prevalent when clients send large request bodies and time out mid-send.
This is extremely counter-intuitive, as 5xx HTTP errors should always indicate an issue on the server/backend side, not the client side. The ELB should instead be returning an HTTP 408: Request Timeout; however, due to an nginx bug (#1005 client_body_timeout does not send 408 as advertised), a 408 is never sent back by nginx. The connection gets terminated, and the ELB sends back a 504.
The ELB access logs will contain entries with - as the backend instance, as the request never actually reached any backend since the client never finished sending the entire request:
1.2.3.4:59364 - -1 -1 -1 504 0 8192 0 "POST https://elb.website.com:443/api-path HTTP/1.1" "Dalvik/2.1.0 (Linux; U; Android 13;)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
Looking at the Latency metric for the ELB and setting the statistic to Maximum instead of Average, you'll notice that the longest request takes roughly the same time as the ELB Idle Timeout, meaning that the request timed out. However, it might not be your backend that timed out - it may be the client, which failed to send the entire request headers & body before the Idle Timeout elapsed.
This means that if your ELB serves mobile devices connected to patchy networks, your Average Latency metric will be impacted by slower requests sent by these devices, and the ELB 5xxs metric will report 5xx errors sent to clients who failed to send their request in time.
This isn't documented anywhere in the official AWS docs, as far as I'm aware. The docs always point to a timeout problem in the backend itself, not to the client having a patchy connection.
You can test this out on your own ELB using the following Node.js script, which creates a sample POST request with a Content-Length 500 bytes longer than the body actually sent, forcing the ELB to wait for the client to send more data and inevitably causing it to close the connection with a 504 Gateway Timeout:
const https = require('https');

const postData = JSON.stringify({
  name: 'John Doe',
  age: 30
});

const options = {
  hostname: 'YOUR_ELB_HOSTNAME', // Replace with your ELB hostname, no https://
  port: 443,
  path: '/sample-path',
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(postData) + 500
  }
};

const req = https.request(options, (res) => {
  let data = '';

  // Collect response data
  res.on('data', (chunk) => {
    data += chunk;
  });

  // End of response
  res.on('end', () => {
    console.log('Status code: ' + res.statusCode + ' ' + res.statusMessage);
  });
});

// Handle errors
req.on('error', (e) => {
  console.error(`Problem with request: ${e.message}`);
});

// Write the data to the request body
req.write(postData);

// End the request
req.end();
Replace YOUR_ELB_HOSTNAME with your own ELB hostname.
When running this script and waiting for the Idle Timeout to elapse, the console will eventually print a 504 GATEWAY_TIMEOUT status code, and the Latency and ELB 5xxs graphs in the AWS Console will indeed count the request as a 5xx and raise your overall Average Latency for the ELB.
Workarounds:
One way around this is to lower the nginx client_body_timeout and client_header_timeout values below the ELB Idle Timeout. The request will then time out earlier than the ELB Idle Timeout and impact the Latency graph less; however, the ELB will still throw a 5xx.
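For example, assuming the default ELB Idle Timeout of 60 seconds, a minimal sketch like the following (the 30s values are illustrative, not from the original answer) makes nginx give up on a stalled client before the ELB does:
# Drop slow clients before the (default 60s) ELB Idle Timeout is reached
client_header_timeout 30s;
client_body_timeout 30s;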
A better alternative is reducing the client_max_body_size from the default 1mb to something more sensible based on your expected request body size. When clients try to send large bodies, the request will immediately be rejected, so nginx won't end up waiting for the client_body_timeout to elapse before closing the socket:
# Override max POST body size, set to 4 KB (4096 bytes)
client_max_body_size 4k;
You can also reduce the client_max_body_size for a specific path using a new location {} block declaration.
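A minimal sketch of such a per-path override, assuming a hypothetical /api/events path that only ever receives small JSON payloads (the path and limit are illustrative; the uwsgi_pass matches the setup from the question):
location /api/events {
    # Tighten the body size limit for this small-JSON path only
    client_max_body_size 4k;

    include uwsgi_params;
    uwsgi_pass 127.0.0.1:3031;
}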
With this approach, the request won't end up increasing the overall Average Latency, nor will it count as a 5xx, as nginx will immediately return 413 Request Entity Too Large.
Another alternative is setting the nginx proxy_request_buffering directive to off and configuring a request & header timeout in the application, so that your application sends back a 408 response if the timeout is exceeded. In Node.js it's possible to do so using the following code:
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('okay');
});

// Wait for the client body for up to 50 seconds
server.timeout = 50000;

server.on('timeout', (socket) => {
  // Send HTTP 408 before closing the connection
  socket.end('HTTP/1.1 408 Request Timeout\r\n\r\n');
});

// Port is illustrative; use whatever port your backend listens on
server.listen(8080);
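On the nginx side, a minimal sketch under the question's uwsgi_pass setup could look like the following (the answer mentions proxy_request_buffering, which applies to proxy_pass backends; for uwsgi_pass the equivalent directive is uwsgi_request_buffering):
location / {
    # Stream the request body to the application instead of buffering it in nginx,
    # so the application's own timeout can kick in and reply with a 408
    uwsgi_request_buffering off;

    include uwsgi_params;
    uwsgi_pass 127.0.0.1:3031;
}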
With this approach, the ELB will return a 408 Request Timeout instead of a 504 Gateway Timeout. However, the Latency graph will still depict the slow request, and this will affect the average latency.
Another option is to migrate to an Application Load Balancer. ALBs will hang up the socket instead of counting the request as a 504, and the slow request won't bump up the Target Response Time, which is the ALB's equivalent of the Latency metric. However, ALBs are generally more expensive than CLBs.
Another option is to use Apache instead of nginx. Apache actually does send back a 408 when the client times out, instead of terminating the connection abruptly like nginx.
You may also consider blocking repeat offenders' IP addresses via VPC Network ACL inbound deny rules. However, mobile devices change their IP address frequently as they move from one cell tower to another.
If you're monitoring your ELB 5xxs metric, you might want to increase the alarm threshold to > 1 so that you aren't alerted every time a request from a patchy mobile device times out. As for the Latency metric, you might want to switch to IQM (Interquartile Mean) instead of Average so that the client timeouts don't affect your ELB Latency as a whole.
Upvotes: 0
Reputation: 61521
I'm not sure if it's documented somewhere, but we've been using ELBs for quite a while. In essence, ELBs are EC2 instances sitting in front of the instances you are load balancing, and it's our understanding that when your ELB starts experiencing more traffic, Amazon does some magic to turn that ELB instance from, say, a c1.medium into an m1.xlarge.
So it could be that when you start to see peaks, Amazon is doing some transitioning between the smaller and the larger ELB instance, and you are seeing those delays.
Again, customers don't know what goes on inside Amazon, so for all you know they could be experiencing heavy traffic at the same time you have your peaks, and their load balancers are going berserk.
You could probably avoid these delays by over-provisioning, but who wants to spend more money?
There are a couple of things I would recommend if you have the time and resources:
Set up an HAProxy instance in front of your environment (some large instance) and monitor your traffic that way. HAProxy has a command-line (or web) utility that allows you to see stats. Of course, you also need to monitor that instance for things like CPU and memory.
You may not be able to do this in production, in which case you'll have to run test traffic through it. I recommend using something like loader.io. Another option is to try to send part of your traffic to an HAProxy instance, perhaps using GSLB (if your DNS provider supports it).
Upvotes: 2