Ben Futterleib

Reputation: 146

Intermittent 502 gateway errors with AWS ALB in front of ECS services running express / nginx

Background:

We are running a single-page application served via nginx, with a Node.js (v12.10) backend running Express. Both run as containers on ECS. We currently have three t3a.medium container instances, with the API and web services each running 6 replicas across them. An ALB handles load balancing and request routing. We run three subnets across three AZs, with the load balancer associated with all three and the instances spread across the three AZs as well.

Problem:

We are trying to get to the root cause of some intermittent 502 errors that are appearing for both front and back end. I have downloaded the ALB access logs and the interesting thing about all of these requests is that they all show the following:

- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
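
To isolate the entries that match this pattern, a small filter over the access log lines can help. This is only a minimal sketch: it assumes the standard space-separated ALB access log layout, in which the first nine fields are type, time, elb, client:port, target:port, request_processing_time, target_processing_time, response_processing_time and elb_status_code. The helper name `isSuspect502`, the 0.005 threshold and the sample line are made up for illustration.

```javascript
// Hypothetical helper: flags ALB access log entries that look like the
// intermittent 502s described above (ELB status 502, near-zero
// target_processing_time, response_processing_time of -1).
function isSuspect502(line) {
  // The first nine fields contain no quoted strings, so a plain split works.
  const fields = line.split(" ");
  const targetProcessingTime = Number(fields[6]);
  const responseProcessingTime = Number(fields[7]);
  const elbStatusCode = fields[8];
  return (
    elbStatusCode === "502" &&
    responseProcessingTime === -1 &&
    targetProcessingTime >= 0 &&
    targetProcessingTime <= 0.005 // assumed cut-off for "basically zero"
  );
}

// Synthetic sample line (not taken from real logs), cut off after the request field.
const sampleLine =
  "https 2019-10-01T12:00:00.000000Z app/example-alb/0123456789abcdef " +
  "203.0.113.10:51234 10.0.1.25:32768 0.000 0.001 -1 502 - 512 293 " +
  '"GET https://example.com:443/api/health HTTP/1.1"';

console.log(isSuspect502(sampleLine));
```

Running each downloaded log line through a filter like this separates the near-zero cases from the high target_processing_time failures seen under load.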

At the time of these errors I can see that there were healthy targets available.

Now I know that some people have had issues like this when the keepAlive time on the server side was shorter than on the ALB side, so connections were being forcibly closed that the ALB then tried to reuse (which is in line with the AWS troubleshooting guidelines). However, the keepAlive times for our back end are currently set to double that of our ALB. Also, the requests themselves can be replayed via Chrome dev tools and they succeed (I'm not sure if this is a valid way to check for a malformed request, but it seemed reasonable).

I am very new to this area, and if anyone has suggestions as to where to look or what sort of tests to run to help me pinpoint this issue, it would be greatly appreciated. I have run some load tests on certain endpoints and reproduced the 502 errors. However, the errors under heavy load differ from the intermittent ones I have seen in our logs in that the target_processing_time is quite high, so to my mind this is another issue altogether. At this stage I would like to start by understanding the errors that show a target_processing_time of basically zero.

Upvotes: 4

Views: 2697

Answers (1)

cheeseandcereal

Reputation: 151

I wrote a blog post about this a bit over a year ago that's probably worth taking a look at (the cause is a behavior change in Node.js 8+):

https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/

TL;DR is you need to set the nodejs http.Server keepAliveTimeout (which is in ms) to be higher than the load balancer's idle timeout (which is in seconds).
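
As a rough illustration of that setting, here is a minimal sketch. It assumes an ALB idle timeout of 60 seconds and a 5-second margin; both numbers are assumptions to adjust for your own load balancer. It uses a plain `http.Server` so it stays self-contained; with Express, the object returned by `app.listen()` is an `http.Server` as well, so the same property applies.

```javascript
const http = require("http");

// Assumption: the ALB idle timeout is 60 seconds. Check your load balancer's setting.
const ALB_IDLE_TIMEOUT_SECONDS = 60;

// Plain Node server for illustration only.
const server = http.createServer((req, res) => {
  res.end("ok");
});

// keepAliveTimeout is in milliseconds. Keeping it above the ALB idle timeout
// means the load balancer, not Node, is the side that closes idle connections.
server.keepAliveTimeout = (ALB_IDLE_TIMEOUT_SECONDS + 5) * 1000;

// server.listen(3000); // hypothetical port; enable in a real service
```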

Please also note that there is a separate thing called HTTP keep-alive, which sets an HTTP header and has absolutely nothing to do with this problem. Make sure you're setting the right thing.

Also note that there is currently a regression in Node.js where setting keepAliveTimeout may not work properly. That bug is being tracked here: https://github.com/nodejs/node/issues/27363 and is worth looking through if you're still having this problem (you may also need to set headersTimeout).
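
A minimal sketch of setting both values together, assuming the workaround is to keep headersTimeout above keepAliveTimeout (check the linked issue for the behavior on your Node.js version). The helper name `applyTimeouts`, the 60-second idle timeout and the margins are hypothetical.

```javascript
const http = require("http");

// Hypothetical helper: derives both timeouts from the load balancer's idle timeout.
function applyTimeouts(server, albIdleTimeoutSeconds) {
  // Keep-alive slightly longer than the ALB idle timeout (5 s margin is an assumption).
  const keepAliveMs = (albIdleTimeoutSeconds + 5) * 1000;
  server.keepAliveTimeout = keepAliveMs;
  // Assumed workaround: headersTimeout kept above keepAliveTimeout.
  server.headersTimeout = keepAliveMs + 1000;
  return server;
}

// Assumes a 60-second ALB idle timeout; the server is not started here.
const patched = applyTimeouts(http.createServer(), 60);
console.log(patched.keepAliveTimeout, patched.headersTimeout);
```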

Upvotes: 5
