ittupelo
ittupelo

Reputation: 901

504 errors from Elastic Load Balancer using Tomcat

I have a application running on multiple EC2 instances and served by Apache Tomcat. I've set up an AWS Elastic Load Balancer in front of the application, and everything basically works as expected. However, I will occasionally get a random 504 timeout error from the ELB. This doesn't seem to be related to load, as I've seen the errors under light load and heavy load. Also, it doesn't seem to occur in any regular pattern or situation.

Earlier in my testing, I was getting 504 errors because my application was taking longer to respond than the default 60 second timeout on ELB. I resolved that by bumping up the ELB timeout to the level necessary for my app. However, the 504 errors I'm getting now are happening very quickly. So, for example, one error I saw was on a request with a response time of about a second. It seems odd to be getting a timeout error when the request can't possibly have timed out on the application server.

This may be a similar issue to this question, though I couldn't quite tell from the information presented. Also, I don't have an additional load balancer in the mix, just ELB straight to Tomcat.

Upvotes: 9

Views: 7038

Answers (2)

ittupelo
ittupelo

Reputation: 901

So, after some more digging, I've found the issue. This page was helpful in solving the mystery by explaining some details about idle and keepalive timeouts:

There are two immediate causes for receiving a 504 from an ELB:

  1. The application actually took longer than the ELB's connection timeout to respond. This is a slow timeout — the 504 will typically be returned after a number of seconds, with the default for an ELB being 60 seconds. In this case, it is necessary either to increase the ELB's connection timeout, or improve application performance.
  2. The application did not respond to the ELB at all, instead closing its connection when data was requested. This is a fast timeout — the 504 will typically be returned in a matter of milliseconds, well under the ELB's timeout setting.

The first scenario was what I had seen and resolved by raising the ELB timeout. The second scenario describes the confusing behavior I was seeing after raising the ELB timeout. My log files had the "-1 -1 -1" pattern like the example logs from the article:

2015-12-11T13:42:07.736195Z my-elb 10.0.0.1:59893 - -1 -1 -1 504 0 0 0 "GET http://my-elb/ HTTP/1.1" "curl/7.19.7" - -

From the conclusion:

In short, an ELB's connection timeout must be set lower than both the application's idle and keepalive timeouts to prevent spurious 504s from being generated.

At some point during development before I started using ELB, I set the Tomcat timeout such that it happened to be higher than the default ELB timeout. When I bumped up the ELB timeout, I made it higher than the connectionTimeout I had set in Tomcat. Raising connectionTimeout to be slightly higher than my new ELB timeout got rid of the mystery 504 errors. So, I've now gotten rid of both the "slow" and "fast" timeout errors.

Tomcat also has a keepAliveTimeout setting which defaults to be the same as connectionTimeout if not set. I didn't have it set, so modifying connectionTimeout was enough to resolve my issue.

Upvotes: 5

Tom Harrison
Tom Harrison

Reputation: 14068

The ELB is not likely to be the cause of a problem, but instead showing that you have one. The 504 error is Gateway Timeout which occurs when the server (in this case Tomcat) does not respond quickly enough.

(I have been using ELBs for extremely high load services for many years, and do not agree with the answer to the link to other SO answer. While it is technically true, and may be true with extremely high bursting rates like thousands of requests in a second, unless your volume is this high, I would look at your application, first.)

The most obvious test to confirm it's not the ELB is to test requests directly against one of the Tomcat servers in your cluster. If you cannot route to the Tomcat instances, you could try curl to localhost from the instance you want to test.

Note also that there is a Health Check setting for ELBs and these allow you to set certain rules defining whether the server is healthy -- if not, the ELB will remove it from the cluster until it is healthy again. Health can include timely response. Look at CloudWatch for the ELB to see if there have been unhealthy instances recently.

If you were seeing 504 in development, and now it's more frequent, I would guess that this is actually a load or performance issue. The most typical is that Java gets into some garbage collection thrashing issue due to a problem with the underlying application. Look at CloudWatch metrics for your EC2 instances to see if memory or CPU is high or spikey.

Upvotes: 0

Related Questions