jamesfzhang

Reputation: 4473

Occasional Rack::Timeout::RequestTimeoutException on Heroku

I notice periods of 1-5 minutes where anywhere between 1% and 20% of requests raise Rack::Timeout::RequestTimeoutException. This happens roughly once every couple of hours. There are no N+1 queries, and there aren't any missing indices AFAIK. We're on a Standard-7 Postgres plan with 120 GB RAM, and connections aren't maxed out yet. What else can I pry around in to see what the problem is? Thanks!

Here's an example where the request queue time spikes.

Sample logs:

source=DATABASE
sample#current_transaction=160483065.0
sample#db_size=35361812244.0bytes
sample#tables=29
sample#active-connections=60
sample#waiting-connections=0
sample#index-cache-hit-rate=0.99897
sample#table-cache-hit-rate=0.99893
sample#load-avg-1m=0.07375
sample#load-avg-5m=0.06
sample#load-avg-15m=0.05375
sample#read-iops=0
sample#write-iops=0
sample#memory-total=125650852.0kB
sample#memory-free=75423472.0kB
sample#memory-cached=46423528.0kB
sample#memory-postgres=485000.0kB

[graph: request queue time spike]

Upvotes: 2

Views: 1137

Answers (1)

Luke Exton

Reputation: 3676

I'm assuming you are running multiple web dynos and no worker dynos, and that you are using a Unicorn/Goliath-style server that is set up to serve multiple concurrent connections?
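
For reference, here is a minimal Unicorn config sketch of the kind I mean; the worker count and timeout are placeholders to tune to your dyno size, not recommendations:

    # config/unicorn.rb -- minimal sketch, numbers are placeholders
    worker_processes Integer(ENV.fetch("WEB_CONCURRENCY", 3))
    timeout 30          # Unicorn's own hard kill for a stuck worker
    preload_app true    # load the app once, then fork workers

    before_fork do |server, worker|
      # drop the inherited Postgres connection so each worker opens its own
      defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
    end

    after_fork do |server, worker|
      defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
    end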

You aren't seeing Postgres or Ruby spikes; you are seeing queuing spikes. Without seeing your actual setup, you could be a victim of the random selection algorithm Heroku uses for routing.

Do you have any long-running tasks you could push to a background worker that other requests might be queuing behind? Or is there a request that hangs, causing everything behind it to time out? These can be hard to spot in the logs themselves.
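
If you do find such a request, the usual fix is to enqueue the slow part rather than doing it inline. A rough Sidekiq-style sketch; the SlowReportWorker class and Report#generate! are hypothetical names, just to show the shape:

    # app/workers/slow_report_worker.rb -- sketch; class and model names are made up
    class SlowReportWorker
      include Sidekiq::Worker

      def perform(report_id)
        # the slow work now runs on a worker dyno, not in the request cycle
        Report.find(report_id).generate!
      end
    end

    # in the controller, enqueue and return immediately
    SlowReportWorker.perform_async(report.id)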

A few solutions, as outlined in the article above: add a hard timeout on requests, forcing any long-running request to die. This will make your logs show the exact location of an error rather than just its knock-on effects.
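
Since the exception in your question comes from rack-timeout, you presumably already have it installed, so tightening its limit is one way to get that hard cutoff. A sketch, with the caveat that the setting name has changed across rack-timeout versions:

    # config/initializers/rack_timeout.rb -- sketch; check your gem version,
    # the class-level setter was renamed and later removed
    Rack::Timeout.service_timeout = 15   # newer 0.x versions
    # Rack::Timeout.timeout = 15         # older versions used this name
    # recent releases prefer the env var: RACK_TIMEOUT_SERVICE_TIMEOUT=15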

The graph you have can sometimes be a touch difficult to interpret because of sampling rates, especially without diving into correlated graphs on the dyno itself. Check out Graphite to see metrics on a per-dyno basis.

A few other things can block a worker without you noticing it:

DNS queries. How do you look up hostnames for external services, database instances, etc.? This can be hard to spot and would likely show up under the Ruby part of the graph, so it might not be the issue here.
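
A quick way to check the DNS angle is to resolve the hosts you depend on with an explicit short timeout and watch how long each lookup takes; the hostnames below are placeholders:

    require "resolv"

    resolver = Resolv::DNS.new
    resolver.timeouts = 2   # seconds -- fail fast instead of blocking a worker

    %w[api.example.com your-db-host.example.com].each do |host|
      begin
        started = Time.now
        address = resolver.getaddress(host)
        puts "#{host} -> #{address} in #{(Time.now - started).round(3)}s"
      rescue Resolv::ResolvError => e
        puts "#{host} failed: #{e.message}"
      end
    end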

Connection pooling. This seems unlikely in your case since you have already ruled it out, but check the number of workers against the connections available.
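
The arithmetic to sanity-check is roughly web dynos x Unicorn workers per dyno x ActiveRecord pool size per process, which has to stay under your Postgres plan's connection limit. A small sketch you could run from a Heroku console (method names per reasonably recent ActiveRecord):

    # run via: heroku run rails console
    pool = ActiveRecord::Base.connection_pool
    puts "configured pool size: #{pool.size}"
    puts "connections in use:   #{pool.connections.count(&:in_use?)}"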

Upvotes: 1
