Reputation: 1109
I'm new to Rails and just deployed my first app to Heroku (on the free tier). I set up a free trial of New Relic and set availability monitoring to ping my /register/ URL every 1 minute. I'm running Rails 3.2.13 and Ruby 1.9.3.
My app has basically no users and almost no traffic (about 2 requests per minute, mostly coming from New Relic). I have no background services or external dependencies. The data model is simple, and no query takes more than a few hundred milliseconds.
I'm getting complete outages lasting exactly 15 minutes every few hours.
Since Heroku only keeps 1500 lines of logs, I don't have data for every incident, but here's the log for the second blip (my graph is UTC-4; Heroku logs are UTC).
Full log: https://gist.github.com/jbinto/5495226/raw/ba61ec16d9655287466cfbb9328f59c0171b2df7/heroku.log
Summary
Is this PG::Error the cause of my problems, or merely a symptom? Some Googling turned up a discussion about Postgres timeouts on the starter tier, with some admonishment for not using the production tier: https://groups.google.com/forum/?fromgroups=#!topic/heroku/a6iviwAFgdY
More StackOverflow: Postgres + Heroku SSL SYSCALL error
And a Rails ticket about auto-reconnect: https://github.com/rails/rails/issues/9421
This looks like a good clue, but nobody seems to have solved it. It seems there's some flakiness in Heroku's Postgres, and Rails < 4 doesn't recover well from it.
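One workaround I'm considering (untested, and the class name `RetryDeadConnections` and single-retry policy are my own, not anything from the Rails ticket) is a small Rack middleware that retries a request once after clearing dead connections from the pool:

```ruby
# Hypothetical middleware sketch for Rails 3.x / ActiveRecord.
# If a request dies because the Postgres connection went away,
# throw away the stale connections and retry the request once.
class RetryDeadConnections
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue ActiveRecord::StatementInvalid
    # Drop any dead connections so the retry checks out a fresh one.
    ActiveRecord::Base.clear_active_connections!
    @app.call(env)
  end
end
```

It would be registered with something like `config.middleware.use RetryDeadConnections` in `config/application.rb`. No idea yet if this papers over the real problem.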
Any ideas? I'll be opening a Heroku support ticket, but not sure if I'll get anywhere as a free user.
Upvotes: 2
Views: 496
Reputation: 22240
The answer is here:
T01:02:21.144033+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=HEAD path=/register host=www.puckpicks.ca fwd="50.18.57.7" dyno=web.1 connect=1ms service=30000ms status=503 bytes=0
You generally see this pattern when one long-running action starts hogging the queue for all further requests.
The Heroku router will drop a long-running request after 30 seconds, but the dyno behind it will continue processing the request until completion. The router is unaware of it, though, so it'll dispatch new requests to that busy dyno. This effect tends to compound, and you'll see queueing in New Relic and eventually H12 errors even for unrelated URLs, like static assets.
You may want to install something like rack-timeout, which ensures a long-running request is dropped at the dyno level as well. Specifically, rack-timeout raises a Timeout::Error when that happens. https://github.com/kch/rack-timeout
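Setting it up is minimal. A sketch (the 20-second value is an arbitrary choice of mine, picked to stay safely under Heroku's 30-second router limit):

```ruby
# Gemfile
gem "rack-timeout"

# config/initializers/rack_timeout.rb
# The gem inserts its middleware into Rails automatically;
# this just lowers the limit from the gem's default.
Rack::Timeout.timeout = 20  # seconds
```

After a deploy, a request that exceeds the limit dies with a Timeout::Error in your logs instead of silently tying up the dyno.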
With that in place, the compound effect is less likely to occur, but long-running actions still need to be addressed. New Relic is a great tool for getting the visibility into your app to identify them. You can then optimize them and make sure they finish within a reasonable time; I suggest keeping all requests under 500ms. If they perform any inherently long tasks, offload those to a background worker.
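For example, with something like delayed_job (one common choice on Heroku; the mailer and method names below are hypothetical):

```ruby
# Gemfile
gem "delayed_job_active_record"

# In a controller action: enqueue the slow work instead of doing it inline.
# UserMailer.welcome_email is a stand-in for whatever your slow task is.
UserMailer.delay.welcome_email(@user.id)
```

Then run the worker in a separate dyno with a Procfile entry like `worker: bundle exec rake jobs:work`. The request returns immediately, and the worker dyno does the slow part on its own time.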
If you have a higher-traffic production application, I also suggest using Unicorn if you're not already, so your app can process concurrent requests. This gives you more concurrency, helps cut back on queue time, and gets more overall performance out of each dyno.
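A minimal Unicorn setup for Heroku might look like this (a sketch along the lines of Heroku's own guidance; `WEB_CONCURRENCY=3` is just a starting point you'd tune to your dyno's memory):

```ruby
# config/unicorn.rb
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 15
preload_app true

before_fork do |server, worker|
  # Close the master's Postgres connection before forking...
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |server, worker|
  # ...so each worker establishes its own connection.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```

And point your Procfile at it: `web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb`.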
Upvotes: 3