Reputation: 710
This is on Heroku, but as of now I don't think it's related to that specifically. We also use the rack-timeout gem, which should be timing out requests after 30 seconds. Please see below for why it looks like that is not happening in some cases.
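For reference, the timeout setup is just the stock rack-timeout wiring with the service timeout set to 30 seconds. A sketch is below, assuming a recent rack-timeout version that reads its settings from env vars (older versions use Ruby-level setters instead); our exact wiring may differ slightly:

# Gemfile -- rack-timeout self-inserts into the Rails middleware stack
gem "rack-timeout"

# The 30-second service timeout is then set via config, e.g. on Heroku:
#   RACK_TIMEOUT_SERVICE_TIMEOUT=30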
We have a very large pool of Puma workers/threads: 5 workers with 20 threads each. At the time of this issue we had 5 Heroku dynos running, so that is 500 total Puma threads available to serve requests. With that many, I find it very hard to believe that a few long-running requests caused such a high number of 503s.
I set up Puma metrics and added them to a custom logging thread (a rough sketch of that thread follows the logs below). Here is a snapshot from this morning, during an uptick in 503s, where the Puma backlog numbers are bizarre:
Oct 24 09:06:56 app app[web] info "source=web.1.worker.28 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:06:56 app app[web] info "source=web.1.worker.54 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:06:56 app app[web] info "source=web.1.worker.72 sample#puma.backlog=99 sample#puma.running=20"
Oct 24 09:06:56 app app[web] info "source=web.1.worker.90 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:06:57 app app[web] info "source=web.3.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:06:57 app app[web] info "source=web.3.worker.29 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:06:57 app app[web] info "source=web.3.worker.40 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:06:57 app app[web] info "source=web.3.worker.51 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:06:57 app app[web] info "source=web.3.worker.62 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.4.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.4.worker.29 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.4.worker.40 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.4.worker.51 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.4.worker.62 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.2.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.2.worker.34 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.2.worker.50 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.2.worker.66 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:01 app app[web] info "source=web.2.worker.82 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:05 app app[web] info "source=web.5.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:05 app app[web] info "source=web.5.worker.29 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:05 app app[web] info "source=web.5.worker.47 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:05 app app[web] info "source=web.5.worker.69 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:05 app app[web] info "source=web.5.worker.74 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:06 app app[web] info "source=web.1.worker.18 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:06 app app[web] info "source=web.1.worker.28 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:06 app app[web] info "source=web.1.worker.54 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:06 app app[web] info "source=web.1.worker.72 sample#puma.backlog=99 sample#puma.running=20"
Oct 24 09:07:06 app app[web] info "source=web.1.worker.90 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:07 app app[web] info "source=web.3.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:07 app app[web] info "source=web.3.worker.29 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:07 app app[web] info "source=web.3.worker.40 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:07 app app[web] info "source=web.3.worker.51 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:07 app app[web] info "source=web.3.worker.62 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.2.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.2.worker.34 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.2.worker.50 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.2.worker.66 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.2.worker.82 sample#puma.backlog=1 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.4.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.4.worker.29 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.4.worker.40 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.4.worker.51 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:11 app app[web] info "source=web.4.worker.62 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:15 app app[web] info "source=web.5.worker.18 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:15 app app[web] info "source=web.5.worker.29 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:15 app app[web] info "source=web.5.worker.47 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:15 app app[web] info "source=web.5.worker.69 sample#puma.backlog=0 sample#puma.running=20"
Oct 24 09:07:15 app app[web] info "source=web.5.worker.74 sample#puma.backlog=0 sample#puma.running=20"
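For context, the logging thread is roughly the sketch below. This is a minimal reconstruction rather than our exact code: it assumes a Puma version that exposes Puma.stats (a JSON string in 3.x/4.x cluster mode), that the thread runs in the master process (e.g. started from config/puma.rb), and that the web.N prefix comes from Heroku's DYNO env var.

# Minimal sketch of the metrics logging thread; assumptions noted above.
require "json"

Thread.new do
  loop do
    stats = JSON.parse(Puma.stats)        # cluster-mode stats as a JSON string
    dyno  = ENV.fetch("DYNO", "web.?")    # e.g. "web.1" on Heroku
    Array(stats["worker_status"]).each do |w|
      status = w["last_status"] || {}
      puts "source=#{dyno}.worker.#{w['pid']} " \
           "sample#puma.backlog=#{status['backlog']} sample#puma.running=#{status['running']}"
    end
    sleep 10
  end
end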
If one worker (for example web.1.worker.72) has a backlog of >= 1, why is Puma putting more requests into it when, as you can see, we have other workers with no backlog?
Why would this happen for multiple minutes? The full logs show that worker 72 on dyno 1 held a backlog of 99+ for over 3 minutes:
web.1.worker.72 sample#puma.backlog=99
Note it gets even worse:
Oct 24 09:08:26 app app[web] info "source=web.1.worker.72 sample#puma.backlog=153 sample#puma.running=20"
Then it finally gets better:
Oct 24 09:08:36 app app[web] info "source=web.1.worker.72 sample#puma.backlog=68 sample#puma.running=20"
Then back to healthy:
Oct 24 09:08:46 app app[web] info "source=web.1.worker.72 sample#puma.backlog=0 sample#puma.running=20"
The 503s stopped at 9:08 AM and we got back to healthy, so it was certainly tied to this odd backlog event lasting multiple minutes. Does anyone have any ideas?
It seems that scaling out will not help at all...
Upvotes: 1
Views: 1019
Reputation: 710
After speaking with a few folks, it turns out this was a bug in Puma that was fixed in version 3.12 (we were on 3.11 when this question was first written). After upgrading to 4.2, I can confirm that it works as expected.
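For anyone hitting the same issue, the fix on our side was simply bumping the gem to a release that includes the fix (anything >= 3.12 should do; we went to 4.2):

# Gemfile -- move off the affected 3.11 release
gem "puma", "~> 4.2"

followed by bundle update puma and a redeploy.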
Upvotes: 0