Tim Perry

Reputation: 13256

Debugging a mysterious Ruby process crash

I have a Ruby process running continuously on an m3.medium EC2 instance, rendering content from a queue into S3. It dies occasionally (typically multiple times a day), with no obvious cause and no explanation.

The app is deployed by Elastic Beanstalk, and started with an ebextensions script which runs ruby --verbose app.rb and redirects stderr and stdout to files (originally without --verbose, but we've added that in the hope of getting more detail). After the process exits, there's nothing in either file indicating any error. The top level of the app looks like:

loop do
  begin
    do_processing
  rescue Exception => e
    puts "Error! #{e}"
  end
end

so it's unlikely to be exiting by itself or from exceptions (I think).

The server remains running throughout and isn't running out of memory. Sometimes it crashes while at peak load (up to 100% of the CPU), but not always.

Are there Ruby tools available to get more information on why a process has quit? Is it possible I'm hitting some other EC2 or Ruby limit which shuts down the process? What can I do to get more information on what's happened here?

Upvotes: 1

Views: 845

Answers (1)

Tim Perry

Reputation: 13256

We've now solved this: the problem turned out to be an actual stack overflow, deep in our codebase. We had a decorator leak: occasionally we wrapped our logger in another logging decorator, so every log call had to pass through more and more layers. Logging got slower and slower, and eventually the call stack got too deep and crashed the app. Fixing that fixed many things.

It seems plausible that a stack overflow like this isn't catchable in any normal way, and fails to push any output to the log because there's no stack left, giving a silent Ruby crash. That explains the behaviour tidily, and fixing the leak has stopped the crashes completely for us.
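To make the decorator leak concrete, here is a minimal sketch of the pattern, not our actual code. The class names BaseLogger and LoggingDecorator are made up for illustration, and the 1,000 wraps stand in for whatever repeatedly re-wrapped our logger. It measures how many extra stack frames a single log call costs once the wrappers have piled up:

```ruby
# Hypothetical classes for illustration only; not from the real codebase.
class BaseLogger
  # Returns the call-stack depth at the point the message is finally handled.
  def log(_message)
    caller.length
  end
end

class LoggingDecorator
  def initialize(inner)
    @inner = inner
  end

  # Each decorator adds one stack frame before delegating to what it wraps.
  def log(message)
    @inner.log("[decorated] #{message}")
  end
end

logger = BaseLogger.new
base_depth = logger.log("hello")

# The "leak": the existing logger is wrapped again instead of being reused.
1_000.times { logger = LoggingDecorator.new(logger) }
leaked_depth = logger.log("hello")

puts "extra stack frames per log call: #{leaked_depth - base_depth}"
```

Every wrap adds one frame (and a bit of work) to each log call, which matches the symptoms we saw: logging that gets gradually slower, followed by a stack that is eventually too deep.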

For future readers trying to debug something similar, if this isn't your answer: try running your app with post-mortem debugging (http://bashdb.sourceforge.net/ruby-debug.html#Post_002dMortem-Debugging) so that the debugger opens on crash, and see where it ends up. We found our app many, many layers deep in logging, and the problem quickly became obvious from there.
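A lighter-weight complement to a post-mortem debugger, and a different technique from the one linked above, is an at_exit hook that records how the process is ending. This is only a sketch using Ruby's standard at_exit and $!; the helper name describe_exit is made up for illustration. It cannot report anything if the process is killed outright (for example by SIGKILL), since no Ruby code gets to run in that case:

```ruby
# Hypothetical helper: turns the exception that is ending the process (if any)
# into a one-line description. `error` is whatever `$!` holds at exit.
def describe_exit(error)
  case error
  when nil
    "exited normally"
  when SystemExit
    "exit called with status #{error.status}"
  else
    "unhandled #{error.class}: #{error.message}"
  end
end

# Runs when the interpreter shuts down in an orderly way.
at_exit do
  $stderr.puts "[#{Process.pid}] #{describe_exit($!)}"
  $stderr.flush
end
```

If the error log still shows nothing at all after an exit, that in itself suggests the interpreter was terminated abruptly rather than unwinding through an exception.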

Upvotes: 1
