ChrisJ

Reputation: 2802

How to handle SIGTERM with resque-status in complex jobs

I've been using resque on Heroku, which will from time to time interrupt your jobs with a SIGTERM.

Thus far I've handled this with a simple:

def process(options)
  do_the_job
rescue Resque::TermException
  self.defer options
end

We've started using resque-status so that we can keep track of jobs, but the approach above breaks that: the job shows as completed when it has actually been deferred to another job.

My current thinking is that instead of deferring the current job in resque, there needs to be another job that re-queues jobs that have failed due to SIGTERM.

The trick comes in that some jobs are more complicated:

def process(options)
  do_part1 unless options['part1_finished']
  options['part1_finished'] = true
  do_part2
rescue Resque::TermException
  self.defer options
end

Removing the rescue and blindly retrying those jobs would cause an exception when do_part1 gets repeated.

Upvotes: 0

Views: 422

Answers (1)

ChrisJ

Reputation: 2802

Looking more deeply into how resque-status works, a possible workaround is to go straight to resque for the re-queue, using the same parameters that resque-status would use.

def process
  do_part1 unless options['part1_finished']
  options['part1_finished'] = true
  do_part2
rescue Resque::TermException
  Resque.enqueue self.class, uuid, options
  raise DeferredToNewJob
end

Of course, this is undocumented so may be incompatible with future releases of resque-status.
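
To make the flow concrete, here is a minimal sketch that runs without Resque or Redis. Everything except DeferredToNewJob is a stand-in I made up for illustration: TermException plays the role of Resque::TermException, QUEUE plays the role of the resque queue, and TwoPartJob, PART1_RUNS and the interrupt flag are hypothetical. It only shows that re-queuing with the same uuid and the updated options lets the second attempt skip part 1.

```ruby
# Stand-ins so the sketch runs without Resque or Redis.
class TermException < StandardError; end     # plays the role of Resque::TermException
class DeferredToNewJob < StandardError; end  # the custom exception from the answer

QUEUE = []       # plays the role of the resque queue
PART1_RUNS = []  # records every time part 1 actually executes

class TwoPartJob
  attr_reader :uuid, :options

  def initialize(uuid, options = {}, interrupt: false)
    @uuid = uuid
    @options = options
    @interrupt = interrupt
  end

  def perform
    unless options['part1_finished']
      PART1_RUNS << uuid                  # "do_part1"
      options['part1_finished'] = true    # checkpoint before moving on
    end
    raise TermException if @interrupt     # simulate SIGTERM arriving during part 2
    :completed
  rescue TermException
    QUEUE << [self.class, uuid, options]  # same uuid, updated options
    raise DeferredToNewJob
  end
end

# The first attempt is interrupted and defers itself.
begin
  TwoPartJob.new('abc123', {}, interrupt: true).perform
rescue DeferredToNewJob
  # expected: the job has re-queued itself
end

# A "worker" picks the job up again; part 1 is skipped this time.
klass, uuid, opts = QUEUE.shift
RESULT = klass.new(uuid, opts).perform
```

Because the checkpoint is written into options before the re-queue, the new job sees part1_finished and goes straight to part 2.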

There is a drawback: between that job failing and the new job picking it up, resque-status will still report the status of the first job. This is why I raise a new exception: otherwise the job status would show completed until the new worker picks up the old job, which may confuse processes that are watching and waiting for the job to finish.

By raising a new exception DeferredToNewJob, the job status will temporarily show failure, which is easier to work around at the front end, and the specific exception can be automatically cleared from the resque failure queue.
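
A rough sketch of that automatic clean-up follows. It assumes a failure backend that responds to count, all(offset, limit) and remove(index), and whose entries carry the exception class name under the 'exception' key. I believe Resque::Failure has this shape, but check it against your resque version before relying on it. The helper name clear_deferred_failures and the FakeFailureBackend class are hypothetical and exist only so the example runs on its own.

```ruby
# Removes failure entries whose exception class name matches `exception_name`.
# `backend` is assumed to respond to count, all(offset, limit) and remove(index).
def clear_deferred_failures(backend, exception_name = 'DeferredToNewJob')
  total = backend.count
  return 0 if total.zero?

  # Some backends may return a bare hash when only one entry is requested,
  # so normalise the result to an array of hashes.
  failures = [backend.all(0, total)].flatten.compact

  removed = 0
  # Walk backwards so that removing an entry does not shift the
  # indices of the entries still to be checked.
  (failures.length - 1).downto(0) do |index|
    next unless failures[index]['exception'] == exception_name

    backend.remove(index)
    removed += 1
  end
  removed
end

# In-memory stand-in for the failure backend (for the sketch only).
class FakeFailureBackend
  def initialize(entries)
    @entries = entries
  end

  def count
    @entries.length
  end

  def all(offset, limit)
    @entries[offset, limit]
  end

  def remove(index)
    @entries.delete_at(index)
  end

  def exceptions
    @entries.map { |entry| entry['exception'] }
  end
end

BACKEND = FakeFailureBackend.new([
  { 'exception' => 'DeferredToNewJob' },
  { 'exception' => 'RuntimeError' },
  { 'exception' => 'DeferredToNewJob' },
  { 'exception' => 'Timeout::Error' }
])
REMOVED = clear_deferred_failures(BACKEND)
```

If DeferredToNewJob lives inside a module, the stored name would include the namespace, so the string to match would need adjusting.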

UPDATE

resque-status supports an on_failure handler. If a method with this name is defined as an instance method on the job class, we can make this even simpler.

Here's my on_failure:

def on_failure(e)
  if e.is_a? DeferredToNewJob
    tick('Waiting for new job')
  else
    raise e
  end
end

With this in place, the job spends basically no time in the failed state for processes watching its status. In addition, if resque-status finds this handler, it won't raise the exception up to resque, so the job won't get added to the failed queue.
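
The behaviour described above can be mimicked in a few lines of plain Ruby. This is not resque-status's actual code: MiniStatusJob, safe_perform and DeferringJob are hypothetical stand-ins that only imitate the "call on_failure if it exists, otherwise re-raise" dispatch, so the effect of the handler on the reported status is easy to see.

```ruby
class DeferredToNewJob < StandardError; end

# Hypothetical miniature of a status-tracking job runner.
class MiniStatusJob
  attr_reader :status, :message

  def initialize(&work)
    @work = work
    @status = 'queued'
  end

  # Marks the job as working and records a progress message.
  def tick(message)
    @status = 'working'
    @message = message
  end

  def safe_perform
    tick('Started')
    @work.call
    @status = 'completed'
  rescue StandardError => e
    @status = 'failed'
    @message = e.message
    # Hand the error to on_failure when the job defines it;
    # otherwise let it propagate to the caller (the failed queue).
    if respond_to?(:on_failure)
      on_failure(e)
    else
      raise
    end
  end
end

class DeferringJob < MiniStatusJob
  def on_failure(e)
    if e.is_a?(DeferredToNewJob)
      tick('Waiting for new job')  # replace the failed status straight away
    else
      raise e                      # genuine failures still propagate
    end
  end
end

DEFERRED = DeferringJob.new { raise DeferredToNewJob }
DEFERRED.safe_perform  # the exception is swallowed by on_failure
```

In this sketch the deferred job ends up in the working state with the message 'Waiting for new job', while any other exception still escapes safe_perform.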

Upvotes: 0
