Garrett Motzner

Reputation: 3260

How to recover a critical python job from system failure

Is there any python library that would provide a (generic) job state journaling and recovery functionality?

Here's my use case:

  1. data received to start a job
  2. job starts processing
  3. job finishes processing

I then want to be able to restart a job from step 1 if the process aborts or the power fails. Jobs would write to a journal file when they start and mark themselves done when they complete. So when the process starts, it checks the journal file for uncompleted jobs and uses the journal data to restart any jobs that did not complete.

What Python tools exist to solve this (or other Python approaches to fault tolerance and recovery for critical jobs that must complete)? I know a job queue like RabbitMQ would work quite well for this case, but I want a solution that doesn't need an external service. I searched PyPI for "journaling" and didn't get much. So any solutions? It seems like a library for this would be useful, since there are multiple concerns that are hard to get right when using a journal but that a library could handle (such as multiple async writes, file splitting and truncating, etc.).
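For what it's worth, the append-only journal described above can be sketched with just the standard library. This is a minimal, hypothetical illustration (the `jobs.journal` filename and record layout are made up, and it ignores concurrency, rotation, and truncation, which are exactly the concerns a real library would handle):

```python
import json
import os

JOURNAL = "jobs.journal"  # hypothetical journal file path

def log_start(job_id, payload):
    # Append a "started" record and force it to disk before processing begins,
    # so a crash after this point leaves evidence that the job was in flight.
    with open(JOURNAL, "a") as f:
        f.write(json.dumps({"id": job_id, "status": "started",
                            "payload": payload}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def log_done(job_id):
    # Append a "done" record; a job is complete once both records exist.
    with open(JOURNAL, "a") as f:
        f.write(json.dumps({"id": job_id, "status": "done"}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def incomplete_jobs():
    # Replay the journal on startup: any job with a "started" record but no
    # matching "done" record did not finish and should be restarted.
    started, done = {}, set()
    if not os.path.exists(JOURNAL):
        return []
    with open(JOURNAL) as f:
        for line in f:
            rec = json.loads(line)
            if rec["status"] == "started":
                started[rec["id"]] = rec["payload"]
            else:
                done.add(rec["id"])
    return [(jid, payload) for jid, payload in started.items()
            if jid not in done]
```

On startup you would call `incomplete_jobs()` and re-dispatch whatever it returns before accepting new work.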

Upvotes: 1

Views: 420

Answers (1)

Mohamed Belkheir

Reputation: 179

I think you can do this using either crontabs or APScheduler. I think the latter has all the features you need, but even with cron you can do something like:

1: schedule process A to run at a specific interval

2: process A checks whether a job is already running

3: if no job is running, start one

4: the job continues working and saves its state to disk or a database

5: if the job fails or finishes, step 3 will start it again on the next run
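The cron-driven watchdog in the steps above might look roughly like this. A hedged sketch using only the standard library: the `job.pid` and `job.state` filenames and the `start_job` stub are assumptions, not part of cron or APScheduler, and the PID check is POSIX-specific:

```python
import json
import os

PIDFILE = "job.pid"    # hypothetical pidfile written by the running job
STATE = "job.state"    # hypothetical file where the job saves its progress

def job_is_running():
    # Step 2: check whether a previously started job process is still alive.
    try:
        with open(PIDFILE) as f:
            pid = int(f.read())
    except (FileNotFoundError, ValueError):
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, sends nothing
        return True
    except ProcessLookupError:
        return False     # stale pidfile; the process is gone
    except PermissionError:
        return True      # process exists but belongs to another user

def load_state():
    # Step 4's counterpart: read whatever progress the last run saved,
    # so the new run can resume instead of starting over.
    try:
        with open(STATE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def check_and_restart(start_job):
    # Steps 2-3: the entry point cron invokes at each interval.
    # `start_job` is a hypothetical callable that launches the worker.
    if job_is_running():
        return "running"
    start_job(load_state())
    return "restarted"
```

You would point the crontab entry at a small script calling `check_and_restart`, and have the job itself write `PIDFILE` on startup and update `STATE` as it makes progress.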

APScheduler is likely what you're looking for: its feature list is extensive, and it's also extensible if it doesn't fulfill your requirements.

Upvotes: 1
