Bo Qiang

Reputation: 789

PBS: automatically restart failed jobs

I use PBS job arrays to submit a number of jobs. Sometimes a few of the jobs fail and do not run to completion. Is there a way to automatically detect the failed jobs and restart them?

Upvotes: 1

Views: 797

Answers (1)

clusterdude

Reputation: 623

In Torque, pbs_server supports the automatic_requeue_exit_code server parameter, which the documentation describes as:

an exit code, defined by the admin, that tells pbs_server to requeue the job instead of considering it as completed. This allows the user to add some additional checks that the job can run meaningfully, and if not, then the job script exits with the specified code to be requeued.
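As a rough sketch of how a job script might use this, assuming the admin has set the requeue code to 99 (a made-up value for illustration; something like `qmgr -c "set server automatic_requeue_exit_code = 99"` on the server side) and that the real work is a command you supply. The helper name `run_or_requeue` and the `./my_task` command are hypothetical, not part of Torque:

```shell
#!/bin/bash
# ASSUMPTION: 99 is a placeholder. It must match whatever value the admin
# configured for automatic_requeue_exit_code on pbs_server.
REQUEUE_EXIT_CODE=99

# Hypothetical helper: run the given command and return 0 on success,
# or the requeue exit code on any failure.
run_or_requeue() {
    if "$@"; then
        return 0
    fi
    echo "task failed: $*; requesting a requeue" >&2
    return "$REQUEUE_EXIT_CODE"
}

# In a real array job the script would end with something like the lines
# below (./my_task is a placeholder; PBS_ARRAYID is the array index that
# Torque sets for each sub-job):
#   run_or_requeue ./my_task "$PBS_ARRAYID"
#   exit $?
```

Note that a task that fails every time would be requeued over and over with this approach, so it is probably worth adding your own check (for example, a retry counter stored in a file) before returning the requeue code.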

Torque can also requeue a job when its prologue script fails (see the prologue/epilogue script documentation).

There are probably more sophisticated ways of doing this, but they would fall outside the built-in Torque options.

Upvotes: 1
