user1410665

Reputation: 759

How can I restart a failed PBS job on a cluster (qsub)?

I'm running a PBS job (a Python script) on a cluster, submitted with the qsub command. How can I restart the same job from the step where it failed? Any help would be highly appreciated.

Upvotes: 2

Views: 1269

Answers (1)

Thomas Kainrad

Reputation: 2830

Most likely, you cannot.

Restarting a job from the point where it failed requires a checkpoint file.
For this, checkpointing support has to be explicitly configured in your HPC environment, and the job has to be submitted with additional command-line arguments that enable it.

See http://docs.adaptivecomputing.com/torque/3-0-5/2.6jobcheckpoint.php
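
If scheduler-level checkpointing is not available, one alternative is to have the Python script record its own progress. This is a different technique from the Torque checkpointing described above: the sketch below is application-level checkpointing, and it assumes the job can be split into independent steps that are safe to skip once completed. The file name `checkpoint.json` and the function names are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name


def load_checkpoint(path):
    """Return the index of the next step to run, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_step"]


def save_checkpoint(path, next_step):
    """Write the checkpoint via a temp file and rename, so a crash
    in the middle of writing cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_step": next_step}, f)
    os.replace(tmp_path, path)


def run_pipeline(steps, path=CHECKPOINT_PATH):
    """Run the callables in `steps` in order, skipping those that a
    previous run already completed according to the checkpoint file."""
    start = load_checkpoint(path)
    for i in range(start, len(steps)):
        steps[i]()                    # may raise; checkpoint stays at i
        save_checkpoint(path, i + 1)  # step i is done, resume at i + 1
```

With something like this in place, resubmitting the same script with qsub after a failure would make it continue at the failed step rather than at the beginning. Any intermediate data that later steps depend on would also have to be saved to disk by the steps themselves; the sketch only tracks which step to run next.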

Upvotes: 1
