Reputation: 759
I'm running a PBS job (Python) on a cluster using the qsub command. How can I restart the same job from the step where it failed? Any help would be highly appreciated.
Upvotes: 2
Views: 1269
Reputation: 2830
Most likely, you cannot.
Restarting a job requires a checkpoint file.
For this, checkpointing support has to be explicitly configured in your HPC environment, and the job then has to be submitted with additional command-line arguments (in TORQUE, the -c option of qsub).
See http://docs.adaptivecomputing.com/torque/3-0-5/2.6jobcheckpoint.php
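If system-level checkpointing is not available on your cluster, one alternative is to have the Python script write its own checkpoint file. Note that this is application-level checkpointing, a different technique from the TORQUE mechanism linked above. The following is only a minimal sketch: the file name checkpoint.json and the functions load_checkpoint, save_checkpoint and run_pipeline are hypothetical names chosen for illustration, and it assumes the job can be split into independent steps and that the checkpoint file lives on a filesystem that persists between job submissions.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name, not a PBS/TORQUE convention


def load_checkpoint(path):
    """Return the index of the next step to run (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_step"]


def save_checkpoint(path, next_step):
    """Write the checkpoint via a temp file and rename, so a crash
    during the write cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_step": next_step}, f)
    os.replace(tmp, path)


def run_pipeline(steps, path=CHECKPOINT):
    """Run the steps in order, skipping those already recorded as done."""
    start = load_checkpoint(path)
    for i in range(start, len(steps)):
        steps[i]()                    # an exception here leaves the checkpoint at step i
        save_checkpoint(path, i + 1)  # record progress only after the step succeeds
```

With something like this in place, resubmitting the same script with qsub after a failure would skip the steps that already completed and start again at the one that failed. Any intermediate results that later steps depend on would also have to be saved to disk by the steps themselves.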
Upvotes: 1