Reputation: 91
I am running a job test.sh that cannot exceed a wall-time of 24h. Since the simulation will take more than 10 days, I would like to restart it automatically every time it reaches the wall-time limit. I would simply need it to resubmit the same test.sh script each time.
I tried
jobid=$(sbatch --parsable test.sh)
scontrol update jobid $jobid dependency=after:$jobid
but the $jobid in scontrol update jobid $jobid is supposed to refer to a new job. Do you have any suggestions? This may not be the right way to achieve it.
Thank you for the help!
Upvotes: 9
Views: 3499
Reputation: 59340
One easy way is to use the timeout command, which will stop your program a bit before Slurm does and will tell you through its return code whether the timeout was reached. If so, you can requeue the job with scontrol.
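#!/bin/bash
#SBATCH --open-mode=append
#SBATCH --time=24:00:00
#... other Slurm options...

# Stop the program one hour before the Slurm wall-time limit
timeout 23h ./the_program
# timeout exits with status 124 when the time limit was reached
if [[ $? == 124 ]]; then
    scontrol requeue "$SLURM_JOB_ID"
fi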
The --open-mode=append option is there to make sure the output of each run is appended to the file chosen by --output rather than truncating it at every restart.
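To try the exit-status check without submitting a job, here is a minimal sketch, assuming GNU coreutils timeout. The function name run_with_limit and the labels it prints are hypothetical and not part of Slurm; in a real batch script, the timed_out branch is where scontrol requeue would go.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a time limit and report
# whether it was cut off by timeout or finished on its own.
run_with_limit() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local status=$?
    if [[ $status -eq 124 ]]; then
        # GNU timeout exits with 124 when the limit was reached
        echo "timed_out"
    else
        # Otherwise the command's own exit status is passed through
        echo "finished:$status"
    fi
}
```

For example, run_with_limit 1 sleep 5 hits the one-second limit, while run_with_limit 5 true completes normally. Only the first case would lead to a requeue.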
Upvotes: 14