Dave

Reputation: 91

Slurm: automatically requeue a job that reached wall-time limit

I am running a job test.sh that cannot exceed a wall-time of 24h. Since the simulation will take more than 10 days, I would like to restart it automatically every time it reaches the wall-time limit. I would simply need to have it resubmit the same test.sh script every time.

I tried

jobid=$(sbatch --parsable test.sh)

scontrol update JobId=$jobid Dependency=after:$jobid

but the $jobid in scontrol update JobId=$jobid is supposed to be a new job, not the one I just submitted. Do you have any suggestions? Maybe this is not the right way to achieve it.
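
Something like the chain below is what I have in mind, although the afterany dependency and resubmitting from inside the job script are only my guesses at how this could be done (my_simulation stands for whatever test.sh actually runs):

#!/bin/bash
#SBATCH --time=24:00:00

# Guess: queue the next instance of this same script, to start only
# once the current job ends, whether it finished or hit the wall-time.
sbatch --dependency=afterany:$SLURM_JOB_ID test.sh

./my_simulation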

Thank you for the help!

Upvotes: 9

Views: 3499

Answers (1)

damienfrancois

Reputation: 59340

One easy way is to use the timeout command, which will stop your program a bit before Slurm does and will tell you through the return code whether the timeout was reached. If it was, you can requeue the job with scontrol.

#!/bin/bash
#SBATCH --open-mode=append
#SBATCH --time=24:00:00
#... other Slurm options...

# Give the program one hour less than the Slurm limit; timeout exits
# with code 124 if it had to stop the program because time ran out.
timeout 23h ./the_program
if [[ $? == 124 ]]; then
  scontrol requeue $SLURM_JOB_ID
fi

The --open-mode=append is there to make sure the output of each run is appended to the file chosen by --output rather than truncated at every restart.
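
For example, combined with an explicit output file (the file name below is only an illustration), every requeued run keeps writing to the same log:

# one log file shared by all runs; the name is just an example
#SBATCH --output=simulation.log
# append on each restart instead of truncating
#SBATCH --open-mode=append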

Upvotes: 14
