duhaime

Reputation: 27611

Slurm: how many times will failed jobs be --requeue'd

I have a Slurm job array for which the job file includes a --requeue directive. Here is the full job file:

#!/bin/bash
#SBATCH --job-name=catsss
#SBATCH --output=logs/cats.log
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge
#SBATCH --mem=32g
#SBATCH --time=24:00:00
#SBATCH --mail-type=FAIL
#SBATCH [email protected]
module load Langs/Python/3.4.3
python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'

Several of the array tasks have restarted at least once. How many times will these jobs restart before the scheduler finally cancels them? Will they be requeued indefinitely until a sysadmin manually cancels them, or is there a maximum number of retries for jobs like this?

Upvotes: 3

Views: 4597

Answers (1)

Poshi

Reputation: 5762

AFAIK, jobs can be requeued an unlimited number of times. You only decide whether the job is allowed to be requeued at all: with no-requeue it will never be requeued, and with requeue it will be requeued every time the system decides it is necessary (node failure, preemption by a higher-priority job, and so on).

The jobs keep restarting until they finish (successfully or not); what matters is that they terminate on their own rather than being interrupted.
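If you want to cap the retries yourself, one option is to bail out inside the job script using the `SLURM_RESTART_COUNT` environment variable, which Slurm sets to the number of times the job has been restarted (it is unset on the first run). A minimal sketch; the limit of 3 is an arbitrary choice for illustration:

```shell
#!/bin/bash
# Give up after the job has been requeued MAX_RESTARTS times.
# SLURM_RESTART_COUNT is set by Slurm on restarted jobs; default to 0
# for the first run, when the variable is unset.
MAX_RESTARTS=3
RESTARTS=${SLURM_RESTART_COUNT:-0}

if [ "$RESTARTS" -ge "$MAX_RESTARTS" ]; then
    echo "Task ${SLURM_ARRAY_TASK_ID:-?} restarted $RESTARTS times; giving up." >&2
    exit 1
fi

# ... rest of the job script (module load, python3 cats.py ...) goes here
```

Because the script exits on its own rather than being interrupted, Slurm treats the job as finished and will not requeue it again.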

Upvotes: 3

Related Questions