Reputation: 2079
I have realized that the jobs submitted with a previous version of my software are useless because of a bug, so I want to cancel them. However, I also have newer jobs that I would like to keep running. All the jobs have the same job name and are running in the same partition.
I have written the following script to cancel the jobs with an ID lower than a given one.
#!\bin\bash
if [ $1 ]
then
MAX_JOBID=$1
else
echo "An integer value is needed"
exit
fi
JOBIDLIST=$(squeue -u $USER -o "%F")
for JOBID in $JOBIDLIST
do
if [ "$JOBID" -lt "$MAX_JOBID" ]
then
echo "Cancelling job "$JOBID
scancel $JOBID
fi
done
I would say that this is a recurrent situation for someone developing a software and I wonder if there is a direct way to do it using slurm commands. Alternatively, do you use some tricks like appending the software commit ID to the job name to overcome this kind of situations?
Upvotes: 1
Views: 849
Reputation: 59360
In addition to the suggestions by @j23, you can organise your jobs with
job arrays ; if all your jobs are similar in terms of submission script, make them a job array, and submit one job array per version of your software. Then you can cancel an entire job array with just one scancel
command
a workflow management system ; they enable submitting and managing sets of jobs (possibly on different clusters) easily
Upvotes: 1
Reputation: 3530
Unfortunately there is no direct way to cancel the job in such scenarios.
Alternatively, like you pointed out, naming the job by adding software version/commit along with job name is useful. In that case you can use, scancel --name=JOB_NAME_VERSION
to cancel all the jobs with that job name.
Also, if newly submitted jobs can be hold
using scontrol hold <jobid>
and then all the PENDING
job can be cancelled using scancel --state=PENDING
In my case, I used a similar approach (like yours) by having squeue
piped the output to awk
and cancelled the first N number of jobs I wanted to remove. Its a one-liner script.
Something like this:
eg: squeue arguments | awk 'NR>=2 && NR<=N{print $1}' | xargs /usr/bin/scancel
Upvotes: 3