Clej

Reputation: 466

Do I need a single bash file for each task in SLURM?

I am trying to launch several tasks on a SLURM-managed cluster, and would like to avoid dealing with dozens of files. Right now I have 50 tasks (subscripted by i; for simplicity, i is also the input parameter of my program), and for each one a separate bash file slurm_run_i.sh that specifies the computation configuration and the srun command:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1 
#SBATCH -J pltCV
#SBATCH --mem=30G

srun python plotConvergence.py i

I then use another bash file, slurm_run_all.sh, to submit all these tasks:

#!/bin/bash
for i in {1..50}; do
  sbatch slurm_run_$i.sh
done

This works (50 jobs run on the cluster), but I find it troublesome to maintain more than 50 input files. Searching for a solution, I came across the & operator and ended up with something like:

#!/bin/bash

#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1 
#SBATCH -J pltall
#SBATCH --mem=30G

# Running jobs 
srun python plotConvergence.py 1   &
srun python plotConvergence.py 2   & 
...
srun python plotConvergence.py 49  & 
srun python plotConvergence.py 50  & 
wait
echo "All done"

This also seems to run. However, I can no longer manage each of these jobs independently: the output of squeue shows a single job (pltall) running on a single node. Since there are only 12 cores per node in the partition I am working in, I assume most of my tasks are waiting on the single node I've been allocated. Setting the -N option doesn't change anything either. Moreover, I can no longer cancel jobs individually if I realize there's a mistake, which sounds problematic to me.

Is my interpretation right, and is there a better way than my attempt to run several jobs in Slurm without getting lost among many files?

Upvotes: 1

Views: 442

Answers (1)

damienfrancois

Reputation: 59110

What you are looking for is the job array feature of Slurm.

In your case, you would have a single submission file (slurm_run.sh) like this:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1 
#SBATCH -J pltCV
#SBATCH --mem=30G
#SBATCH --array=1-50

srun python plotConvergence.py ${SLURM_ARRAY_TASK_ID}

and then submit the array of jobs with

sbatch slurm_run.sh
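If the cluster is shared, the array syntax also lets you throttle how many tasks run at once: a % suffix on the range limits concurrency. For example, replacing the --array line above with

#SBATCH --array=1-50%10

submits the same 50 tasks but allows at most 10 of them to run simultaneously.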

You will see that you have 50 jobs submitted. You can cancel all of them at once (scancel <jobid>) or one by one (scancel <jobid>_<taskid>). See the man page of sbatch for details.
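If the input parameter ever stops being the plain task index, a common pattern is to use SLURM_ARRAY_TASK_ID to index into a parameter file. Here is a minimal sketch, assuming a hypothetical params.txt with one parameter value per line (the file name and values are made up for illustration):

```shell
#!/bin/bash
# Create a hypothetical parameter file, one value per line.
printf '0.1\n0.5\n1.0\n' > params.txt

# Inside a job array, SLURM sets SLURM_ARRAY_TASK_ID;
# outside SLURM it is unset, so default to 1 for this demo.
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}

# sed -n "Np" prints only line N of the file.
PARAM=$(sed -n "${TASK_ID}p" params.txt)
echo "$PARAM"
```

In the submission script this would become srun python plotConvergence.py "$PARAM", with #SBATCH --array=1-3 matching the number of lines in the file.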

Upvotes: 1
