akraf

Reputation: 3255

Why are my slurm job steps not launching in parallel?

I am trying to figure out what the concept of "tasks" means in SLURM. I found this answer on SO, which suggests the following job script:

#!/bin/bash

#SBATCH --ntasks=2

srun --ntasks=1 sleep 10 & 
srun --ntasks=1 sleep 12 &
wait

The author says that this job runs for him in 12 seconds in total, because the two steps sleep 10 and sleep 12 run in parallel but I cannot reproduce that.

If I save the above file as slurm-test and run

sbatch -o slurm.out slurm-test

I see that my job runs for 23 seconds.

This is the output of sacct --format=JobID,Start,End,Elapsed,NCPUS -S now-2minutes:

       JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
645514       2021-06-30T11:05:38 2021-06-30T11:06:00   00:00:22          2
645514.batch 2021-06-30T11:05:38 2021-06-30T11:06:00   00:00:22          2
645514.exte+ 2021-06-30T11:05:38 2021-06-30T11:06:00   00:00:22          2
645514.0     2021-06-30T11:05:38 2021-06-30T11:05:48   00:00:10          2
645514.1     2021-06-30T11:05:48 2021-06-30T11:06:00   00:00:12          2

My slurm.out output is

srun: Job 645514 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 645514

Explicitly including -n 2 in the sbatch call does not change the result. What am I doing wrong? How can I get the two calls in my job file to run simultaneously?

Upvotes: 5

Views: 4190

Answers (2)

Isabella

Reputation: 401

For me, the cause of "step creation temporarily disabled, retrying (Requested nodes are busy)" was that the srun command that executed first had allocated all of the job's memory. To solve this, first specify the total memory allocation in sbatch (this step may be optional):

#SBATCH --ntasks=2
#SBATCH --mem=[XXXX]MB

And then specify the memory use per srun task:

srun --exclusive --ntasks=1 --mem-per-cpu=[XXXX/2]MB sleep 10 &
srun --exclusive --ntasks=1 --mem-per-cpu=[XXXX/2]MB sleep 12 &
wait

I didn't specify a CPU count for srun because my sbatch script has #SBATCH --cpus-per-task=1. For the same reason, I suspect you should use --mem instead of --mem-per-cpu in the srun command when your job isn't serial, but I haven't tested this configuration.
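Putting these pieces together, a complete job script might look like the sketch below. The 2000MB total is a hypothetical placeholder, not a recommendation; substitute a value appropriate for your job and cluster:

```shell
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=2000MB   # hypothetical total; pick a value that fits your cluster

# Give each step an explicit share of the job's memory so the first
# srun cannot consume the whole allocation and block the second step.
srun --exclusive --ntasks=1 --mem-per-cpu=1000MB sleep 10 &
srun --exclusive --ntasks=1 --mem-per-cpu=1000MB sleep 12 &
wait
```

If both steps can start immediately, sacct should show overlapping Start times for steps .0 and .1 and a total elapsed time of roughly 12 seconds instead of 22.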

Upvotes: 4

damienfrancois

Reputation: 59340

Depending on the Slurm version, you might have to add the --exclusive parameter to srun (where it has different semantics than it does for sbatch):

#!/bin/bash

#SBATCH --ntasks=2

srun --ntasks=1 --exclusive -c 1 sleep 10 & 
srun --ntasks=1 --exclusive -c 1 sleep 12 &
wait

Adding -c 1 to be more explicit might also help, again depending on the Slurm version.

Upvotes: 4
