Reputation: 430
The scenario is this: I allocate resources (2 nodes, 64 CPUs) to a job with salloc:
salloc -N 1-2 -n 64 -c 1 -w cluster-node[2-3] -m cyclic -t 5
salloc: Granted job allocation 1720
Then I use srun to create steps in my job:
for i in (seq 70)
srun --exclusive -N 1 -n 1 --jobid=1720 sleep 60 &
end
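(The loop above is fish shell syntax; a bash equivalent, assuming the same job ID, would be:)
# bash version of the fish loop above: launch one single-task step per iteration
for i in $(seq 70); do
    srun --exclusive -N 1 -n 1 --jobid=1720 sleep 60 &
done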
Because I created more steps than there are CPUs available to my job, the extra steps are "pending" until a CPU becomes free.
When I use squeue with the -s option to list steps, I am only able to view the running ones:
squeue -s -O stepid:12,stepname:10,stepstate:9
1720.0 sleep RUNNING
[...]
1720.63 sleep RUNNING
My question is: do steps have states other than RUNNING, as jobs do, and if so, is there a way to view them with squeue (or another command)?
Upvotes: 1
Views: 254
Reputation: 59360
I am not sure Slurm can offer that information. One alternative would be to use GNU Parallel so that job steps are not started at all until a CPU is available. In the current setting, all job steps are started at once, and those that do not have a CPU available are left waiting.
So, with the same allocation as yours, replace
for i in (seq 70)
srun --exclusive -N 1 -n 1 --jobid=1720 sleep 60 &
end
with
seq 70 | parallel -N0 -P $SLURM_NTASKS srun --exclusive -N 1 -n 1 --jobid=1720 sleep 60
Then the output of squeue will only list steps that are actually running; the steps that have not started yet are queued inside Parallel and are not created in Slurm until a CPU is free.
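For completeness, a minimal sketch of the whole workflow, assuming the allocation from the question (the --joblog file name is an arbitrary choice):
# request the allocation; salloc opens a shell inside it
salloc -N 1-2 -n 64 -c 1 -w cluster-node[2-3] -m cyclic -t 5
# from that shell, cap concurrency at the task count ($SLURM_NTASKS is 64 here);
# -N0 discards the seq values, --joblog keeps a record of finished steps
seq 70 | parallel -N0 -P $SLURM_NTASKS --joblog steps.log srun --exclusive -N 1 -n 1 sleep 60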
N.B. I am not sure the --jobid= option is needed here.
Upvotes: 1