Reputation: 5201
I was using SLURM on a computing cluster and came across the --ntasks (or -n) option. I have read the documentation for it (http://slurm.schedmd.com/sbatch.html):
sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.
The specific part I do not understand is:
run within the allocation will launch a maximum of number tasks and to provide for sufficient resources.
I have a few questions:
sbatch my_batch_job.sh. I am not sure what "task" means.
-n, --ntasks=<number>. However, I tested it out on the cluster: I ran an echo hello script with --ntasks=9, expecting sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out). To my surprise, there was a single execution of my echo hello script. Then what does this option even do? It seems to do nothing, or at least I can't see what it is supposed to be doing.
I do know that the -a, --array=<indexes> option exists for multiple jobs. That is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example that I can test on the cluster.
Upvotes: 96
Views: 83135
Reputation: 3653
Tasks are processes that a job executes in parallel on one or more nodes. sbatch allocates resources for your job, but even if you request resources for multiple tasks, it will launch your job script as a single process on a single node only. srun is used to launch job steps from the batch script. --ntasks=N instructs srun to execute N copies of the job step.
For example,
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
means that you want to run two processes in parallel, and have each process access two CPUs. sbatch
will allocate four CPUs for your job and then start the batch script in a single process. Within your batch script, you can create a parallel job step using
srun --ntasks=2 --cpus-per-task=2 step.sh
This will run two processes in parallel, both of them executing the step.sh
script. From the same job, you could also run
srun --ntasks=1 --cpus-per-task=4 step.sh
This would launch a single process that can access all four CPUs (although it would issue a warning).
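To see what the batch script itself gets, a minimal sketch like the following may help. It assumes that Slurm exports SLURM_NTASKS and SLURM_CPUS_PER_TASK into the batch environment when the matching options are given; the helper name report_allocation is made up for illustration, and the fallback to 1 is only there so the script can be dry-run outside a job.

```shell
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

# Hypothetical helper: print the task layout that the batch script sees.
# Defaults to 1 when the variables are unset (e.g. outside a Slurm job).
report_allocation() {
    local ntasks="${SLURM_NTASKS:-1}"
    local cpus_per_task="${SLURM_CPUS_PER_TASK:-1}"
    echo "tasks=${ntasks} cpus_per_task=${cpus_per_task} total_cpus=$(( ntasks * cpus_per_task ))"
}

# The batch script runs as a single process, so this prints one line.
report_allocation
```

Submitted with sbatch, this should print one line reporting total_cpus=4, however many tasks were requested. The number of CPUs actually granted can be larger on clusters that hand out whole cores or whole nodes.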
It's worth noting that within the allocated resources, your job script is free to do anything, and it doesn't have to use srun to create job steps (but you need srun to launch a job step on multiple nodes). For example, the following script will run both steps in parallel:
#!/bin/bash
#SBATCH --ntasks=1
step1.sh &
step2.sh &
wait
If you want to launch job steps using srun and have two different steps run in parallel, then your job needs to allocate two tasks, and your job steps need to request only one task each. You also need to provide the --exclusive argument to srun for the job steps to use separate resources.
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 --exclusive step1.sh &
srun --ntasks=1 --exclusive step2.sh &
wait
Upvotes: 11
Reputation: 1371
The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script.
This may be two separate commands separated by an & or two commands used in a bash pipe (|).
For example, using the default ntasks=1:
#!/bin/bash
#SBATCH --ntasks=1
srun sleep 10 &
srun sleep 12 &
wait
Will throw the warning:
Job step creation temporarily disabled, retrying
The number of tasks defaults to one, so the second step cannot start until the first has finished. This job will finish in around 22 seconds. To break this down:
sacct -j 515058 --format=JobID,Start,End,Elapsed,NCPUS
       JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515058       2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.0     2018-12-13T20:51:44 2018-12-13T20:51:56   00:00:12          1
515058.1     2018-12-13T20:51:56 2018-12-13T20:52:06   00:00:10          1
Here step 0 started and finished (in 12 seconds), followed by step 1 (in 10 seconds), for a total elapsed time of 22 seconds.
To run both of these commands simultaneously:
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 sleep 10 &
srun --ntasks=1 sleep 12 &
wait
Running the same sacct command as above:
sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS
       JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515064       2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.0     2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          1
515064.1     2018-12-13T21:34:08 2018-12-13T21:34:18   00:00:10          1
Here the whole job takes 12 seconds. There is no risk of steps waiting for resources, because the number of tasks has been specified in the batch script and the job therefore has the resources to run this many commands at once.
Each srun step inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be specified for each srun command; otherwise each step uses --ntasks=2, and the second command will not run until the first has finished.
Another caveat of the steps inheriting the batch parameters arises if --export=NONE is specified as a batch parameter. In this case --export=ALL should be specified for each srun command; otherwise environment variables set within the sbatch script are not inherited by the srun command.
Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent the commands on either side of the pipe from running on separate nodes.
When using & to run commands simultaneously, the wait is vital. Without it, the batch script would exit as soon as the background commands had been launched, and the job would end with its steps still running, so they would be cancelled.
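The role of wait can be tried out without a cluster. The following is a plain bash sketch, not a Slurm script: the two sleep commands stand in for the backgrounded srun steps above, and the variable names are only illustrative.

```shell
#!/bin/bash
# Two background commands standing in for "srun ... &" job steps.
start=$SECONDS

sleep 1 &
pid1=$!
sleep 2 &
pid2=$!

# Block until each background command has finished and record its exit status.
wait "$pid1"; status1=$?
wait "$pid2"; status2=$?

elapsed=$(( SECONDS - start ))
echo "exit statuses: ${status1} ${status2}, elapsed: ${elapsed}s"
```

Because the script blocks on both process IDs, it cannot reach its last line before the longer command has finished. Waiting on each PID separately also gives you the exit status of each command, which a bare wait does not report.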
Upvotes: 90
Reputation: 419
The "--ntasks" option specifies how many instances of your command are executed. For a common cluster setup, and if you start your command with "srun", this corresponds to the number of MPI ranks.
In contrast, the option "--cpus-per-task" specifies how many CPUs each task can use.
Your output surprises me as well. Did you launch your command directly in the script or via srun? Does your script look like this:
#!/bin/bash
#SBATCH --ntasks=8
## more options
echo hello
This should always output only a single line, because the batch script itself is executed only once, on the first node of the allocation, not once per task.
If your script looks like
#!/bin/bash
#SBATCH --ntasks=8
## more options
srun echo hello
Here srun launches your command once per task on the allocated nodes, and as a result you should get 8 lines of hello.
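To tell the eight copies apart, each one can print its own rank. This is a sketch that assumes srun sets SLURM_PROCID and SLURM_NTASKS in the environment of every task it launches; the variable hello_cmd and the fallback branch for running outside a job are my own additions for illustration.

```shell
#!/bin/bash
#SBATCH --ntasks=8

# Single quotes keep the variables unexpanded here, so each task expands
# them itself and reports its own rank.
hello_cmd='echo "hello from task ${SLURM_PROCID:-0} of ${SLURM_NTASKS:-1}"'

if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    # Inside a job: srun starts one copy of the command per task.
    srun bash -c "$hello_cmd"
else
    # Outside Slurm (dry run): the command runs exactly once.
    bash -c "$hello_cmd"
fi
```

Inside a job this should give eight lines with ranks 0 to 7, in no guaranteed order. Run outside Slurm, it prints a single line.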
Upvotes: 36