Reputation: 163
I have access to a large GPU cluster (20+ nodes, 8 GPUs per node) and I want to launch a task several times on n GPUs (1 per GPU, n > 8) within a single batch job, without booking full nodes with the --exclusive flag.
I managed to pre-allocate the resources (see below), but I am struggling to launch the task several times within the job. Specifically, my log shows no value for the CUDA_VISIBLE_DEVICES variable.
I know how to do this on fully booked nodes with the --nodes and --gres flags: in that situation, I use --nodes=1 --gres=gpu:1 for each srun. However, this approach does not work for the present question; the job hangs indefinitely.
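For reference, the pattern that works for me on fully booked nodes looks roughly like this (a sketch; ./my_task stands in for the actual program):
for i in {1..28}
do
srun --nodes=1 --gres=gpu:1 ./my_task &
done
wait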
In the MWE below, I have a job asking for 16 GPUs (--ntasks and --gpus-per-task). The job is composed of 28 tasks, which are launched with the srun command.
#!/usr/bin/env bash
#SBATCH --job-name=somename
#SBATCH --partition=gpu
#SBATCH --nodes=1-10
#SBATCH --ntasks=16
#SBATCH --gpus-per-task=1
for i in {1..28}
do
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
done
wait
The output of this script should look like this:
nodeA 1
nodeR 2
...
However, this is what I got:
nodeA
nodeR
...
Upvotes: 3
Views: 2003
Reputation: 59260
When you write
srun echo $(hostname) $CUDA_VISIBLE_DEVICES &
the expansion of the $CUDA_VISIBLE_DEVICES variable will be performed on the master node of the allocation (where the script runs) rather than on the node targeted by srun. You should escape the $:
srun echo $(hostname) \$CUDA_VISIBLE_DEVICES &
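Note that srun does not pass the command through a shell on the compute node, so a bare echo may print the escaped variable literally. One way to make sure both $(hostname) and $CUDA_VISIBLE_DEVICES are expanded on the target node is to hand the whole line to a shell there, for instance (a sketch):
srun bash -c 'echo $(hostname) $CUDA_VISIBLE_DEVICES' &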
By the way, the --gpus-per-task= option appeared in the sbatch manpage in version 19.05. When you use it with an earlier version, I am not sure how it behaves.
Upvotes: 1