Reputation: 758
I am struggling with a basic Python script that uses PyTorch to print the available CUDA devices on a Slurm cluster.
This is the output of sinfo:
(ml) [s.1915438@sl2 pytorch_gpu_check]$ sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST GRES
compute* up 3-00:00:00 1 drain* scs0123 (null)
compute* up 3-00:00:00 1 down* scs0050 (null)
compute* up 3-00:00:00 120 alloc scs[0001-0009,0011-0 (null)
compute* up 3-00:00:00 1 down scs0010 (null)
developmen up 30:00 1 drain* scs0123 (null)
developmen up 30:00 1 down* scs0050 (null)
developmen up 30:00 120 alloc scs[0001-0009,0011-0 (null)
developmen up 30:00 1 down scs0010 (null)
gpu up 2-00:00:00 2 mix scs[2001-2002] gpu:v100:2
gpu up 2-00:00:00 2 idle scs[2003-2004] gpu:v100:2
accel_ai up 2-00:00:00 1 mix scs2041 gpu:a100:8
accel_ai up 2-00:00:00 4 idle scs[2042-2045] gpu:a100:8
accel_ai_d up 2:00:00 1 mix scs2041 gpu:a100:8
accel_ai_d up 2:00:00 4 idle scs[2042-2045] gpu:a100:8
accel_ai_m up 12:00:00 1 idle scs2046 gpu:1g.5gb
s_highmem_ up 3-00:00:00 1 mix scs0151 (null)
s_highmem_ up 3-00:00:00 1 idle scs0152 (null)
s_compute_ up 3-00:00:00 2 idle scs[3001,3003] (null)
s_compute_ up 1:00:00 2 idle scs[3001,3003] (null)
s_gpu_eng up 2-00:00:00 1 idle scs2021 gpu:v100:4
I have access to the accel_ai partition.
This is the Python file I am trying to run.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat gpu.py
import torch
print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")
try:
print(f"Current Devices: {torch.cuda.current_device()}")
except :
print('Current Devices: Torch is not compiled for GPU or No GPU')
print(f"No. of GPUs: {torch.cuda.device_count()}")
And this is my bash file to submit the job.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat check_gpu.sh
#!bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
This is what happens when I run the bash script to submit the job.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0
One thing I would like to make clear is that this PyTorch version was installed with CUDA 11.3 support from PyTorch's website.
Can anyone tell me what I am doing wrong? Also, even if I exclude the following lines, the output is the same.
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
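For what it's worth, a quick way to check which CUDA version the installed PyTorch wheel was built against (it prints None for a CPU-only build) is this one-liner, which is not part of my script:
# sanity check of the PyTorch build, not part of gpu.py
python -c "import torch; print(torch.version.cuda)"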
Upvotes: 0
Views: 3053
Reputation: 1058
As per your sinfo output, you have separate partitions with GPU access. You need to run your program on one of those. The job submission script can be modified as follows; you also need to specify the GPU type using --gres.
...
...
#SBATCH --partition=gpu
#SBATCH --gres=<Enter gpu type>
...
...
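For example, based on the sinfo output in your question, a request for a single A100 on the accel_ai partition you have access to could look like the following (the exact GRES string depends on how your site has configured it):
#SBATCH --partition=accel_ai
#SBATCH --gres=gpu:a100:1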
Upvotes: 1
Reputation: 758
There are a couple of blunders in my approach. In the job file, the first line should be #!/bin/bash, not #!bin/bash.
Also, Slurm has a dedicated command, sbatch, to submit job files. So to run a job file such as check_gpu.sh, we should use sbatch check_gpu.sh, not bash check_gpu.sh.
The reason I was getting the following output is that bash treats lines starting with # as comments, so all the #SBATCH directives are ignored and the script never goes through Slurm at all; it simply runs where it is invoked, with no GPU allocated.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0
Thus, only the following lines are executed from the job script.
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
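For reference, the corrected job script is identical to the one in the question except for the first line, and it is submitted with sbatch instead of bash:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py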
After these corrections, I submitted the job script with sbatch and it works as expected.
[s.1915438@sl1 pytorch_gpu_check]$ sbatch check_gpu.sh
Submitted batch job 7133028
[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.out
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB
[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.err
Upvotes: 0