Prakhar Sharma

Reputation: 758

How to run a PyTorch script on Slurm?

I am struggling with a basic Python script that uses PyTorch to print the CUDA devices on Slurm.

This is the output of sinfo.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"
 PARTITION AVAIL  TIMELIMIT  NODES  STATE             NODELIST       GRES
  compute*    up 3-00:00:00      1 drain*              scs0123     (null)
  compute*    up 3-00:00:00      1  down*              scs0050     (null)
  compute*    up 3-00:00:00    120  alloc scs[0001-0009,0011-0     (null)
  compute*    up 3-00:00:00      1   down              scs0010     (null)
developmen    up      30:00      1 drain*              scs0123     (null)
developmen    up      30:00      1  down*              scs0050     (null)
developmen    up      30:00    120  alloc scs[0001-0009,0011-0     (null)
developmen    up      30:00      1   down              scs0010     (null)
       gpu    up 2-00:00:00      2    mix       scs[2001-2002] gpu:v100:2
       gpu    up 2-00:00:00      2   idle       scs[2003-2004] gpu:v100:2
  accel_ai    up 2-00:00:00      1    mix              scs2041 gpu:a100:8
  accel_ai    up 2-00:00:00      4   idle       scs[2042-2045] gpu:a100:8
accel_ai_d    up    2:00:00      1    mix              scs2041 gpu:a100:8
accel_ai_d    up    2:00:00      4   idle       scs[2042-2045] gpu:a100:8
accel_ai_m    up   12:00:00      1   idle              scs2046 gpu:1g.5gb
s_highmem_    up 3-00:00:00      1    mix              scs0151     (null)
s_highmem_    up 3-00:00:00      1   idle              scs0152     (null)
s_compute_    up 3-00:00:00      2   idle       scs[3001,3003]     (null)
s_compute_    up    1:00:00      2   idle       scs[3001,3003]     (null)
s_gpu_eng    up 2-00:00:00      1   idle              scs2021 gpu:v100:4

I have access to the accel_ai partition.

This is the Python file I am trying to run.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat gpu.py 
import torch
print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")

try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except:
    print('Current Devices: Torch is not compiled for GPU or No GPU')

print(f"No. of GPUs: {torch.cuda.device_count()}")

And this is the job script I use to submit it.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat check_gpu.sh 
#!bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai

module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py

This is what happens when I run the bash script to submit the job.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh 
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0

One thing I would like to make clear is that this PyTorch version was installed with CUDA 11.3 support, following the instructions on PyTorch's website.

Can anyone tell me what I am doing wrong? Also, even if I exclude the following lines, the output is the same.

module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml

Upvotes: 0

Views: 3053

Answers (2)

ravikt

Reputation: 1058

As per your sinfo output, you have separate partitions with GPU access. You need to run your program on one of those. The job submission script can be modified as follows. You also need to specify the GPU type using --gres.

...
...
#SBATCH --partition=gpu
#SBATCH --gres=<Enter gpu type>
...
...
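Going by the sinfo output in the question, the GPU type on the gpu partition is v100, so a concrete version of those directives (with a hypothetical count of one GPU) might look like this:

```shell
#SBATCH --partition=gpu
#SBATCH --gres=gpu:v100:1
```

The general form of the value is gpu:&lt;type&gt;:&lt;count&gt;; the available types and counts per node are listed in the GRES column of sinfo.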

Upvotes: 1

Prakhar Sharma

Reputation: 758

There are a couple of blunders in my approach. In the job script, the first line should be #!/bin/bash, not #!bin/bash.

Also, Slurm has its own command, sbatch, for submitting a job file. So to run a job file, for example check_gpu.sh, we should use sbatch check_gpu.sh, not bash check_gpu.sh.

The reason I was getting the following output is that bash treats any line beginning with # as a comment, so every #SBATCH directive was silently ignored.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh 
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0

Thus, only the following lines from the job script were executed, directly on the login node rather than on a GPU compute node.

module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
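This behaviour is easy to demonstrate; a minimal sketch, using a throwaway script name demo.sh:

```shell
# Write a tiny job script. To bash, the #SBATCH line is just a comment;
# only sbatch parses such lines as scheduler directives.
cat > demo.sh <<'EOF'
#!/bin/bash
#SBATCH --gres=gpu:1
echo "only this line runs"
EOF

# Running it with bash executes the echo immediately on the current machine
# and ignores the directive entirely.
bash demo.sh
```

Submitting the same file with sbatch instead would queue it on the requested resources.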

After these corrections, I submitted the job script with sbatch and it works as expected.

[s.1915438@sl1 pytorch_gpu_check]$ sbatch check_gpu.sh 
Submitted batch job 7133028
[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.out 
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB
[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.err

Upvotes: 0
