Mehran

Reputation: 2051

How to access GPUs on different nodes in a cluster with Slurm?

I have access to a cluster that's run by Slurm, in which each node has 4 GPUs.

I have code that needs 8 GPUs.

So the question is: how can I request 8 GPUs on a cluster where each node has only 4 GPUs?

So this is the job that I tried to submit via sbatch:

#!/bin/bash
#SBATCH --gres=gpu:8              
#SBATCH --nodes=2               
#SBATCH --mem=16000M              
#SBATCH --time=0-01:00     

But then I get the following error:

sbatch: error: Batch job submission failed: Requested node configuration is not available    

Then I changed the settings to this and submitted again:

#!/bin/bash
#SBATCH --gres=gpu:4              
#SBATCH --nodes=2               
#SBATCH --mem=16000M              
#SBATCH --time=0-01:00  
nvidia-smi

and the result shows only 4 GPUs, not 8.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 0000:03:00.0     Off |                    0 |
| N/A   32C    P0    31W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 0000:04:00.0     Off |                    0 |
| N/A   37C    P0    29W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 0000:82:00.0     Off |                    0 |
| N/A   35C    P0    28W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 0000:83:00.0     Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 12193MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks.

Upvotes: 4

Views: 3214

Answers (2)

WhoCares

Reputation: 11

Job script: You are requesting 2 nodes with 4 GPUs each, so a total of 8 GPUs is assigned to you. You are then running nvidia-smi, which is not aware of Slurm or MPI; it runs only on the first node assigned to you, so it shows only 4 GPUs. That result is normal. If you run a GPU-based engineering application such as Ansys HFSS or CST, it can use all 8 GPUs.
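For example, one way to confirm that both nodes (and all 8 GPUs) were really allocated is to launch nvidia-smi on every node with srun instead of calling it directly. This is only a minimal sketch; the memory and time values are just copied from the question and should be adapted to your cluster:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:4              # 4 GPUs per node, 8 in total
#SBATCH --mem=16000M
#SBATCH --time=0-01:00

# Launch one task per allocated node; each task lists the GPUs of its own node
srun --ntasks-per-node=1 nvidia-smi -L

Each node should then print its own 4 GPUs in the job output, giving 8 lines in total.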

Upvotes: 0

Bub Espinja

Reputation: 4571

Slurm does not support what you need. It can only assign GPUs per node to your job, not GPUs per cluster. So, unlike CPUs and other consumable resources, GPUs are not consumable across nodes and are bound to the node that hosts them.
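In practice that means you request GPUs per node and let your application span the nodes itself, for example with one task per GPU. A rough sketch, assuming a hypothetical executable ./my_multi_gpu_program that handles the multi-node communication (e.g. via MPI or NCCL) on its own:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4       # one task per GPU
#SBATCH --gres=gpu:4              # 4 GPUs on each node, 8 in total
#SBATCH --mem=16000M
#SBATCH --time=0-01:00

# srun starts 8 tasks (4 per node); on each node the tasks see that node's 4 GPUs
srun ./my_multi_gpu_program       # placeholder for your own multi-GPU executable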

If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper. There you'll find how to do it using GPU virtualization technologies.

Upvotes: 2
