handfulofsharks

Reputation: 63

How to run multiple jobs on a GPU grid with CUDA using SLURM

I've been working on speeding up processing time on a job using CUDA. Usually this would be fairly straightforward, but I've run into a rather interesting problem. We are using SLURM to schedule our jobs, and adding CUDA code (and enabling its compilation) has cut individual job time in half. The issue is the loading on the GPUs: before enabling CUDA we could run 6 jobs per node, but after enabling CUDA we can only run 2 jobs per node - 1 on each GPU.

Initially, thinking there was something wrong with my submission script, I went and tried adding:

--ntasks-per-node=6

to the submission command.
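For context, a minimal sketch of what the submission looks like with that option added (the partition name, script name, and executable are placeholders, not our real values):

sbatch --partition=gpu --nodes=1 --ntasks-per-node=6 job.sh    # placeholder partition and script name

where job.sh is roughly:

#!/bin/bash
srun ./my_cuda_app    # placeholder executable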

This returns an error stating:

sbatch: error: Batch job submission failed: Requested node configuration is not available

This leads me to believe that my slurm.conf is not configured properly. Any help would be greatly appreciated. I can't exactly post the slurm.conf, but I can look up any settings and/or change them if suggested.
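If it helps with diagnosis, the node and partition configuration SLURM is actually using can be dumped with something like the following (the node and partition names here are placeholders):

scontrol show node gpunode01    # placeholder node name; prints the CPUs, Gres, and memory configured for the node
scontrol show partition gpu     # placeholder partition name; prints per-partition limits and defaults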

Edit: accidentally hit enter when filling out tags before I was ready to submit the question.

Upvotes: 2

Views: 2716

Answers (1)

handfulofsharks

Reputation: 63

It turns out that we had a hidden gres=gpu:1 inside our slurm.conf. Removing it allowed us to submit up to six CUDA + OpenGL jobs to a node with one K80 GPU (six being our limit for CPU-load reasons).
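For anyone hunting for the same thing, the GPU-related entries usually live in slurm.conf and gres.conf and look roughly like the sketch below; the node name, CPU count, and device paths are hypothetical, not our actual configuration:

# slurm.conf (hypothetical node/partition names)
GresTypes=gpu
NodeName=gpunode01 CPUs=24 Gres=gpu:2 State=UNKNOWN
PartitionName=gpu Nodes=gpunode01 Default=YES State=UP

# gres.conf on the node (hypothetical device paths)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

Once no GPU gres request is forced onto every job, SLURM schedules jobs by CPU and memory alone and leaves the GPU unmanaged, so the jobs simply share it - which is what let six fit on a node in our case.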

Upvotes: 1
