ziliang hu

Reputation: 11

How can I use multiple GPUs on a cluster when someone else has already submitted a task?

Our school has a GPU computing cluster with 8 GPUs on each node, and we use the SLURM workload manager to manage jobs. SLURM is configured so that if a task is already running on a GPU, no new tasks will be assigned to that GPU.

For example: on node1 there are 8 TITAN Xp GPUs, and if no one has submitted a task, we can use all 8 of them. In this situation, I can use simple C++/CUDA code to use all of them, like this:

    for(int i = 0; i < 8; i++) {   // hard-coded to the node's 8 physical GPUs
        cudaSetDevice(i);
        ......
    }

But in most cases someone has already submitted a task, and they may only be using one or two GPUs. For example, their task might be running on the second GPU.

If I submit my task with the same simple code above, it generates an error:

CUDA error at optStream.cu:496 code=10(cudaErrorInvalidDevice) "cudaSetDevice(coreID)"

I don't know how to handle this situation. I don't want to check which GPUs are idle and recompile the program every time; that is too inefficient.

So I need some advice.

Upvotes: 1

Views: 468

Answers (1)

Sigi

Reputation: 4926

SLURM should be correctly setting the CUDA_VISIBLE_DEVICES environment variable to the IDs of the GPUs allocated to your job (hint: echo this variable in your job script; if that is not happening, it must be fixed).
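For example, a quick way to confirm what your job actually sees is to print the variable from inside the program itself (a minimal sketch; std::getenv is standard C++ and returns a null pointer if the variable is not set):

    #include <cstdio>
    #include <cstdlib>

    int main() {
        // SLURM sets CUDA_VISIBLE_DEVICES for the job's GPU allocation;
        // if it is missing, the cluster configuration needs to be fixed.
        const char *visible = std::getenv("CUDA_VISIBLE_DEVICES");
        std::printf("CUDA_VISIBLE_DEVICES = %s\n", visible ? visible : "(not set)");
        return 0;
    }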

In your code you need to use "all the available GPUs", which does not mean all the physically installed GPUs, but only the ones listed in that environment variable.

Your code will be portable with:

    int count = 0;
    cudaGetDeviceCount(&count);   // counts only the GPUs visible to this job
    for(int i = 0; i < count; i++) {
        cudaSetDevice(i);
        ......
    }

Example: if CUDA_VISIBLE_DEVICES=2,3 then your code will run on GPUs 2 and 3, but you will see them as devices 0 and 1 in the code.
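To see the remapping explicitly, here is a minimal self-contained sketch (the device-property query is only there for illustration; prop.name and prop.pciBusID identify which physical card each logical index maps to):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        // Logical device indices always run 0..count-1, regardless of
        // which physical GPUs CUDA_VISIBLE_DEVICES exposes.
        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            std::printf("logical device %d: %s (PCI bus %d)\n",
                        i, prop.name, prop.pciBusID);
        }
        return 0;
    }

With CUDA_VISIBLE_DEVICES=2,3 this prints two logical devices, 0 and 1, even though they are physical GPUs 2 and 3 on the node.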

Upvotes: 2
