Reputation: 11
My school has a GPU computing cluster with 8 GPUs per node, and we use the SLURM workload manager to manage tasks. SLURM guarantees that once a task is running on a GPU, no new tasks will be assigned to that GPU.
For example: node1 has 8 TITAN Xp GPUs. If no one else has submitted a task, I can use all 8 of them with simple C++/CUDA code like this:
for (int i = 0; i < 8; i++) {
    cudaSetDevice(i);
    ......
}
But in most cases someone has already submitted a task using one or two GPUs, for example a task running on the second GPU. If I then submit my task with the simple code above, it generates an error:
CUDA error at optStream.cu:496 code=10(cudaErrorInvalidDevice) "cudaSetDevice(coreID)"
I don't know how to handle this situation. I don't want to check which GPUs are idle and recompile the program each time; that would be too inefficient.
So I need some advice.
Upvotes: 1
Views: 468
Reputation: 4926
SLURM should be correctly setting the CUDA_VISIBLE_DEVICES
environment variable to the IDs of the GPUs allocated to your job (hint: echo this variable in your job script; if it's not being set, that must be fixed).
In your code, "use all the available GPUs" does not mean all the physically installed GPUs, but only the ones listed in that environment variable.
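To check that the variable is actually being set, a minimal job-script sketch (the `--gres` syntax and the executable name are assumptions; adapt them to your cluster):

```shell
#!/bin/bash
#SBATCH --gres=gpu:2    # request 2 GPUs; exact syntax depends on cluster configuration

# Print the GPUs SLURM allocated to this job. If this comes out
# empty or unset, the SLURM/GPU integration must be fixed.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

./my_program            # hypothetical executable name
```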
Your code will be portable with:
int count;
cudaGetDeviceCount(&count);
for (int i = 0; i < count; i++) {
    cudaSetDevice(i);
    ......
}
Example: if CUDA_VISIBLE_DEVICES=2,3,
then your code will run on physical GPUs 2 and 3, but they will appear as devices 0 and 1 inside the code.
Upvotes: 2