user2454869

Reputation: 105

Good strategy for multi-GPU handling with CPU threads, CUDA context creation overhead

We have a multi-GPU framework (on Windows) in which one can specify 'jobs' (which also specify the GPU they shall run on), which are then executed on that GPU. Currently, our approach is that on startup of the framework we create one 'worker thread' per GPU, which then waits for jobs to process (roughly sketched below). Specifically, we use the 'GPUWorker' class from https://devtalk.nvidia.com/search/more/sitecommentsearch/GPUworker/
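For illustration, a minimal sketch of this kind of persistent per-GPU worker (class and member names are mine, not the actual GPUWorker code):

```cpp
#include <cuda_runtime.h>

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One persistent worker thread per GPU with its own job queue; the CUDA
// context is bound once when the thread starts and is reused for all jobs.
class GpuWorker {
public:
    explicit GpuWorker(int device) : device_(device) {
        thread_ = std::thread([this] { run(); });
    }

    ~GpuWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Enqueue a job to be executed on this worker's GPU.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        cudaSetDevice(device_);  // context creation cost is paid once, here
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    int device_;
    bool stop_ = false;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::thread thread_;
};
```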

It works nicely so far, but has some serious performance-related disadvantages:

So I am now thinking about a 'better' strategy for that problem. My idea is as follows: for each new job that is 'launched', I create a new 'temporary' CPU thread. The CPU thread then sets the device number (via 'cudaSetDevice') of the GPU on which the work shall be done. I suppose that at this point a CUDA context is also created (transparently for me). After setting the correct device, the 'doWork' function of the job is executed by the CPU thread. Depending on whether the job shall be done synchronously or asynchronously, a 'join' is done (waiting for the CPU thread to complete) or not.
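In code, the idea would be roughly this (a minimal sketch; runJob and doWork are placeholder names, not part of any existing API):

```cpp
#include <cuda_runtime.h>

#include <functional>
#include <thread>

// Spawn a short-lived CPU thread per job, bind it to the requested GPU,
// run the work, and either wait for it (synchronous) or let it run on its own.
void runJob(int device, std::function<void()> doWork, bool synchronous) {
    std::thread t([device, work = std::move(doWork)] {
        // cudaSetDevice implicitly creates (or attaches to) a CUDA context
        // for this thread -- this is where the per-job overhead would live.
        cudaSetDevice(device);
        work();
    });
    if (synchronous)
        t.join();     // block the caller until the job has finished
    else
        t.detach();   // fire-and-forget; real code would track completion
}
```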

I now have several questions:

Upvotes: 1

Views: 1620

Answers (1)

ArchaeaSoftware

Reputation: 4422

Your first approach sounds more promising than the alternative that you are considering.

Creating CPU threads and initializing CUDA contexts is quite expensive, and it's difficult to impossible for you to make that operation faster. NVIDIA deliberately front-loads a lot of work into context creation, so you don't get unexpected delays or failures later due to resource allocation.

Your best bet is to invest in asynchrony. Without CPU/GPU concurrency, you are definitely leaving performance on the table because you are not hiding the CPU overhead that's built into the CUDA driver.
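As a rough sketch of what that can look like inside a persistent worker thread (processAsync and myKernel are placeholders, not a specific API): a CUDA stream plus asynchronous copies and launches let the CPU thread queue up work and move on instead of blocking on every call.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for a real job's work.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Queue copy-in, kernel, copy-out on a stream without blocking the CPU thread.
// For the copies to truly overlap, hostIn/hostOut should be pinned memory
// (allocated with cudaHostAlloc / cudaMallocHost).
void processAsync(int device, const float* hostIn, float* hostOut, int n) {
    cudaSetDevice(device);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* devBuf = nullptr;
    cudaMalloc(&devBuf, n * sizeof(float));

    // All three operations return immediately; the CPU thread is free to
    // prepare or enqueue the next job while the GPU works.
    cudaMemcpyAsync(devBuf, hostIn, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
    cudaMemcpyAsync(hostOut, devBuf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // Synchronize only when the result is actually needed.
    cudaStreamSynchronize(stream);

    cudaFree(devBuf);
    cudaStreamDestroy(stream);
}
```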

Upvotes: 2
