Reputation: 105
We have a multi-GPU framework (on Windows) in which one can specify 'jobs'; each job also specifies the GPU on which it is to be executed. Currently, on startup of the framework we create one 'worker thread' per GPU, which then waits for jobs to process. Specifically, we use the 'GPUWorker' class from https://devtalk.nvidia.com/search/more/sitecommentsearch/GPUworker/
It works nicely so far, but has some serious performance-related disadvantages:
In our framework, a specific GPU is locked for the whole duration of a 'job', even if the GPU is actually in use for only 50 % of that time. Note that the jobs have a very coarse granularity, e.g. 'do optical flow calculation', which can take 50 - 100 milliseconds.
One cannot specify 'asynchronous' jobs (e.g. an asynchronous host-device copy) which do not lock the GPU.
So I am now thinking about 'better' strategies for this problem. My idea is as follows: for each new job that is 'launched', I create a new 'temporary' CPU thread. The CPU thread then selects the GPU on which the work is to be done (via 'cudaSetDevice'). I suppose that at this point a CUDA context is also created (transparently for me). After setting the correct device, the CPU thread executes the 'doWork' function of the job. Depending on whether the job is to run synchronously or asynchronously, a 'join' is done (waiting for the CPU thread to complete) or not.
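To make the idea concrete, here is a minimal sketch of the thread-per-job strategy in standard C++. The names `Job`, `launchJob` and `runSync` are hypothetical and not part of our framework, and the `cudaSetDevice` call is left as a comment so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical job description: the target GPU plus the work to run on it.
struct Job {
    int device;                    // index of the GPU the job is bound to
    std::function<void()> doWork;  // body of the job
};

// Start a temporary CPU thread for one job and hand the thread back.
// An asynchronous caller keeps the std::thread and joins it later.
inline std::thread launchJob(Job job) {
    return std::thread(
        [](Job j) {
            // In a CUDA build, the thread would select its GPU here:
            // cudaSetDevice(j.device);
            j.doWork();
        },
        std::move(job));
}

// Synchronous variant: launch the job and wait for it immediately.
inline void runSync(Job job) {
    launchJob(std::move(job)).join();
}
```

A synchronous job would go through `runSync(...)`, while an asynchronous one would keep the `std::thread` returned by `launchJob(...)` and join it before its results are needed.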
I have now several questions:
Is that a 'good' strategy, or does somebody know of a better way to handle this? Of course, it must be a thread-safe strategy.
In my proposed strategy, what is the typical overhead (in milliseconds) of creating the new CPU thread and of the (hidden) creation of the CUDA context? Furthermore, if the creation of the CUDA context is significant, is there a way (e.g. using the CUDA driver API and some sort of 'context migration') to reduce this overhead?
Upvotes: 1
Views: 1620
Reputation: 4422
Your first approach sounds more promising than the alternative that you are considering.
Creating CPU threads and initializing CUDA contexts is quite expensive, and it is difficult, if not impossible, for you to make that operation faster. NVIDIA deliberately front-loads a lot of operations into the context creation process, so that you don't get unexpected delays or failures later on due to a resource allocation failure.
Your best bet is to invest in asynchrony. Without CPU/GPU concurrency, you are definitely leaving performance on the table because you are not hiding the CPU overhead that's built into the CUDA driver.
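As a rough illustration of keeping one long-lived worker per GPU while still allowing asynchronous jobs, here is a minimal sketch in standard C++. The class name `DeviceWorker` and its `submit` method are hypothetical (this is not the GPUWorker API), and the `cudaSetDevice` call appears only as a comment so the sketch compiles without the CUDA toolkit. The point is that the thread, and whatever CUDA initialization it triggers, is paid for once, and each job returns a `std::future` that the caller may wait on or ignore.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical persistent worker: one instance (and one thread) per GPU.
class DeviceWorker {
public:
    explicit DeviceWorker(int device)
        : device_(device), thread_([this] { run(); }) {}

    ~DeviceWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();  // remaining queued jobs are drained before exit
    }

    DeviceWorker(const DeviceWorker&) = delete;
    DeviceWorker& operator=(const DeviceWorker&) = delete;

    // Queue a job. Waiting on the returned future makes the call
    // synchronous; not waiting makes it asynchronous.
    std::future<void> submit(std::function<void()> work) {
        auto task =
            std::make_shared<std::packaged_task<void()>>(std::move(work));
        std::future<void> done = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return done;
    }

private:
    void run() {
        // In a CUDA build, the device would be selected once per thread:
        // cudaSetDevice(device_);
        (void)device_;
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested, nothing left
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();  // run outside the lock so submit() is never blocked
        }
    }

    int device_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool stop_ = false;
    std::thread thread_;  // declared last so it starts after the other members
};
```

With something like this, a coarse job such as the optical flow calculation could be split into smaller pieces that are submitted individually, and the job bodies themselves could issue asynchronous CUDA work (for example `cudaMemcpyAsync` on a stream) so that the CPU thread is not blocked while the GPU is busy.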
Upvotes: 2