Reputation: 4114
I have written a CUDA program that already achieves a 40x speedup over a serial version (i7-2600K vs. GTX 780). Now I am thinking about using several streams to run several kernels in parallel. My questions are: how can I measure the free resources on my GPU (if there are no free resources, using streams would make no sense, am I right?), and in which cases does using streams make sense?
If asked, I can of course provide my code, but at the moment I do not think it is needed for the question.
Upvotes: 0
Views: 752
Reputation: 2445
If you specify no stream, stream 0 (the default stream) is used. According to Wikipedia (you can also read it from the cudaDeviceProp structure), your GTX 780 has 12 streaming multiprocessors, so there could be an improvement if you use multiple streams. The asyncEngineCount property tells you how many asynchronous copy engines the device has, i.e. how many memory copies can run concurrently with kernel execution.
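To check these values on your own card rather than relying on Wikipedia, something like the following minimal sketch should work. It assumes the GTX 780 is device 0 and only reads fields of the cudaDeviceProp structure; the choice of which fields to print is mine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device:                 %s\n", prop.name);
    std::printf("Multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
    // Number of asynchronous copy engines.
    std::printf("Async copy engines:     %d\n", prop.asyncEngineCount);
    // Non-zero if the device can run several kernels at the same time.
    std::printf("Concurrent kernels:     %d\n", prop.concurrentKernels);
    std::printf("Registers per block:    %d\n", prop.regsPerBlock);
    std::printf("Shared mem per block:   %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

Note that these are static limits of the device. They do not tell you how much of the GPU your existing kernel actually leaves free at run time.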
The idea of using streams is to use an asynchronous memory copy engine (aka DMA engine) to overlap kernel execution with device-to-host transfers. The number of streams that gives the best performance is hard to guess, because it depends on the number of DMA engines you have, the number of SMs, and the balance between synchronization and the amount of concurrency. To get an idea you can read this presentation (for instance, slides 5 and 6 explain the idea very well).
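A rough sketch of that overlap pattern is below. The kernel `scale`, the chunk size, and the choice of 4 streams are placeholders I made up for illustration, not values tuned for your application. The host buffer is pinned with cudaMallocHost, since asynchronous copies generally need page-locked host memory to overlap with kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int nStreams = 4;          // assumption: tune this by profiling
    const int chunk = 1 << 20;       // elements handled per stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_data = nullptr, *d_data = nullptr;
    cudaMallocHost((void **)&h_data, n * sizeof(float));  // pinned host memory
    cudaMalloc((void **)&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, processes it, and copies it back.
    // Work queued in different streams may overlap on the device.
    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset,
                                                           chunk, 2.0f);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    std::printf("first element after processing: %f\n", h_data[0]);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

Whether this actually overlaps on your card is something the profiler timeline will show. The code only makes overlap possible, it does not guarantee it.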
Edit: I agree that using a profiler is needed as a first step.
Upvotes: -1
Reputation: 151849
Running kernels concurrently will only happen if the resources are available for it. A single kernel call that "uses up" the GPU will prevent other kernels from executing in a meaningful way, as you've already indicated, until that kernel has finished executing.
The key resources to think about initially are SMs, registers, shared memory, and threads. Most of these are also related to occupancy, so studying the occupancy of your existing kernels (both theoretical, i.e. from the occupancy calculator, and measured) will give you a good overall view of the opportunities for additional benefit through concurrent kernels.
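As an alternative to the occupancy calculator spreadsheet, theoretical occupancy can also be estimated in code with the runtime occupancy API (available since CUDA 6.5). The following is a minimal sketch: `myKernel` and the block size of 256 are placeholders standing in for your own kernel and launch configuration, and device 0 is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel you want to analyze.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int device = 0;       // assumption: GPU of interest is device 0
    const int blockSize = 256;  // assumption: your launch block size

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Maximum number of blocks of this kernel that can be resident on one SM,
    // given its register and shared memory usage (0 bytes dynamic shared mem).
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, blockSize, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    const float occupancy = static_cast<float>(blocksPerSM * blockSize) /
                            prop.maxThreadsPerMultiProcessor;
    const int residentBlocks = blocksPerSM * prop.multiProcessorCount;

    std::printf("Blocks per SM:           %d\n", blocksPerSM);
    std::printf("Theoretical occupancy:   %.0f%%\n", occupancy * 100.0f);
    std::printf("Resident blocks on GPU:  %d\n", residentBlocks);
    return 0;
}
```

If your kernel launches many more blocks than the resident-block figure, it fills the GPU on its own and leaves little room for a second kernel to run alongside it. This is only the theoretical number; measured occupancy comes from the profiler.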
In my opinion, concurrent kernels are only likely to show much overall benefit in your application if you are launching a large number of very small kernels, i.e. kernels that encompass only one or a small number of threadblocks, and which make very limited use of shared memory, registers, and other resources.
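To illustrate the "many very small kernels" case, here is a small experiment sketch. `tinyKernel`, the iteration count, and the choice of 8 single-block launches are invented for the illustration. It times the same launches once in the default stream and once spread over separate streams, using a host timer around cudaDeviceSynchronize.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tiny kernel: one block, few threads, some arithmetic per thread.
__global__ void tinyKernel(float *out, int iters) {
    float v = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) v = v * 0.999f + 1.0f;
    out[threadIdx.x] = v;
}

int main() {
    const int nKernels = 8;    // assumption: number of independent small tasks
    const int threads = 64;
    const int iters = 1000000;

    float *d_out = nullptr;
    cudaMalloc((void **)&d_out, nKernels * threads * sizeof(float));

    cudaStream_t streams[nKernels];
    for (int k = 0; k < nKernels; ++k) cudaStreamCreate(&streams[k]);

    // Warm-up launch so one-time initialization cost is not timed.
    tinyKernel<<<1, threads>>>(d_out, iters);
    cudaDeviceSynchronize();

    // All launches in the default stream: they execute one after another.
    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < nKernels; ++k)
        tinyKernel<<<1, threads>>>(d_out + k * threads, iters);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    // One launch per stream: the kernels are allowed to run concurrently.
    for (int k = 0; k < nKernels; ++k)
        tinyKernel<<<1, threads, 0, streams[k]>>>(d_out + k * threads, iters);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::steady_clock::now();

    std::printf("default stream:   %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    std::printf("separate streams: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t2 - t1).count());

    for (int k = 0; k < nKernels; ++k) cudaStreamDestroy(streams[k]);
    cudaFree(d_out);
    return 0;
}
```

Each launch here uses a single block, so one kernel occupies at most one SM and the rest of the GPU is free for the others. With a kernel that already fills the device, I would expect the two timings to be about the same.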
The best optimization approach (in my opinion) is analysis-driven optimization. This tends to avoid premature or possibly misguided optimization strategies, such as "I heard about concurrent kernels, I wonder if I can make my code run faster with it?" Analysis-driven optimization starts out by asking basic utilization questions, using the profiler to answer those questions, and then focusing your optimization effort on improving metrics such as memory utilization or compute utilization. Concurrent kernels and various other techniques are some of the strategies you might use to address the findings from profiling your code.
You can get started with analysis-driven optimization with presentations such as this one.
Upvotes: 4