When does using more than one stream gain benefit in CUDA?

Question

I have written a CUDA program which already gets a speedup compared to a serial version of 40 (2600k vs GTX 780). Now I am thinking about using several streams for running several kernels parallel. Now my questions are: How can I measure the free resources on my GPU (because if I have no free resources on my GPU the use of streams would make no sense, am I right?), and in which case does the use of streams make sense?
If asked I can provide my code of course, but at the moment I think that it is not needed for the question.

Robert Crovella · Accepted Answer

Running kernels concurrently will only happen if the resources are available for it. A single kernel call that "uses up" the GPU will prevent other kernels from executing in a meaningful way, as you've already indicated, until that kernel has finished executing.

The key resources to think about initially are SMs, registers, shared memory, and threads. Most of these are also related to occupancy, so studying occupancy (both theoretical, i.e. occupancy calculator, as well as measured) of your existing kernels will give you a good overall view of opportunities for additional benefit through concurrent kernels.

In my opinion, concurrent kernels is only likely to show much overall benefit in your application if you are launching a large number of very small kernels, i.e. kernels that encompass only one or a small number of threadblocks, and which make very limited use of shared memory, registers, and other resources.

The best optimization approach (in my opinion) is analysis-driven optimization. This tends to avoid premature or possibly misguided optimization strategies, such as "I heard about concurrent kernels, I wonder if I can make my code run faster with it?" Analysis driven optimization starts out by asking basic utilization questions, using the profiler to answer those questions, and then focusing your optimization effort at improving metrics, such as memory utilization or compute utilization. Concurrent kernels, or various other techniques are some of the strategies you might use to address the findings from profiling your code.

You can get started with analysis-driven optimization with presentations such as this one.

When does using more than one stream gain benefit in CUDA?

Answers (2)

Related Questions