Reputation: 8164
Using different streams for CUDA kernels makes concurrent kernel execution possible. Therefore n kernels on n streams could theoretically run concurrently if they fit into the hardware, right?
Now I'm facing the following problem: there are not n distinct kernels but n * m kernels, where m kernels need to be executed in order. For instance, n = 2 and m = 3 would lead to the following execution scheme with streams:
Stream 1: <<<Kernel 0.1>>> <<<Kernel 1.1>>> <<<Kernel 2.1>>>
Stream 2: <<<Kernel 0.2>>> <<<Kernel 1.2>>> <<<Kernel 2.2>>>
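For reference, here is a minimal sketch of such a setup in the runtime API; the kernel and its launch configuration are placeholders:

__global__ void dummyKernel(float *data)
{
    data[threadIdx.x] *= 2.0f;   // stand-in for a very small kernel
}

int main()
{
    const int n = 2;                       // number of streams
    cudaStream_t streams[n];
    for (int s = 0; s < n; ++s)
        cudaStreamCreate(&streams[s]);

    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    // The fourth launch parameter selects the stream; kernels issued
    // to different streams may run concurrently if resources permit.
    dummyKernel<<<1, 256, 0, streams[0]>>>(d_data);
    dummyKernel<<<1, 256, 0, streams[1]>>>(d_data);

    cudaDeviceSynchronize();
    for (int s = 0; s < n; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return 0;
}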
My naive assumption is that the kernels x.1 and y.2 should execute concurrently (from a theoretical point of view), or at least not consecutively (from a practical point of view). But my measurements show that this is not the case and that execution seems to be consecutive (i.e. K0.1, K1.1, K2.1, K0.2, K1.2, K2.2). The kernels themselves are very small, so concurrent execution should not be a problem.
My approach now would be a kind of dispatching to make sure the kernels are enqueued into the GPU's scheduler in an interleaved style. But when dealing with a large number of streams/kernels this could do more harm than good.
Alright, coming straight to the point: what would be an appropriate (or at least different) approach to solving this situation?
Edit: Measurements are done using CUDA events. I've measured the time needed to fully solve the computation, i.e. the GPU has to compute all n * m kernels. The assumption is: with fully concurrent kernel execution, the total time is (ideally) roughly 1/n of the time needed to execute all kernels in order, which requires that two or more kernels can be executed concurrently. I'm ensuring this by only using two distinct streams right now.
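A minimal sketch of that measurement (the actual enqueueing is elided):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);    // recorded in the default stream
// ... enqueue all n * m kernels into their streams here ...
cudaEventRecord(stop, 0);     // default stream waits for all other streams
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("total time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);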
I can measure a clear difference in execution times between dispatching the kernels interleaved and enqueuing them stream by stream, i.e.:
Loop: i = 0 to m-1
    EnqueueKernel(Kernel i.1, Stream 1)
    EnqueueKernel(Kernel i.2, Stream 2)
versus
Loop: i = 1 to n
    Loop: j = 0 to m-1
        EnqueueKernel(Kernel j.i, Stream i)
The latter leads to a longer runtime.
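In runtime-API terms the two variants look like this, assuming a hypothetical kernel step(data, i) and the streams array from the sketch above:

// Interleaved (breadth-first): alternate between the streams at each step.
for (int i = 0; i < m; ++i)
    for (int s = 0; s < n; ++s)
        step<<<1, 256, 0, streams[s]>>>(d_data, i);

// Stream by stream (depth-first): fill one stream completely, then the next.
for (int s = 0; s < n; ++s)
    for (int i = 0; i < m; ++i)
        step<<<1, 256, 0, streams[s]>>>(d_data, i);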
Edit #2: Changed the stream numbers to begin at 1 (instead of 0, see comments below).
Edit #3: Hardware is an NVIDIA Tesla M2090 (i.e. Fermi, compute capability 2.0).
Upvotes: 6
Views: 10471
Reputation: 27879
On Fermi (aka Compute Capability 2.0) hardware it is best to interleave kernel launches across multiple streams rather than launching all kernels to one stream, then all kernels to the next stream, and so on. This is because the hardware can immediately launch kernels to different streams if there are sufficient resources, whereas subsequent launches to the same stream often introduce a delay that reduces concurrency. This is the reason your first approach performs better, and it is the approach you should choose.
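As an aside, whether a device supports concurrent kernel execution at all can be queried from the runtime; a quick check (assuming device 0):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("concurrent kernels: %s\n", prop.concurrentKernels ? "yes" : "no");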
Enabling profiling can also disable concurrency on Fermi, so be careful with that. Also be careful about using CUDA events inside your launch loop, as these can interfere -- it is best to time the whole loop with events, as you are doing.
Upvotes: 6