Reputation: 3267
I have a kernel that runs on my GPU (GeForce 690) and uses a single block. It runs in about 160 microseconds. My plan is to launch 8 of these kernels separately, each of which only uses a single block, so each would run on a separate SM, and then they'd all run concurrently, hopefully in about 160 microseconds.
However, when I do that, the total time increases linearly with each kernel: 320 microseconds if I run 2 kernels, about 490 microseconds for 3 kernels, etc.
My question: Do I need to set any flag somewhere to get these kernels to run concurrently? Or do I have to do something that isn't obvious?
Upvotes: 1
Views: 1242
Reputation: 152259
As @JackOLantern indicated, concurrent kernel execution requires the use of streams, which are needed for all forms of asynchronous activity scheduling on the GPU. Generally speaking, it also requires a GPU of compute capability 2.0 or greater. If you do not use streams in your application, all CUDA API calls and kernel launches are executed sequentially, in the order in which they were issued in the code, with no overlap from one call/kernel to the next.
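As a minimal sketch of that pattern (the kernel name myKernel and its dummy workload are placeholders, not your actual code), the key point is to launch each single-block kernel into its own non-default stream:

    #include <cuda_runtime.h>

    // Placeholder single-block kernel; substitute your own.
    __global__ void myKernel(float *data, int n)
    {
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            data[i] *= 2.0f;
    }

    int main()
    {
        const int NKERNELS = 8;
        const int N = 1024;
        cudaStream_t streams[NKERNELS];
        float *d_data[NKERNELS];

        for (int i = 0; i < NKERNELS; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&d_data[i], N * sizeof(float));
        }

        // Each launch goes into its own stream (4th launch-config
        // parameter), so the hardware is free to schedule the
        // single-block kernels concurrently on separate SMs.
        // Launching them all into the default stream would instead
        // serialize them, which matches the timing you observed.
        for (int i = 0; i < NKERNELS; ++i)
            myKernel<<<1, 256, 0, streams[i]>>>(d_data[i], N);

        cudaDeviceSynchronize();

        for (int i = 0; i < NKERNELS; ++i) {
            cudaStreamDestroy(streams[i]);
            cudaFree(d_data[i]);
        }
        return 0;
    }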
Rather than give a complete tutorial here, please review the concurrentKernels CUDA sample that @JackOLantern referenced.
Also note that actually witnessing concurrent execution can be more difficult on Windows, for a variety of reasons. If you run the concurrentKernels sample, it will indicate pretty quickly whether the environment you are in (OS, driver, etc.) is providing concurrent execution.
Upvotes: 4