Reputation: 3267
I have a kernel that runs on my GPU (GeForce 690) and uses a single block. It runs in about 160 microseconds. My plan is to launch 8 of these kernels separately, each of which only uses a single block, so each would run on a separate SM, and then they'd all run concurrently, hopefully in about 160 microseconds.
However, when I do that, the total time increases linearly with each kernel: 320 microseconds if I run 2 kernels, about 490 microseconds for 3 kernels, etc.
My question: Do I need to set any flag somewhere to get these kernels to run concurrently? Or do I have to do something that isn't obvious?
Upvotes: 1
Views: 1242
Reputation: 152259
As @JackOLantern indicated, concurrent kernel execution requires the use of streams, which are needed for all forms of asynchronous activity scheduling on the GPU. Generally speaking, it also requires a GPU of compute capability 2.0 or greater. If you do not use streams in your application, all CUDA API calls and kernel launches are executed sequentially, in the order in which they were issued in the code, with no overlap from one call/kernel to the next.
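As a minimal sketch of that pattern (the kernel name myKernel and its dummy workload are placeholders, not your actual code), the key point is to launch each single-block kernel into its own non-default stream:

    #include <cuda_runtime.h>

    // Placeholder single-block kernel; substitute your own.
    __global__ void myKernel(float *data, int n)
    {
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            data[i] *= 2.0f;
    }

    int main()
    {
        const int NKERNELS = 8;
        const int N = 1024;
        cudaStream_t streams[NKERNELS];
        float *d_data[NKERNELS];

        for (int i = 0; i < NKERNELS; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&d_data[i], N * sizeof(float));
        }

        // Each launch goes into its own stream (4th launch-config
        // parameter), so the hardware is free to schedule the
        // single-block kernels concurrently on separate SMs.
        // Launching them all into the default stream would instead
        // serialize them, which matches the timing you observed.
        for (int i = 0; i < NKERNELS; ++i)
            myKernel<<<1, 256, 0, streams[i]>>>(d_data[i], N);

        cudaDeviceSynchronize();

        for (int i = 0; i < NKERNELS; ++i) {
            cudaStreamDestroy(streams[i]);
            cudaFree(d_data[i]);
        }
        return 0;
    }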
Rather than give a complete tutorial here, please review the concurrentKernels CUDA sample that @JackOLantern referenced.
Also note that actually witnessing concurrent execution can be more difficult on Windows, for a variety of reasons. If you run the concurrentKernels sample, it will indicate pretty quickly whether the environment you are in (OS, driver, etc.) is providing concurrent execution.
Upvotes: 4