Reputation: 438
If I launch a kernel with 10 blocks of 1000 threads in one stream to analyse an array of data, and then launch a second kernel, also 10 blocks of 1000 threads, in a second stream to analyse another array, what is going to happen?
Are the un-active threads on my card going to start analysing my second array, or is the second stream going to be paused until the first stream has finished?
Thank you.
Upvotes: 0
Views: 1378
Reputation: 152173
Generally speaking, if the kernels are issued from different (non-default) streams of the same application, all requirements for concurrent kernel execution are met, and enough resources are available to schedule both kernels (SMs especially; I guess this is what you mean by "un-active threads"), then some of the blocks of the second kernel will begin executing alongside the blocks of the first kernel that are already executing. This may happen on the same SMs that the first kernel's blocks are already scheduled on, on other, unoccupied SMs, or both. For example, if your GPU has 14 SMs, the work distributor would distribute the 10 blocks of the first kernel across 10 of the SMs, leaving 4 unused at that point.
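To make the scenario concrete, here is a minimal sketch of the two launches described in the question. The kernel `analyse` and its trivial body are hypothetical stand-ins for the real per-array work. The code only issues the kernels into two non-default streams and reports whether the device advertises concurrent kernel support; it does not by itself prove that the kernels overlapped.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for "analyse an array": doubles each element.
__global__ void analyse(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int blocks = 10, threads = 1000, n = blocks * threads;

    // Report the two device properties most relevant to the discussion above.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("SMs: %d, concurrentKernels: %d\n",
           prop.multiProcessorCount, prop.concurrentKernels);

    // One device array per kernel.
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Two non-default streams, so the launches are not serialized
    // by the default stream's implicit synchronization.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same launch configuration, different streams, different arrays.
    analyse<<<blocks, threads, 0, s1>>>(a, n);
    analyse<<<blocks, threads, 0, s2>>>(b, n);

    // Wait for both streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return err == cudaSuccess ? 0 : 1;
}
```

A kernel this short will usually finish before the second launch even reaches the GPU, so any overlap is best checked on a profiler timeline with kernels that run long enough to be seen side by side.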
If, on the other hand, your kernels had threadblocks each requiring 32KB of shared memory, and your GPU had 8 SMs (each able to hold only one such block at a time, as is the case with 48KB of shared memory per SM), then the threadblocks of the first kernel would effectively "use up" the 8 SMs, and the threadblocks of the second kernel would not begin executing until some of the threadblocks of the first kernel had "drained", i.e. completed and been retired. That's just one example of resource utilization that could inhibit concurrent execution. And of course, if you were launching kernels with many threadblocks each (say 100 or more), the first kernel would mostly occupy the machine, and the second kernel would not begin executing until the first kernel had largely finished.
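One way to estimate whether a resource such as shared memory will crowd out the second kernel is to ask the runtime how many blocks of a given kernel fit on one SM. The sketch below uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for that. The kernel `analyseShared` is hypothetical, and the 32KB and 10-block figures are simply the numbers from the example above. The "free slots" value is a rough estimate that assumes the second kernel has the same resource needs as the first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that stages data through dynamically sized shared memory.
__global__ void analyseShared(float *data, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];
        data[i] = tile[threadIdx.x] * 2.0f;
    }
}

int main()
{
    const int blocks = 10, threads = 1000;
    const size_t smemPerBlock = 32 * 1024;  // 32KB per block, as in the example

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    // Ask the runtime how many such blocks can be resident on one SM at once.
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, analyseShared, threads, smemPerBlock);
    if (err != cudaSuccess) {
        printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Total resident blocks the whole device can hold for this kernel.
    int deviceSlots = blocksPerSM * prop.multiProcessorCount;
    int freeSlots = deviceSlots > blocks ? deviceSlots - blocks : 0;

    printf("SMs: %d, blocks per SM: %d, device-wide slots: %d\n",
           prop.multiProcessorCount, blocksPerSM, deviceSlots);
    printf("slots left after a %d-block kernel is resident: %d\n",
           blocks, freeSlots);
    return 0;
}
```

If the last figure is zero, blocks of a second, identical kernel would have to wait for blocks of the first to retire, which is the "drained" situation described above.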
If you search in the upper right-hand corner for "cuda concurrent kernels", you'll find a number of questions that highlight some of the challenges associated with observing concurrent kernel execution.
Upvotes: 1