Reputation: 93
In the CUDA C Programming Guide, a stream is defined quite abstractly: a sequence of CUDA operations that are executed in the order they are issued by the code.
My understanding of how instructions are executed on an NVIDIA GPU is: when a kernel is launched, its blocks are distributed to the SMs on the device. Then the warps (groups of 32 threads) are scheduled by a warp scheduler in each SM, and their instructions are processed warp-wise.
So, if two kernels are launched in the same stream, the first is processed before the second (since operations are processed in the order they are placed into the stream). Does that mean the two kernels end up using only the hardware resources of a single kernel? Or does each kernel get its own resources, with the second pending until the first is complete?
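To make the scenario concrete, here is a minimal sketch of what I mean (`kernelA` and `kernelB` are placeholder names, not real kernels):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[threadIdx.x] = 1.0f; }
__global__ void kernelB(float *d) { d[threadIdx.x] += 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Both kernels are issued into the same stream, so kernelB
    // cannot start before kernelA has finished, even if SMs are free.
    kernelA<<<1, 256, 0, s>>>(d);
    kernelB<<<1, 256, 0, s>>>(d);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```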
And in general, how are streams implemented in hardware? I assume a stream provides ordering to the warp scheduler (but a warp scheduler is per-SM, so how would that allow multi-SM kernels to use streams?).
Upvotes: 3
Views: 2353
Reputation: 21818
A CUDA stream is merely a queue of actions to be performed by the GPU. Every API function can be issued asynchronously: the CPU code continues while the operation waits in the queue to be executed, independently of the host code. Still, it is executed synchronously with respect to the other operations in the same queue/stream.
If you want multiple operations on the GPU to execute asynchronously with respect to each other, you need two or more queues/streams. For example, there is a chapter in the CUDA manual on how to overlap kernel execution (first stream) with memory transfers (second stream).
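A minimal sketch of that pattern (an illustration, not the manual's exact example; the `compute` kernel and buffer sizes are made up):

```cpp
#include <cuda_runtime.h>

__global__ void compute(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;

    // Pinned host memory is required for the async copy to overlap with kernels.
    float *h;
    cudaMallocHost(&h, n * sizeof(float));

    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent operations in different streams: the kernel in s1
    // may overlap with the host-to-device copy in s2.
    compute<<<(n + 255) / 256, 256, 0, s1>>>(dA, n);
    cudaMemcpyAsync(dB, h, n * sizeof(float), cudaMemcpyHostToDevice, s2);

    cudaDeviceSynchronize();  // wait for both streams to drain

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    cudaFreeHost(h);
    return 0;
}
```

Note that if the host buffer were ordinary pageable memory instead of pinned memory, the transfer could not overlap with kernel execution even though it is in a separate stream.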
Upvotes: 4