Reputation: 2691
I don't have a CUDA card yet, and I have to focus on OpenCL for now. So I think I'd better just ask:
1. Are kernels executed in the order I invoke them?
If I invoke A through stream 0, B through stream 1, C through stream 0, D through stream 1, and E through stream 0, is it ensured that the device sees the kernels in the order A, B, C, D, E?
If I invoke kernels A and B through stream 0, and then invoke C through stream 1, will B block C? Do I have to invoke them in the order A, C, B to allow C to run concurrently with A and B?
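To make the launch pattern concrete, here is a rough CUDA sketch of what I mean (A through E are placeholder kernels, and grid/block sizes are arbitrary):

```cuda
// Sketch of the launch pattern described above; A..E are placeholder kernels.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

dim3 grid(64), block(256);
A<<<grid, block, 0, s0>>>();  // stream 0
B<<<grid, block, 0, s1>>>();  // stream 1
C<<<grid, block, 0, s0>>>();  // stream 0: ordered after A
D<<<grid, block, 0, s1>>>();  // stream 1: ordered after B
E<<<grid, block, 0, s0>>>();  // stream 0: ordered after A and C
```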
2. Are there any stalls or penalties if I want kernels to run concurrently?
On AMD cards, inter-queue dependencies seem to be very expensive (I may be wrong; actually I hope I'm wrong, but so far no one has been able to tell me either way). Say I have kernels A, B, and C, where A and B are independent and C depends on both A and B. On AMD cards there is a huge delay if I make C wait on A or B, which makes serialized execution faster in almost all situations.
What I understand now is that a CUDA card has only one queue for computation. That means I can express dependencies through the order in which I invoke kernels, instead of through events as on AMD cards. Will that be more efficient, or even penalty-free?
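For comparison, here is roughly how the cross-stream dependency (C after A and B) would be expressed with CUDA events; A, B, C are placeholder kernels and the launch configuration is arbitrary:

```cuda
// Sketch: C depends on A and B, with A and B in different streams.
cudaStream_t s0, s1;
cudaEvent_t doneB;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
cudaEventCreateWithFlags(&doneB, cudaEventDisableTiming);

dim3 grid(64), block(256);
A<<<grid, block, 0, s0>>>();  // stream 0
B<<<grid, block, 0, s1>>>();  // stream 1
cudaEventRecord(doneB, s1);   // marks completion of B

// C runs in stream 0, so it is already ordered after A;
// only the dependency on B needs an explicit event.
cudaStreamWaitEvent(s0, doneB, 0);
C<<<grid, block, 0, s0>>>();
```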
Upvotes: 0
Views: 83
Reputation: 2691
On newer devices, kernels from different streams can be executed out of order. The behavior I described in the question would only occur on very old architectures.
A kernel will start executing as soon as possible. Invoking A and B in different streams with B waiting on A makes no obvious difference compared with invoking A and B in order in the same stream.
Upvotes: 0
Reputation: 6333
It depends on the command queue you created. If it is an in-order queue, kernels are executed in the order you submitted them. If it is an out-of-order queue, the runtime may execute them out of order and perhaps even concurrently, but it does not have to. Some devices or drivers don't support out-of-order queues and just treat them as in-order.
Managing an out-of-order command queue moves the dependency burden onto the host application: you need to use event objects to build a dependency graph.
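A minimal sketch of that event-based approach, assuming `ctx`, `dev`, and the kernels `kA`, `kB`, `kC` have already been created (error checking omitted):

```c
/* Out-of-order queue: dependencies are expressed with events. */
cl_int err;
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0};
cl_command_queue q =
    clCreateCommandQueueWithProperties(ctx, dev, props, &err);

size_t gws = 1024;  /* arbitrary global work size */
cl_event evA, evB;
clEnqueueNDRangeKernel(q, kA, 1, NULL, &gws, NULL, 0, NULL, &evA);
clEnqueueNDRangeKernel(q, kB, 1, NULL, &gws, NULL, 0, NULL, &evB);

/* kC waits on both kA and kB via its event wait list. */
cl_event deps[2] = {evA, evB};
clEnqueueNDRangeKernel(q, kC, 1, NULL, &gws, NULL, 2, deps, NULL);
clFinish(q);
```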
Another (I think easier) way to get concurrent execution is to use multiple (likely in-order) command queues. Put independent work in each, and the runtime is allowed to run kernels (one from each) concurrently. It doesn't have to, but if it can, it should.
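Sketched out, the multi-queue approach looks like this; `ctx`, `dev`, `kA`, and `kB` are assumed to exist already, and passing `NULL` properties gives default (in-order) queues:

```c
/* Two in-order queues on the same context/device; independent kernels
 * kA and kB may then run concurrently if the hardware allows it. */
cl_int err;
cl_command_queue q0 = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
cl_command_queue q1 = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

size_t gws = 1024;  /* arbitrary global work size */
clEnqueueNDRangeKernel(q0, kA, 1, NULL, &gws, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kB, 1, NULL, &gws, NULL, 0, NULL, NULL);

clFinish(q0);
clFinish(q1);
```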
Upvotes: 1