CUDA concurrent execution

Question

I hope this question will not take long to answer, since it is mainly about checking my understanding of the topic.

So, the question is about block and grid sizes for concurrent kernel execution.

First, let me describe my card: it is a GeForce GTX TITAN, and here are some of its characteristics that I think are important for this question.

CUDA Capability Major/Minor version number: 3.5

Total amount of global memory: 6144 MBytes (6442123264 bytes)

(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores

Warp size: 32

Maximum number of threads per multiprocessor: 2048

Maximum number of threads per block: 1024

Now, the main problem: I have a kernel (it performs sparse matrix multiplication, but that is not so important) and I want to launch it simultaneously(!) in several streams on one GPU, each computing a different matrix multiplication. Please note the simultaneity requirement again: I want all the kernels to start at the same moment and to finish at the same moment (all of them!), so a solution in which the kernels only partly overlap does not satisfy me. It is also very important that I want to maximize the number of parallel kernels, even if we lose some performance because of it.

OK, let's assume we already have the kernel and we want to specify its grid and block sizes in the best way.

Looking at the card characteristics, we see it has 14 SMs and compute capability 3.5, which allows up to 32 concurrent kernels. So the conclusion I make here is that launching 28 concurrent kernels (two on each of the 14 SMs) would be the best decision. The first question: am I right here?

Now, again, we want to optimize each kernel's block and grid sizes. OK, let's look at this characteristic:
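To make my reasoning explicit, here is the arithmetic as a small host-only C++ sketch. The function names are hypothetical (mine, not from any CUDA API), and the sketch assumes that every kernel consists of exactly one block and that blocks get spread evenly over the SMs, which as far as I know the hardware does not promise.

```cpp
// Arithmetic behind "28 kernels, two per SM".
// ASSUMPTION: one block per kernel, blocks spread evenly over the SMs.
// The names below are illustrative only; nothing here calls the CUDA runtime.

constexpr int kNumSMs = 14;                // "(14) Multiprocessors" from deviceQuery
constexpr int kMaxConcurrentKernels = 32;  // device-wide limit for compute capability 3.5

// How many single-block kernels fit on each SM if they are spread evenly.
constexpr int kernelsPerSM(int numSMs, int maxKernels) {
    return maxKernels / numSMs;  // integer division rounds down
}

// Largest kernel count that is a multiple of the SM count
// and still stays within the concurrent-kernel limit.
constexpr int evenKernelCount(int numSMs, int maxKernels) {
    return kernelsPerSM(numSMs, maxKernels) * numSMs;
}
```

With my card's numbers, `kernelsPerSM(kNumSMs, kMaxConcurrentKernels)` is 2 and `evenKernelCount(kNumSMs, kMaxConcurrentKernels)` is 28, which is where the figure above comes from.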

Maximum number of threads per multiprocessor: 2048

I understand it this way: if we launch a kernel with 2 blocks of 1024 threads each, these two blocks will be computed simultaneously. If we launch a kernel with 4 blocks of 1024 threads each, then two pairs of blocks will be computed one after another. So the next conclusion I make is that launching 28 kernels, each with 1024 threads, would also be the best solution, because this is the only way they can all be executed simultaneously on each SM. The second question: am I right here? Or is there a better way to get simultaneous execution?

It would be very nice if you could just say whether I am right or not, and I would be very grateful if you explained where I am mistaken or proposed a better solution.
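Again as a host-only C++ sketch, this is the thread-count arithmetic I have in mind. It only models the 2048-threads-per-SM limit and the limit of 16 resident blocks per SM for compute capability 3.5 (that second number is not in the output I pasted above). It ignores register and shared memory usage, and it pretends that all blocks of a grid land on a single SM, so please treat it as a simplified model with hypothetical function names rather than a description of the real scheduler.

```cpp
// Simplified per-SM model of "how many blocks run at once".
// ASSUMPTIONS: only the thread limit and the resident-block limit matter
// (registers and shared memory are ignored), and all blocks of a grid
// are placed on one SM. The names below are illustrative only.

constexpr int kMaxThreadsPerSM = 2048;  // from deviceQuery
constexpr int kMaxBlocksPerSM = 16;     // resident-block limit, compute capability 3.5

// Blocks of the given size that can be resident on one SM at the same time.
constexpr int residentBlocksPerSM(int threadsPerBlock) {
    return (kMaxThreadsPerSM / threadsPerBlock) < kMaxBlocksPerSM
               ? (kMaxThreadsPerSM / threadsPerBlock)
               : kMaxBlocksPerSM;
}

// Sequential "waves" needed to run gridBlocks blocks on one SM (ceiling division).
constexpr int wavesOnOneSM(int gridBlocks, int threadsPerBlock) {
    return (gridBlocks + residentBlocksPerSM(threadsPerBlock) - 1) /
           residentBlocksPerSM(threadsPerBlock);
}

// Threads resident on the whole device when numKernels single-block kernels run.
constexpr int totalThreads(int numKernels, int threadsPerBlock) {
    return numKernels * threadsPerBlock;
}
```

In this model `wavesOnOneSM(2, 1024)` is 1 and `wavesOnOneSM(4, 1024)` is 2, matching the two cases above, and `totalThreads(28, 1024)` equals 14 × 2048 = 28672, i.e. every SM would be exactly full.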

Thank you for reading this!
