Reputation: 99
I hope answering my question would not require a lot of time, because it is about my understanding of this topic.
So, the question is about block and grid sizes for concurrent kernels execution.
First, let me tell about my card: it is GeForce GTX TITAN, and here is some of it's characteristics, which I think are important in this question.
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6144 MBytes (6442123264 bytes)
(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Now, the main problem: I have a kernel(it performs sparse matrix multiplication, but it is not so important) and I want to launch it simultaneously(!) in several streams on one GPU, computing different matrixes multiplication. Please, notice again the simultaneous requirement - I want all the kernels start at one moment, and finish at the another(all of them!), so the solution when these kernels only partly overlap doesn't satisfy me. It is also very important that I want to maximize the number of parallel kernels, even if we lose some performance because of it.
Ok, let`s consider we already have the kernel and we want to specify it's grid and block sizes in in the best way.
Looking to the card characteristics we see it has 14 sm and capability 3.5, which allows to run 32 concurrent kernels. So, the conclusion I make here is that launching 28 concurrent kernels(two per each of 14 SM) would be the best decision. The first question - am I right here?
Now, again, we want to optimize each kernel's block and grid sizes. Ok, let's look to this characteristic:
Maximum number of threads per multiprocessor: 2048
I understand it this way: if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously. if we launch a kernel with 1024 threads and 4 blocks, then two pairs of block will be computed one after another. So, the next conclusion I make is that launching 28 kernels each one with 1024 threads would be also the best solution - because this is the only way when they can be executed simultaneously on each SM. The second question - am I right here? Or there is better solution how to get the simultaneous execution?
It would be very nice if you only say am I right or not, and I would be very grateful if you explain where I mistake or propose a better solution.
Thank you for reading this!
Upvotes: 0
Views: 439
Reputation: 152164
There are a number of questions on concurrent kernels already. You might search and review some of them. You must consider register usage, blocks, threads, and shared memory usage, amongst other things. Your question is not precisely answerable when you don't provide information about register usage or shared memory usage. Maximizing concurrent kernels is partly an occupancy question, so you should study that as well.
Nevertheless, you want to observe maximum concurrent kernels. As you've already pointed out, that is 32.
You have 14 SMs, each of which can have a maximum of 2048 threads. 14x2048/32 = 896 threads per kernel (ie. blocks * threads per block)
With a threadblock size of 128, that would be 7 blocks per kernel. 7 blocks * 32 kernels = 224 blocks total. When we divide this by 14 SMs we get 16 blocks per SM, which just happens to exactly match the spec limit.
So the above analysis, 32 kernels, 7 blocks per kernel, 128 threads per block, would be the extent of the analysis that could be done taking into account only the data you have provided.
If that does not work for you, I'd be sure to make sure I have addressed the requirements for concurrent execution and then focus on registers per thread or shared memory to see if those are limiters for "occupancy" in this case.
Honestly I don't hold out much hope for you witnessing the perfect scenario you describe, but have at it. I'd enjoy being surprised. FYI, if I were trying to do something like this, I would certainly try it on linux rather than windows, especially considering your card is a GeForce card subject to WDDM limitations under windows.
Your understanding seems flawed. Statements like this:
if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously. if we launch a kernel with 1024 threads and 4 blocks, then two pairs of block will be computed one after another
don't make sense to me. Blocks will be computed in whatever order the scheduler deems appropriate, but there is no rule that says two blocks will be computed simultaneously, but four blocks will be computed two by two.
Upvotes: 2