Reputation: 6160
Guys, please confirm whether my thinking is right.
Suppose I have a kernel function mykernel(double *array). I want the code inside the kernel to be executed 128 times. I can do this in two ways when invoking the kernel from the host:
mykernel<<<128, 1>>>(myarray);
//or
mykernel<<<1, 128>>>(myarray);
With the first invocation I create 128 blocks, each running 1 thread. With the second invocation I create 1 block with 128 threads. Since the code inside the kernel works on the same array, I believe the second invocation is more efficient.
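A minimal sketch of what such a kernel and the two launches might look like (the kernel body and the index arithmetic are assumptions, since the question doesn't show them; error checking omitted):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel body: each thread handles one element of the array.
// The global index formula works for both launch configurations:
//   <<<128, 1>>> : blockIdx.x in 0..127, threadIdx.x == 0
//   <<<1, 128>>> : blockIdx.x == 0,      threadIdx.x in 0..127
__global__ void mykernel(double *array)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    array[i] *= 2.0;  // example work: double each element
}

int main()
{
    double *myarray;
    cudaMalloc(&myarray, 128 * sizeof(double));

    mykernel<<<128, 1>>>(myarray);  // 128 blocks of 1 thread each
    mykernel<<<1, 128>>>(myarray);  // 1 block of 128 threads

    cudaDeviceSynchronize();
    cudaFree(myarray);
    return 0;
}
```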
Am I a fool or should I stick to learning CUDA? :)
Upvotes: 2
Views: 1427
Reputation: 913
CUDA threads execute in groups of 32 called warps, so threads are utilized most effectively when the block size is a multiple of 32.
1 warp = 32 threads.
So model your code keeping this in mind.
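As a small sketch of that rule, a hypothetical host-side helper (not from the original answer) that rounds a requested thread count up to the next warp multiple so no warp is launched partially full:

```cuda
// Hypothetical helper: round a requested thread count up to a
// multiple of the warp size (32 on all current NVIDIA GPUs).
int roundUpToWarpMultiple(int threads)
{
    const int WARP = 32;
    return ((threads + WARP - 1) / WARP) * WARP;
}

// roundUpToWarpMultiple(100) == 128
// roundUpToWarpMultiple(128) == 128
```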
Upvotes: 1
Reputation: 99
What do you mean by "executed 128 times"? If you need to iterate over the array 128 times and each iteration depends on the previous results, then you need to chop the array up into reasonable pieces, run the kernel, synchronize, and repeat.
In general, if you only have 128 elements, then running them all in one block should be alright, because the memory access may be faster.
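A sketch of the dependent-iteration pattern described above (the update rule and names are assumptions for illustration): kernel launches on the same stream are serialized, so each launch sees the previous launch's results.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-iteration kernel: each thread updates one element.
__global__ void step(double *array, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        array[i] = array[i] * 0.5 + 1.0;  // example update rule
}

// When iteration k depends on iteration k-1, launch once per iteration.
// Launches on the default stream run in order, so no explicit sync is
// needed between them; synchronize once at the end before reading back.
void runIterations(double *d_array, int n, int iterations)
{
    for (int k = 0; k < iterations; ++k)
        step<<<1, n>>>(d_array, n);  // one block, as suggested above

    cudaDeviceSynchronize();
}
```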
Upvotes: 0
Reputation: 66
It depends. The first invocation will create many blocks, but each block is too small to use the multiprocessors of your GPU efficiently (each is even smaller than the warp size). The second invocation won't make use of the multiple multiprocessors on your GPU. If you really only need 128 threads, then I'd suggest trying something along the lines of
mykernel<<<4, 32>>>(myarray);
But generally you'll need to benchmark your code with different parameters to optimize the performance anyway, YMMV.
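One way to run such a benchmark is with CUDA events, which time work on the GPU itself. A sketch (the kernel body is a placeholder assumption; error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void mykernel(double *array)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    array[i] += 1.0;
}

// Time a single launch configuration with CUDA events; returns milliseconds.
float timeLaunch(double *d_array, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mykernel<<<blocks, threads>>>(d_array);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    double *d_array;
    cudaMalloc(&d_array, 128 * sizeof(double));

    printf("<<<128,1>>>: %f ms\n", timeLaunch(d_array, 128, 1));
    printf("<<<4,32>>> : %f ms\n", timeLaunch(d_array, 4, 32));
    printf("<<<1,128>>>: %f ms\n", timeLaunch(d_array, 1, 128));

    cudaFree(d_array);
    return 0;
}
```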
Upvotes: 5