c0dehunter

Reputation: 6160

CUDA efficient kernel invocation

Guys, please confirm whether my thinking is right.

Suppose I have a kernel function mykernel(double *array). I want the code inside the kernel to be executed 128 times. I can do this in two ways when invoking the kernel from the host:

mykernel<<<128, 1>>>(myarray);
//or
mykernel<<<1, 128>>>(myarray);

With the first invocation I will create 128 blocks, each running 1 thread. With the second invocation I will create 1 block with 128 threads. But since the code inside the kernel works on the same array, I think it is more efficient to use the second invocation.
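For reference, here is a minimal sketch of what the kernel might look like so that either launch configuration covers all 128 elements exactly once; the indexing is the standard pattern, but the body (doubling each element) is just an assumed placeholder:

// Hypothetical kernel body, for illustration only.
__global__ void mykernel(double *array)
{
    // Global thread index: works for <<<128, 1>>>, <<<1, 128>>> and anything in between.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    array[i] = 2.0 * array[i];   // placeholder work on element i
}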

Am I a fool or should I stick to learning CUDA? :)

Upvotes: 2

Views: 1427

Answers (3)

Code_Jamer

Reputation: 913

The most effective utilization of CUDA threads is with block sizes that are a multiple of 32, because the hardware executes threads in groups of 32. These groups are called warps.

1 warp = 32 threads.

So, model your code keeping this in mind.
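As a rough sketch of what that means for the launch configuration (the element count and block size below are assumptions, not taken from the question beyond the 128 figure):

int N = 128;                          // total number of elements to process (assumed)
int threadsPerBlock = 32;             // one full warp per block
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // = 4 blocks for N = 128
mykernel<<<numBlocks, threadsPerBlock>>>(myarray);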

Upvotes: 1

yurlungurrr

Reputation: 99

What do you mean by "execute 128 times"? If you need to iterate over the array 128 times and each iteration depends on the previous results, then you need to chop the array into reasonable pieces, run the code, then synchronize and repeat.

In general, if you only have 128 elements, then running them all in one block should be alright because memory access may be faster.
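A rough sketch of that single-block pattern, assuming 128 elements, a hypothetical kernel name, and a placeholder update that reads a neighbouring element (which is what makes the __syncthreads() calls necessary):

// Hypothetical kernel: 128 dependent iterations within a single block.
__global__ void iterate128(double *array)
{
    int i = threadIdx.x;                                       // single block, so threadIdx.x indexes the element
    for (int step = 0; step < 128; ++step)
    {
        double updated = 0.5 * (array[i] + array[(i + 1) % 128]);  // placeholder update using a neighbour
        __syncthreads();                                       // wait until every thread has read the old values
        array[i] = updated;
        __syncthreads();                                       // wait until every write is done before the next step
    }
}

// launched as: iterate128<<<1, 128>>>(myarray);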

Upvotes: 0

gentryx

Reputation: 66

It depends. The first invocation will create multiple blocks, but each block will be too small to efficiently use the multiprocessors of your GPU (a single thread per block is even smaller than the warp size). The second invocation won't make use of the multiple multiprocessors on your GPU. If you really only need 128 threads, then I'd suggest trying something along the lines of

mykernel<<<4, 32>>>(myarray);

But generally you'll need to benchmark your code with different parameters to optimize performance anyway; YMMV.
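One way to do that benchmarking (a sketch only; the block sizes tried here are arbitrary choices and error checking is omitted) is to time each configuration with CUDA events:

// Sketch: time the kernel for a few block sizes; needs <cstdio> and the CUDA runtime.
int blockSizes[] = {32, 64, 128};
for (int b = 0; b < 3; ++b)
{
    int threads = blockSizes[b];
    int blocks  = 128 / threads;       // 128 total threads, as in the question
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    mykernel<<<blocks, threads>>>(myarray);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%2d blocks x %3d threads: %f ms\n", blocks, threads, ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}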

Upvotes: 5
