Pantelis Sopasakis

Reputation: 1902

CUDA summation reduction puzzle

Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration extern __shared__ temp[]. In C, the keyword extern is used when a variable is declared but its storage is allocated "elsewhere" (e.g. in another C file). What is the relevance of extern here? Why don't we use:

__shared__ float temp[N/2];

for instance? Or why don't we declare temp to be a global variable, e.g.

#define N 1024
__shared__ float temp[N/2];

__global__ void sum(float *sum,  float *data){ ... }

int main(){
 ...
 sum<<<M,L>>>(sum, data);
}

I have another question: how many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).

Note: You can find information about my devices here.

Upvotes: 0

Views: 611

Answers (1)

talonmies

Reputation: 72372

The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). Declaring shared memory with extern tells the compiler that the shared memory size will be determined at kernel launch, passed in bytes as the third argument of the <<< >>> syntax (or equivalently via an API function), something like:

sum<<< gridsize, blocksize, sharedmem_size >>>(....);
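To make the launch syntax above concrete, here is a minimal sketch of a kernel whose shared buffer is sized at launch. The names block_sum and sum_with_dynamic_smem are hypothetical, the block size is assumed to be a power of two, and error checking is omitted for brevity; treat it as an illustration rather than a tuned reduction.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block sums up to blockDim.x inputs in dynamically sized shared memory.
// Assumption: blockDim.x is a power of two.
__global__ void block_sum(const float *in, float *out, int n)
{
    // No size here: the byte count comes from the third launch argument.
    extern __shared__ float temp[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    temp[tid] = (i < n) ? in[i] : 0.0f;   // pad the last block with zeros
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) temp[tid] += temp[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = temp[0];   // one partial sum per block
}

// Hypothetical host helper: launches the kernel and adds the per-block
// partial sums on the CPU.
float sum_with_dynamic_smem(const float *h_in, int n, int block)
{
    int grid = (n + block - 1) / block;
    size_t smem_bytes = block * sizeof(float);   // one float per thread

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<grid, block, smem_bytes>>>(d_in, d_out, n);

    std::vector<float> partial(grid);
    cudaMemcpy(partial.data(), d_out, grid * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}

int main()
{
    std::vector<float> h(1000, 1.0f);
    printf("sum = %f\n", sum_with_dynamic_smem(h.data(), 1000, 256));
    return 0;
}
```

Because the buffer size is a launch parameter, the same compiled kernel can be run with different block sizes without recompiling, which a fixed-size __shared__ float temp[N/2] declaration would not allow.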

The answer to the second question is normally to launch as many blocks as will completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels accumulate many values per thread and then perform a shared memory reduction. The reduction requires that the number of threads per block be a power of two, which usually gives you 32, 64, 128, 256, 512 (or 1024 if you have a Fermi or Kepler GPU). That is a very small search space, so just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.
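The sizing advice above could be tried out along the following lines. This is only a sketch under stated assumptions: reduce_many and device_sum are hypothetical names, blocks_per_sm = 4 is an arbitrary starting value rather than a recommendation, error checking is omitted, and a single timed launch per block size is a crude benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread first accumulates many inputs in a register (grid-stride loop),
// then the block combines its per-thread totals in shared memory.
// Assumption: blockDim.x is a power of two.
__global__ void reduce_many(const float *in, float *out, int n)
{
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        acc += in[i];
    temp[tid] = acc;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) temp[tid] += temp[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = temp[0];
}

// Hypothetical helper: one untimed warm-up launch, then one timed launch.
// Returns the total; the kernel time in milliseconds is written to *ms.
float device_sum(const float *h_in, int n, int grid, int block, float *ms)
{
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    size_t smem_bytes = block * sizeof(float);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    reduce_many<<<grid, block, smem_bytes>>>(d_in, d_out, n);   // warm-up
    cudaEventRecord(start);
    reduce_many<<<grid, block, smem_bytes>>>(d_in, d_out, n);   // timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    std::vector<float> partial(grid);
    cudaMemcpy(partial.data(), d_out, grid * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;   // finish the sum on the host
    return total;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int blocks_per_sm = 4;   // assumed tuning knob, not a rule
    int grid = prop.multiProcessorCount * blocks_per_sm;

    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f);

    // Try every power-of-two block size the device supports.
    for (int block = 32; block <= prop.maxThreadsPerBlock; block *= 2) {
        float ms = 0.0f;
        float total = device_sum(h.data(), n, grid, block, &ms);
        printf("block=%4d grid=%d sum=%.0f time=%.3f ms\n",
               block, grid, total, ms);
    }
    return 0;
}
```

The grid size is derived from the multiprocessor count so that it stays fixed while the block size varies; the grid-stride loop lets that fixed grid cover an input of any length.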

Upvotes: 2
