Reputation: 253
I have read the article Optimizing Parallel Reduction in CUDA by Mark Harris, and I found it very useful, but there are still one or two concepts I don't understand. On page 18 it says:
//First add during load
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
Optimized code, with two loads and the first add of the reduction:
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x; ...1
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; ...2
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x]; ...3
__syncthreads(); ...4
I am unable to understand line 2: if I have 256 elements and I choose 128 as my block size, why am I multiplying it by 2? Also, how should the block size be determined?
Upvotes: 9
Views: 4889
Reputation: 1138
In the optimized code you run the kernel with blocks half as large as in the non-optimized implementation.
Let's call the block size of the non-optimized code work, let half of that size be called unit, and let these sizes keep the same numerical values for the optimized code as well.
In the non-optimized code you run the kernel with as many threads as work, that is, blockDim = 2 * unit. The code in each block just copies part of g_idata into an array in shared memory of size 2 * unit.
In the optimized code blockDim = unit, so there are now half as many threads, and the shared-memory array is half the size. In line 3 the first summand comes from the even units and the second from the odd units, so all the data required for the reduction is still taken into account.
Example:
If you run the non-optimized kernel with blockDim = 256 = work (a single block, unit = 128), then the optimized code has a single block with blockDim = 128 = unit. Since this block gets blockIdx = 0, the *2 does not matter; the first thread does g_idata[0] + g_idata[0 + 128].
If you had 512 elements and ran the non-optimized code with 2 blocks of size 256 (work = 256, unit = 128), then the optimized code also has 2 blocks, but now of size 128. The first thread in the second block (blockIdx = 1) does g_idata[2*128] + g_idata[2*128 + 128].
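If it helps to see the arithmetic spelled out, here is a small host-side check (plain C, my own addition, not part of the original answer) that reproduces lines 2 and 3 of the snippet for this 512-element, 2-block case, printing the two indices read by the first and last thread of each block:
#include <stdio.h>

int main(void) {
    const unsigned int block_dim = 128;      /* blockDim.x (= unit) */
    const unsigned int grid_dim  = 2;        /* two blocks in the optimized kernel */
    const unsigned int tids[]    = {0, 127}; /* first and last thread of a block */

    for (unsigned int b = 0; b < grid_dim; ++b) {
        for (int t = 0; t < 2; ++t) {
            unsigned int tid = tids[t];
            unsigned int i   = b * (block_dim * 2) + tid;   /* line 2 */
            printf("block %u, thread %3u: g_idata[%3u] + g_idata[%3u]\n",
                   b, tid, i, i + block_dim);               /* line 3 */
        }
    }
    return 0;
}
For blockIdx = 1 and threadIdx = 0 this prints g_idata[256] + g_idata[384], i.e. g_idata[2*128] + g_idata[2*128+128], as stated above.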
Upvotes: 1
Reputation: 51393
Basically, it is performing the operation shown in the picture below:
This code is saying that half of the threads will perform the reading from global memory and the writing to shared memory, as shown in the picture.
You execute a kernel, and now you want to reduce some values, so you limit access to the code above to only half of the total threads running. Imagine you have 4 blocks, each with 512 threads; you limit the code above so it is executed only by the first two blocks, and you have g_idata[4*512]:
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
So:
thread 0 of block 0 copies positions 0 and 512;
thread 1 of block 0 copies positions 1 and 513;
thread 511 of block 0 copies positions 511 and 1023;
thread 0 of block 1 copies positions 1024 and 1536;
thread 511 of block 1 copies positions 1535 and 2047.
The blockDim.x*2 is used because each thread accesses positions i and i+blockDim.x, so you need to multiply by 2 to guarantee that the threads of the next block do not compute positions of g_idata that have already been computed.
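To tie the two snippets in the question together, below is a hedged sketch of a complete block-level sum reduction that uses this two-loads-plus-first-add pattern. It is my own reconstruction in the spirit of the article's later listings, not copied from it; it assumes float data, that blockDim.x is a power of two, and that gridDim.x * blockDim.x * 2 equals the number of input elements:
__global__ void reduce_first_add(const float *g_idata, float *g_odata)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // Two global loads and the first add of the reduction,
    // exactly as in lines 2-3 of the snippet in the question.
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Shared-memory tree reduction over the blockDim.x partial sums.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the block's partial sum back to global memory.
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
It would be launched, for example, as reduce_first_add<<<numBlocks, threads, threads * sizeof(float)>>>(d_in, d_out) with numBlocks = n / (2 * threads), so each block consumes 2 * threads input elements and produces one partial sum.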
Upvotes: 8