Reputation: 253
I have read the article Optimizing Parallel Reduction in CUDA by Mark Harris, and I found it very useful, but there are still one or two concepts I don't understand. On page 18 it says:
//First add during load
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
Optimized code, with two loads and the first add of the reduction:
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x; ...1
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; ...2
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x]; ...3
__syncthreads(); ...4
I am unable to understand line 2: if I have 256 elements and I choose 128 as my block size, why am I multiplying it by 2? Also, how should the block size be determined?
Upvotes: 9
Views: 4889
Reputation: 1138
In the optimized code you run the kernel with blocks half as large as in the non-optimized implementation.
Let's call the block size of the non-optimized code work, let half of that size be called unit, and let these sizes keep the same numerical values for the optimized code as well.
In the non-optimized code you run the kernel with as many threads as work, that is, blockDim = 2 * unit. The code in each block just copies part of g_idata into an array in shared memory of size 2 * unit.
In the optimized code blockDim = unit, so there are now half as many threads, and the shared-memory array is half the size. In line 3 the first summand comes from the even units and the second from the odd units, so all the data required for the reduction is still taken into account.
Example:
If you run the non-optimized kernel with blockDim = 256 = work (a single block, unit = 128), then the optimized code has a single block with blockDim = 128 = unit. Since this block gets blockIdx = 0, the *2 does not matter; the first thread does g_idata[0] + g_idata[0 + 128].
If you had 512 elements and ran the non-optimized code with 2 blocks of size 256 (work = 256, unit = 128), then the optimized code also has 2 blocks, but now of size 128. The first thread in the second block (blockIdx = 1) does g_idata[2*128] + g_idata[2*128 + 128].
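If it helps to see the arithmetic spelled out, here is a small host-side check (plain C, my own addition, not part of the original answer) that reproduces lines 2 and 3 of the snippet for this 512-element, 2-block case, printing the two indices read by the first and last thread of each block:
#include <stdio.h>

int main(void) {
    const unsigned int block_dim = 128;      /* blockDim.x (= unit) */
    const unsigned int grid_dim  = 2;        /* two blocks in the optimized kernel */
    const unsigned int tids[]    = {0, 127}; /* first and last thread of a block */

    for (unsigned int b = 0; b < grid_dim; ++b) {
        for (int t = 0; t < 2; ++t) {
            unsigned int tid = tids[t];
            unsigned int i   = b * (block_dim * 2) + tid;   /* line 2 */
            printf("block %u, thread %3u: g_idata[%3u] + g_idata[%3u]\n",
                   b, tid, i, i + block_dim);               /* line 3 */
        }
    }
    return 0;
}
For blockIdx = 1 and threadIdx = 0 this prints g_idata[256] + g_idata[384], i.e. g_idata[2*128] + g_idata[2*128+128], as stated above.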
Upvotes: 1
Reputation: 51393
Basically, it is performing the operation shown in the picture below:
This code is saying that half of the threads will perform the reading from global memory and the writing to shared memory, as shown in the picture.
You execute a kernel, and now you want to reduce some values, so you limit access to the code above to only half of the total threads running. Imagine you have 4 blocks, each with 512 threads; you limit the code above so it is executed only by the first two blocks, and you have g_idata[4*512]:
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
So:
thread 0 of block 0 copies positions 0 and 512;
thread 1 of block 0 copies positions 1 and 513;
thread 511 of block 0 copies positions 511 and 1023;
thread 0 of block 1 copies positions 1024 and 1536;
thread 511 of block 1 copies positions 1535 and 2047.
The blockDim.x*2 is used because each thread accesses positions i and i+blockDim.x, so you need to multiply by 2 to guarantee that the threads of the next block do not compute positions of g_idata that have already been computed.
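To tie the two snippets in the question together, below is a hedged sketch of a complete block-level sum reduction that uses this two-loads-plus-first-add pattern. It is my own reconstruction in the spirit of the article's later listings, not copied from it; it assumes float data, that blockDim.x is a power of two, and that gridDim.x * blockDim.x * 2 equals the number of input elements:
__global__ void reduce_first_add(const float *g_idata, float *g_odata)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // Two global loads and the first add of the reduction,
    // exactly as in lines 2-3 of the snippet in the question.
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Shared-memory tree reduction over the blockDim.x partial sums.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the block's partial sum back to global memory.
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}
It would be launched, for example, as reduce_first_add<<<numBlocks, threads, threads * sizeof(float)>>>(d_in, d_out) with numBlocks = n / (2 * threads), so each block consumes 2 * threads input elements and produces one partial sum.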
Upvotes: 8