Reputation: 387
Relying on NVIDIA's samples and on some good advice found here at SO, I have managed to implement a few of the array-reduction kernels I need for my project. However, one particular issue keeps causing me trouble: how to properly do sum-reduction for arrays of unsigned chars (uchar).
Because uchar
can hold values from 0 to 255, of course the thread blocks can't accumulate a value greater than 255 per thread block. My intuition was that it would be merely a case of collecting the sums inside the sum-reduction function in an int
despite the input being uchar
. However, it does not work.
Let me show in detail what I have. Below is my kernel to sum-reduce an array of uchar; it is a slightly modified version of the famous reduce6 function from NVIDIA's samples:
template <class T, unsigned int blockSize>
__global__ void reduce6(int n, T *g_idata, int *g_odata)
{
    extern __shared__ T sdata[];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockSize * 2 + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    int mySum = 0;

    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += g_idata[i];
        // ensure we don't read out of bounds
        if (i + blockSize < n) mySum += g_idata[i + blockSize];
        i += gridSize;
    }

    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if ((blockSize >= 512) && (tid < 256))
        sdata[tid] = mySum = mySum + sdata[tid + 256];
    __syncthreads();

    if ((blockSize >= 256) && (tid < 128))
        sdata[tid] = mySum = mySum + sdata[tid + 128];
    __syncthreads();

    if ((blockSize >= 128) && (tid < 64))
        sdata[tid] = mySum = mySum + sdata[tid + 64];
    __syncthreads();

    // fully unroll reduction within a single warp
    if ((blockSize >= 64) && (tid < 32))
        sdata[tid] = mySum = mySum + sdata[tid + 32];
    __syncthreads();

    if ((blockSize >= 32) && (tid < 16))
        sdata[tid] = mySum = mySum + sdata[tid + 16];
    __syncthreads();

    if ((blockSize >= 16) && (tid < 8))
        sdata[tid] = mySum = mySum + sdata[tid + 8];
    __syncthreads();

    if ((blockSize >= 8) && (tid < 4))
        sdata[tid] = mySum = mySum + sdata[tid + 4];
    __syncthreads();

    if ((blockSize >= 4) && (tid < 2))
        sdata[tid] = mySum = mySum + sdata[tid + 2];
    __syncthreads();

    if ((blockSize >= 2) && (tid < 1))
        mySum += sdata[tid + 1];
    __syncthreads();

    // write result for this block to global mem
    if (tid == 0) atomicAdd(g_odata, mySum);
}
When this kernel is called as reduce6<uchar, Blocksize> such that Blocksize*num.threads = 256, everything works properly and the sum-reduction gets the right result. Whenever that ratio is not 256, the result of the sum-reduction becomes wrong, which is simply due to what I said in the beginning, i.e. a uchar can't accumulate values greater than 255.
To me, the intuitive solution would be to simply change the line:

extern __shared__ T sdata[];

to:

extern __shared__ int sdata[];

Since sdata is a shared array created within the sum-reduction kernel, I thought it could be of any type and would thus properly accumulate whatever values result from the thread-block summation. Just to make sure, I even wrote the while loop with an explicit conversion of the incoming data to int:
while (i < n)
{
    mySum += (int)g_idata[i];
    // ensure we don't read out of bounds
    if (i + blockSize < n) mySum += (int)g_idata[i + blockSize];
    i += gridSize;
}
However, to my surprise, all that only makes the sum-reduction result always zero.

What am I missing? How could I alter this kernel so that the uchar array being passed can be properly sum-reduced with an arbitrary number of thread blocks and threads?
If needed, a full example code can be found at: http://pastebin.com/nq1VRJCs
Upvotes: 0
Views: 536
Reputation: 7245
1. Add error checking to find that your kernel is not running at all in the cases where the returned sum is zero.
2. Run your program under cuda-memcheck to find that your kernel produces out-of-bounds shared memory accesses when you change the type of the shared memory array, or when you increase the block size beyond 256.
3. Then see that the size calculation in your full code on pastebin is incorrect for block sizes larger than 256, and that its explicit reference to the type of the shared memory array is not adjusted together with the actual type used in the kernel:
int smemSize = (threads <= 256) ?
    2 * threads * sizeof(uchar) :
    threads * sizeof(uchar);
You have no such case differentiation in the kernel code itself.
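For illustration, here is a minimal sketch of the kind of error checking meant in step 1, wrapped around the launch from your pastebin (dimGrid, dimBlock, smemSize, DATA_LENGTH, d_data1 and d_reduced are the names from your full code):

reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize >>>
    (DATA_LENGTH, d_data1, d_reduced);

// an invalid launch configuration (e.g. a bad shared memory size)
// is reported immediately by cudaGetLastError()
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));

// errors that occur while the kernel executes (e.g. out-of-bounds
// accesses) surface at the next synchronization point
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));

For step 2, running the unmodified binary under cuda-memcheck (cuda-memcheck ./your_binary) reports the faulting shared memory accesses directly.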
Upvotes: 2
Reputation: 72350
The problem is most likely in code you have not shown here:
int smemSize = (threads <= 256) ?
    2 * threads * sizeof(uchar) :
    threads * sizeof(uchar);

reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize >>>
    (DATA_LENGTH, d_data1, d_reduced);
If you have changed the type of the shared memory buffer within the kernel, you must change its size in the kernel call as well.
The reason the results are zero in this case is that the kernel never runs to completion. If you ran the code with cuda-memcheck, or added appropriate runtime API error checking, you would already know this.
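To make that concrete, a minimal sketch of the adjusted launch, assuming the shared memory array inside the kernel has been changed to int as you propose (names taken from the snippet above):

// with int-typed shared memory and the guarded reads in the kernel
// (every read of sdata[tid + s] is protected by tid < s), one int
// per thread of the block is sufficient; threads must match the
// blockSize template argument (256 here)
int smemSize = threads * sizeof(int);

reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize >>>
    (DATA_LENGTH, d_data1, d_reduced);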
Upvotes: 2