Reputation: 387
Relying on NVIDIA's samples and on some good advice found here at SO, I have managed to implement a few of the array-reduction kernels I need for my project. However, one particular issue keeps causing me trouble: how to properly do sum-reduction for arrays of unsigned chars (uchar).
Because uchar
can hold values from 0 to 255, of course the thread blocks can't accumulate a value greater than 255 per thread block. My intuition was that it would be merely a case of collecting the sums inside the sum-reduction function in an int
despite the input being uchar
. However, it does not work.
Let me show in detail what I have. Below is my kernel to sum-reduce an array of uchar; it is a slightly modified version of the famous reduce6 function from NVIDIA's samples:
template <class T, unsigned int blockSize>
__global__ void reduce6(int n, T *g_idata, int *g_odata)
{
    extern __shared__ T sdata[];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockSize * 2 + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    int mySum = 0;

    // we reduce multiple elements per thread. The number is determined by the
    // number of active thread blocks (via gridDim). More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += g_idata[i];
        // ensure we don't read out of bounds
        if (i + blockSize < n) mySum += g_idata[i + blockSize];
        i += gridSize;
    }

    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if ((blockSize >= 512) && (tid < 256))
        sdata[tid] = mySum = mySum + sdata[tid + 256];
    __syncthreads();

    if ((blockSize >= 256) && (tid < 128))
        sdata[tid] = mySum = mySum + sdata[tid + 128];
    __syncthreads();

    if ((blockSize >= 128) && (tid < 64))
        sdata[tid] = mySum = mySum + sdata[tid + 64];
    __syncthreads();

    // fully unroll reduction within a single warp
    if ((blockSize >= 64) && (tid < 32))
        sdata[tid] = mySum = mySum + sdata[tid + 32];
    __syncthreads();

    if ((blockSize >= 32) && (tid < 16))
        sdata[tid] = mySum = mySum + sdata[tid + 16];
    __syncthreads();

    if ((blockSize >= 16) && (tid < 8))
        sdata[tid] = mySum = mySum + sdata[tid + 8];
    __syncthreads();

    if ((blockSize >= 8) && (tid < 4))
        sdata[tid] = mySum = mySum + sdata[tid + 4];
    __syncthreads();

    if ((blockSize >= 4) && (tid < 2))
        sdata[tid] = mySum = mySum + sdata[tid + 2];
    __syncthreads();

    if ((blockSize >= 2) && (tid < 1))
        mySum += sdata[tid + 1];
    __syncthreads();

    // write result for this block to global mem
    if (tid == 0) atomicAdd(g_odata, mySum);
}
When this kernel is called as reduce6<uchar, Blocksize> such that Blocksize*num.threads = 256, everything works properly and the sum-reduction gets the right result. Whenever that ratio is not 256, the result of the sum-reduction becomes wrong, which is simply due to what I said in the beginning, i.e. a uchar can't accumulate values greater than 255.
To me, the intuitive solution would be to simply change the line:

extern __shared__ T sdata[];

to:

extern __shared__ int sdata[];

Since sdata is a shared array created within the sum-reduction kernel, I thought it could be of any type and would thus properly accumulate whatever values result from the thread-block summation. Just to make sure, I even wrote the while loop with an explicit conversion of the incoming data to int:
while (i < n)
{
    mySum += (int)g_idata[i];
    // ensure we don't read out of bounds
    if (i + blockSize < n) mySum += (int)g_idata[i + blockSize];
    i += gridSize;
}
However, to my surprise, all that only makes the sum-reduction result always zero.

What am I missing? How could I alter this kernel so that the uchar array being passed can be properly sum-reduced with an arbitrary number of thread blocks and threads?
If needed, a full example code can be found at: http://pastebin.com/nq1VRJCs
Upvotes: 0
Views: 536
Reputation: 7245
1. Add error checking to find that your kernel is not running at all in the cases where the returned sum is zero.
2. Run your program under cuda-memcheck to find that your kernel produces out-of-bounds shared memory accesses when you change the type of the shared memory array, or when you increase the block size beyond 256.
3. Then see that the size calculation in your full code on pastebin is incorrect for block sizes larger than 256, and that its explicit reference to the type of the shared memory array is not adjusted together with the actual type used in the kernel:
int smemSize = (threads <= 256) ?
    2 * threads * sizeof(uchar) :
    threads * sizeof(uchar);
You have no such case differentiation in the kernel code itself.
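For illustration, here is a minimal sketch of the kind of error checking meant in step 1, wrapped around the launch from your pastebin (dimGrid, dimBlock, smemSize, DATA_LENGTH, d_data1 and d_reduced are the names from your full code):

reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize >>>
    (DATA_LENGTH, d_data1, d_reduced);

// an invalid launch configuration (e.g. a bad shared memory size)
// is reported immediately by cudaGetLastError()
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));

// errors that occur while the kernel executes (e.g. out-of-bounds
// accesses) surface at the next synchronization point
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));

For step 2, running the unmodified binary under cuda-memcheck (cuda-memcheck ./your_binary) reports the faulting shared memory accesses directly.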
Upvotes: 2
Reputation: 72350
The problem is most likely in code you have not shown here:
int smemSize = (threads <= 256) ?
    2 * threads * sizeof(uchar) :
    threads * sizeof(uchar);

reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize >>>
    (DATA_LENGTH, d_data1, d_reduced);
If you have changed the type of the shared memory buffer within the kernel, you must change its size in the kernel call as well.
The reason the results are zero in this case is that the kernel never runs to completion. If you ran the code with cuda-memcheck, or added appropriate runtime API error checking, you would already know this.
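To make that concrete, a minimal sketch of the adjusted launch, assuming the shared memory array inside the kernel has been changed to int as you propose (names taken from the snippet above):

// with int-typed shared memory and the guarded reads in the kernel
// (every read of sdata[tid + s] is protected by tid < s), one int
// per thread of the block is sufficient; threads must match the
// blockSize template argument (256 here)
int smemSize = threads * sizeof(int);

reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize >>>
    (DATA_LENGTH, d_data1, d_reduced);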
Upvotes: 2