Reputation: 15471
I am working through the CUDA parallel reduction whitepaper, but unfortunately my algorithm repeatedly produces incorrect results, and I cannot figure out why (surely a textbook example must work? Surely I'm just doing something obviously wrong?). Here is my code:
My define:
#define BLOCK_SIZE 512
My Kernel function:
__global__ void total(float * inputList, float * outputList, int len) {
    __shared__ float sdata[2*BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    // Each thread loads two elements and adds them during the load.
    sdata[tid] = inputList[i] + inputList[i+blockDim.x];
    __syncthreads();
    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        outputList[blockIdx.x] = sdata[0];
}
My memory allocation:
outputSize = inputSize / (BLOCK_SIZE<<1);
cudaMalloc((void**) &deviceInput, inputSize*sizeof(float));
cudaMalloc((void**) &deviceOutput, outputSize*sizeof(float));
cudaMemcpy(deviceInput, hostInput, inputSize*sizeof(float), cudaMemcpyHostToDevice);
My device call:
dim3 dimGrid((inputSize-1)/BLOCK_SIZE +1, 1, 1);
dim3 dimBlock(BLOCK_SIZE,1,1);
total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
cudaDeviceSynchronize();
My memory fetch:
cudaMemcpy(hostOutput, deviceOutput, outputSize*sizeof(float), cudaMemcpyDeviceToHost);
And finally my final calculation:
for (int counter = 1; counter < outputSize; counter++) {
    hostOutput[0] += hostOutput[counter];
}
Any help would be appreciated.
Upvotes: 1
Views: 973
Reputation: 16796
Your kernel launch configuration in the following line of your code is incorrect.
total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
The first argument of the kernel launch configuration is the grid size and the second argument is the block size.
You should be doing this:
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);
Please always perform error checking on CUDA runtime API calls and inspect the returned error codes to find the reason your program fails.
With your current code the kernel launch should fail, and error checking on the cudaDeviceSynchronize call would have pointed you to the cause of the incorrect results.
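For example, a minimal error-checking sketch might look like this (checkCuda is just an illustrative helper name, not part of your code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the error string and abort if a CUDA runtime call fails.
#define checkCuda(call)                                                  \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Usage around the (corrected) launch:
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);
checkCuda(cudaGetLastError());        // catches an invalid launch configuration
checkCuda(cudaDeviceSynchronize());   // catches errors during kernel execution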
Upvotes: 5
Reputation: 579
The kernel assumes the input size is a multiple of the block size (in fact, a multiple of twice the block size, since each thread loads two elements). If inputSize is not, the loads will read off the end of the inputList array.
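One way to guard against that is to clamp the out-of-range loads to zero inside the kernel. A sketch, assuming the existing len parameter holds the input length:

// Bounds-checked replacement for the unguarded load in the kernel;
// out-of-range elements contribute 0 to the partial sum.
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
float a = (i < (unsigned int)len) ? inputList[i] : 0.0f;
float b = (i + blockDim.x < (unsigned int)len) ? inputList[i + blockDim.x] : 0.0f;
sdata[tid] = a + b;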
Upvotes: 3