Abraham P

Reputation: 15471

Parallel list reduction in CUDA

I am working through the CUDA parallel reduction whitepaper, but unfortunately my algorithm repeatedly produces incorrect results, and I cannot figure out why (surely a textbook example must work? Surely I'm just doing something obviously wrong?). Here is my code:

My define:

 #define BLOCK_SIZE 512

My kernel function:

 __global__ void total(float * inputList, float * outputList, int len) {
      __shared__ float sdata[2*BLOCK_SIZE];
      unsigned int tid = threadIdx.x;
      unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
      sdata[tid] = inputList[i]+inputList[i+blockDim.x];
      __syncthreads();
      for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
          sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
      }
      if (tid == 0) 
        outputList[blockIdx.x] = sdata[0];
}

My memory allocation:

  outputSize = inputSize / (BLOCK_SIZE<<1);
  cudaMalloc((void**) &deviceInput, inputSize*sizeof(float));
  cudaMalloc((void**) &deviceOutput, outputSize*sizeof(float));
  cudaMemcpy(deviceInput, hostInput, inputSize*sizeof(float), cudaMemcpyHostToDevice);

My device call:

 dim3 dimGrid((inputSize-1)/BLOCK_SIZE +1, 1, 1);
 dim3 dimBlock(BLOCK_SIZE,1,1);

 total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
 cudaDeviceSynchronize();

My memory fetch:

 cudaMemcpy(hostOutput, deviceOutput, outputSize*sizeof(float), cudaMemcpyDeviceToHost);

And finally my final calculation:

 for (int counter = 1; counter < outputSize; counter++) {
    hostOutput[0] += hostOutput[counter];
 }

Any help would be appreciated.

Upvotes: 1

Views: 973

Answers (2)

sgarizvi

Reputation: 16796

The kernel launch configuration in the following line of your code is incorrect:

total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize); 

The first argument of the kernel launch configuration is the grid size, and the second argument is the block size.

You should be doing this:

total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize); 

Please always perform error checking on CUDA runtime function calls and inspect the returned error codes to find the reason for a failure.

With your current code, the kernel launch fails. Error checking on the cudaDeviceSynchronize call would have pointed you to the reason for the incorrect results.
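A minimal sketch of such error checking, using a common wrapper-macro pattern (the `CUDA_CHECK` name is my own, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; abort with a readable message on failure.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

// Usage after a kernel launch:
// total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);
// CUDA_CHECK(cudaGetLastError());       // catches an invalid launch configuration
// CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution
```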

Upvotes: 5

Matt

Reputation: 579

The code assumes the input size is a multiple of the block size. If inputSize is not a multiple of the block size, the kernel will read off the end of the inputList array.
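One way to guard against this is to bounds-check each load; a sketch of the kernel with guarded loads (assuming out-of-range lanes should contribute 0.0f so they do not affect the sum):

```cuda
__global__ void total(float *inputList, float *outputList, int len) {
  __shared__ float sdata[2*BLOCK_SIZE];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

  // Load two elements per thread, but only if they are in bounds.
  float sum = 0.0f;
  if (i < len)              sum += inputList[i];
  if (i + blockDim.x < len) sum += inputList[i + blockDim.x];
  sdata[tid] = sum;
  __syncthreads();

  // Tree reduction in shared memory, halving the active threads each step.
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
      sdata[tid] += sdata[tid + s];
    __syncthreads();
  }

  // Thread 0 writes this block's partial sum.
  if (tid == 0)
    outputList[blockIdx.x] = sdata[0];
}
```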

Upvotes: 3
