Reputation: 15471
I am working through the CUDA parallel reduction whitepaper, but unfortunately my algorithm repeatedly produces incorrect results, and I cannot figure out why (surely a textbook example must work? Surely I'm just doing something obviously wrong?). Here is my code:
My define:
#define BLOCK_SIZE 512
My Kernel function:
__global__ void total(float * inputList, float * outputList, int len) {
    __shared__ float sdata[2*BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    // Each thread loads two elements and adds them during the load.
    sdata[tid] = inputList[i] + inputList[i+blockDim.x];
    __syncthreads();
    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        outputList[blockIdx.x] = sdata[0];
}
My memory allocation:
outputSize = inputSize / (BLOCK_SIZE<<1);
cudaMalloc((void**) &deviceInput, inputSize*sizeof(float));
cudaMalloc((void**) &deviceOutput, outputSize*sizeof(float));
cudaMemcpy(deviceInput, hostInput, inputSize*sizeof(float), cudaMemcpyHostToDevice);
My device call:
dim3 dimGrid((inputSize-1)/BLOCK_SIZE +1, 1, 1);
dim3 dimBlock(BLOCK_SIZE,1,1);
total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
cudaDeviceSynchronize();
My memory fetch:
cudaMemcpy(hostOutput, deviceOutput, outputSize*sizeof(float), cudaMemcpyDeviceToHost);
And finally my final calculation:
for (int counter = 1; counter < outputSize; counter++) {
    hostOutput[0] += hostOutput[counter];
}
Any help would be appreciated.
Upvotes: 1
Views: 973
Reputation: 16796
Your kernel launch configuration in the following line of your code is incorrect.
total<<<dimBlock, dimGrid>>>(deviceInput, deviceOutput, outputSize);
The first argument of the kernel launch configuration is the grid size and the second argument is the block size.
You should be doing this:
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);
Please always perform error checking on CUDA runtime API calls and inspect the returned error codes to find the reason your program fails.
With your current code the kernel launch should fail, and error checking on the cudaDeviceSynchronize call would have pointed you to the cause of the incorrect results.
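For example, a minimal error-checking sketch might look like this (checkCuda is just an illustrative helper name, not part of your code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the error string and abort if a CUDA runtime call fails.
#define checkCuda(call)                                                  \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Usage around the (corrected) launch:
total<<<dimGrid, dimBlock>>>(deviceInput, deviceOutput, outputSize);
checkCuda(cudaGetLastError());        // catches an invalid launch configuration
checkCuda(cudaDeviceSynchronize());   // catches errors during kernel execution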
Upvotes: 5
Reputation: 579
The kernel assumes the input size is a multiple of the block size (in fact, a multiple of twice the block size, since each thread loads two elements). If inputSize is not, the loads will read off the end of the inputList array.
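One way to guard against that is to clamp the out-of-range loads to zero inside the kernel. A sketch, assuming the existing len parameter holds the input length:

// Bounds-checked replacement for the unguarded load in the kernel;
// out-of-range elements contribute 0 to the partial sum.
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
float a = (i < (unsigned int)len) ? inputList[i] : 0.0f;
float b = (i + blockDim.x < (unsigned int)len) ? inputList[i + blockDim.x] : 0.0f;
sdata[tid] = a + b;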
Upvotes: 3