Reputation: 653
I tried to implement vector sum reduction in CUDA on my own and ran into an error that I could fix, but without understanding what the actual problem was.
I implemented the kernel below, which is pretty much the same as the one used in NVIDIA's samples.
__global__
void reduce0(int *input, int *output)
{
    extern __shared__ int s_data[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    s_data[tid] = input[i];
    __syncthreads();

    for (int s = 1; s < blockDim.x; s *= 2) {
        if ((tid % 2*s) == 0) {
            s_data[tid] += s_data[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        output[blockIdx.x] = s_data[0];
    }
}
Furthermore, I calculated the shared memory size on the host side as follows:
int sharedMemSize = numberOfValues * sizeof(int);
If more than one block of threads is used, the code runs fine. Using only one block ends in an index-out-of-bounds error. While looking for my mistake by comparing my host code with the examples, I found the following line:
int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T);
Playing a little with my block/grid setup brought me to the following result: allocating 2*#threads of shared memory makes the code run fine.

Despite thinking about this for a few hours, I don't understand why there is an out-of-bounds error when using too few threads or blocks.
UPDATE: Host code calling the kernel, as requested:
int numberOfValues = 1024;
int numberOfThreadsPerBlock = 32;
int numberOfBlocks = numberOfValues / numberOfThreadsPerBlock;
int memSize = sizeof(int) * numberOfValues;

int *values = (int *) malloc(memSize);
int *result = (int *) malloc(memSize);
int *values_device, *result_device;
cudaMalloc((void **) &values_device, memSize);
cudaMalloc((void **) &result_device, memSize);

for (int i = 0; i < numberOfValues; i++) {
    values[i] = i + 1;
}

cudaMemcpy(values_device, values, memSize, cudaMemcpyHostToDevice);

dim3 dimGrid(numberOfBlocks, 1);
dim3 dimBlock(numberOfThreadsPerBlock, 1);
int sharedMemSize = numberOfThreadsPerBlock * sizeof(int);

reduce0<<<dimGrid, dimBlock, sharedMemSize>>>(values_device, result_device);

if (cudaSuccess != cudaGetLastError())
    printf("Error!\n");

cudaMemcpy(result, result_device, memSize, cudaMemcpyDeviceToHost);
Upvotes: 0
Views: 1619
Reputation: 2872
Could your problem be the precedence order of modulo and multiplication? Since % and * have equal precedence and associate left-to-right,

tid % 2*s

is equal to

(tid % 2)*s

but you want

tid % (2*s)
The reason why you need to use

int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T);

for a small number of threads is out-of-bounds indexing. One example of when this happens is launching 29 threads: when tid == 28 and s == 2, the branch is taken because 28 % (2*2) == 0, and you index into s_data[28 + 2], but you have only allocated shared memory for 29 elements.
Upvotes: 2