Reputation: 89
I'm studying parallel reduction from the well-known slides by Mark Harris. In particular, I implemented optimization step #5, but I get a wrong result: 17 instead of 41. I used the same number sequence shown in the slides. I omitted the "extern" on the shared array in my code because the kernel and the host code are in the same .cu file.
#include <stdio.h>
#include <cuda_runtime.h>

#define THREAD_PER_BLOCK 16

__global__ void reduce5(int *g_idata, int *g_odata) {
    __shared__ int sdata[THREAD_PER_BLOCK];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // unroll the last warp
    if (tid < 32)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main()
{
    int inputLength = 16;
    int hostInput[16] = {10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2};
    int hostOutput = 0;
    int *deviceInput;
    int *deviceOutput;

    cudaMalloc((void **)&deviceInput, inputLength * sizeof(int));
    cudaMalloc((void **)&deviceOutput, sizeof(int));
    cudaMemcpy(deviceInput, hostInput, inputLength * sizeof(int), cudaMemcpyHostToDevice);

    reduce5<<<1,16>>>(deviceInput, deviceOutput);
    cudaDeviceSynchronize();

    cudaMemcpy(&hostOutput, deviceOutput, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", hostOutput);

    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    return 0;
}
Upvotes: 0
Views: 578
Reputation: 21
I encountered the same problem, and it turns out the threads in the unrolled warp are not actually reading each other's updates unless the shared array is declared as volatile. Simply add volatile when declaring sdata and the problem will be solved.

Please refer to my post: cuda Threads in A Warp appear to be not in synchronization
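As a minimal sketch of that change (keeping everything else in the question's kernel as-is):

```cuda
// volatile forces every read/write of sdata to go through shared memory,
// so the unrolled warp steps see each other's freshly written partial sums
// instead of values cached in registers
__shared__ volatile int sdata[THREAD_PER_BLOCK];
```

Note that on architectures with independent thread scheduling (Volta and later), the implicit warp-synchronous assumption behind this unrolling no longer holds by itself, and __syncwarp() calls between the unrolled steps are needed as well.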
Upvotes: 1
Reputation: 72349
In this code THREAD_PER_BLOCK must be a multiple of 32, and at least 64. The length of the input data must also be twice the product of the block and grid sizes. You don't see it (because you aren't performing any sort of error checking), but the thread and warp reductions will be failing because of out-of-bounds shared memory and global memory accesses.

Also note that extern __shared__ has nothing to do with whether the kernel and other code are in the same file. It denotes that the shared memory for that variable will be allocated dynamically at runtime, rather than statically at compile time. The size of the allocation is passed as the third argument in the kernel launch syntax.
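To make those constraints concrete, here is a sketch of a consistent configuration. The sizes are mine, chosen to satisfy the rules above (64 threads, input length = 2 * 64 * 1 = 128), with error checking reduced to a bare minimum:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define THREAD_PER_BLOCK 64   // multiple of 32, and at least 64
#define NUM_BLOCKS 1
// each block consumes 2*blockDim.x inputs in its first load
#define INPUT_LENGTH (2 * THREAD_PER_BLOCK * NUM_BLOCKS)

__global__ void reduce5(int *g_idata, int *g_odata) {
    // dynamically allocated: the size comes from the launch configuration,
    // not from a compile-time constant (volatile per the other answer)
    extern __shared__ volatile int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

    for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) {   // now tid+32 stays inside the 64-entry array
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main()
{
    int hostInput[INPUT_LENGTH], hostOutput = 0, expected = 0;
    for (int k = 0; k < INPUT_LENGTH; ++k) {
        hostInput[k] = k % 7 - 3;       // arbitrary test data
        expected += hostInput[k];
    }

    int *deviceInput, *deviceOutput;
    cudaMalloc((void **)&deviceInput, INPUT_LENGTH * sizeof(int));
    cudaMalloc((void **)&deviceOutput, NUM_BLOCKS * sizeof(int));
    cudaMemcpy(deviceInput, hostInput, INPUT_LENGTH * sizeof(int), cudaMemcpyHostToDevice);

    // third launch argument = bytes of dynamic shared memory per block
    reduce5<<<NUM_BLOCKS, THREAD_PER_BLOCK, THREAD_PER_BLOCK * sizeof(int)>>>
        (deviceInput, deviceOutput);

    // minimal error checking: catch a failed launch instead of reading garbage
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(&hostOutput, deviceOutput, sizeof(int), cudaMemcpyDeviceToHost);
    printf("got %d, expected %d\n", hostOutput, expected);

    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    return 0;
}
```

With more than one block, g_odata holds one partial sum per block, and you would finish the reduction on the host or with a second kernel launch.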
Upvotes: 2