dominik

Reputation: 653

Cuda shared memory out of bounds when using only one block or too few threads

I tried to implement a vector sum reduction in CUDA on my own and ran into an error that I was able to fix, but without understanding what the actual problem was.

I implemented the kernel below, which is pretty much the same as the one used in NVIDIA's samples.

__global__ 
void reduce0(int *input, int *output)
{
    extern __shared__ int s_data[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    s_data[tid] = input[i];
    __syncthreads();

    for( int s=1; s < blockDim.x; s *= 2) {
        if((tid % 2*s) == 0) {
            s_data[tid] += s_data[tid + s];
        }

        __syncthreads();
    }

    if(tid == 0) {
        output[blockIdx.x] = s_data[0];
    }
}

Furthermore, on the host side I calculated the shared memory size as follows:

int sharedMemSize = numberOfValues * sizeof(int);

If more than 1 block of threads is used, the code runs just fine. Using only 1 block results in the index-out-of-bounds error mentioned in the title. While searching for my error by comparing my host code with the samples, I found the following line:

int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T);

Playing around with my block/grid setup led me to the following results:

  1. 1 block, arbitrary number of threads => code crashes
  2. More than 2 blocks, arbitrary number of threads => code runs fine
  3. 1 block, arbitrary number of threads, shared memory size 2*#threads => code runs fine

Although I have been thinking about this for a few hours, I don't understand why there is an out-of-bounds error when using too few threads or blocks.

UPDATE: Host code calling the kernel as requested

int numberOfValues = 1024;
int numberOfThreadsPerBlock = 32;
int numberOfBlocks = numberOfValues / numberOfThreadsPerBlock;

int memSize = sizeof(int) * numberOfValues;

int *values = (int *) malloc(memSize);
int *result = (int *) malloc(memSize);

int *values_device, *result_device;
cudaMalloc((void **) &values_device, memSize);
cudaMalloc((void **) &result_device, memSize);

for(int i=0; i < numberOfValues; i++) {
    values[i] = i+1;
}

cudaMemcpy(values_device, values, memSize, cudaMemcpyHostToDevice);

dim3 dimGrid(numberOfBlocks,1);
dim3 dimBlock(numberOfThreadsPerBlock,1);
int sharedMemSize = numberOfThreadsPerBlock * sizeof(int);

reduce0 <<< dimGrid, dimBlock, sharedMemSize >>>(values_device, result_device);

if (cudaSuccess != cudaGetLastError())
        printf( "Error!\n" );

cudaMemcpy(result, result_device, memSize, cudaMemcpyDeviceToHost);

Upvotes: 0

Views: 1619

Answers (1)

brano

Reputation: 2872

Could your problem be the precedence of modulo and multiplication? The two operators have equal precedence and associate left to right, so tid % 2*s is equal to (tid % 2)*s, but you want tid % (2*s).

The reason why you need to use int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T) for small numbers of threads is out-of-bounds indexing. One example of when this happens is a launch with 29 threads: when tid=28 and s=2, the branch is taken because 28 % (2*2) == 0, and you index into s_data[28+2], but you have only allocated shared memory for 29 ints.

Upvotes: 2
