SkyWalker

Reputation: 14309

Wrong sizes for block reduction in CUDA?

I was checking out this sum_reduction.cu example and tutorial and noticed that for certain problem sizes it doesn't work, e.g. it works with problem size n=2000 but not with n=3000. Apparently it always works with problem sizes that are a multiple of the block size, but neither the tutorial nor the example code says so. The question is: does this reduction algorithm only work for certain problem sizes? In the example they chose N=256k, which is even, a power of two, and also a multiple of the block size 512.

For self-containment, I paste the most important bits of (a templated version of) the code here:

template<typename T>
__global__ void kernelSum(const T* __restrict__ input, T* __restrict__ per_block_results, const size_t n) {
    extern __shared__ T sdata[];

    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load input into __shared__ memory
    T x = 0.0;
    if (tid < n) {
        x = input[tid];
    }
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern
    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}

and to invoke the kernel:

// launch one kernel to compute, per-block, a partial sum
block_sum<double> <<<num_blocks,block_size,block_size * sizeof(double)>>>(d_input, d_partial_sums_and_total, num_elements);

// launch a single block to compute the sum of the partial sums
block_sum<double> <<<1,num_blocks,num_blocks * sizeof(double)>>>(d_partial_sums_and_total, d_partial_sums_and_total + num_blocks, num_blocks);

To my understanding, if the problem size is smaller than what the block reduction covers, the statement T x = 0.0; ensures that the out-of-range elements are zeroed out, so the reduction should still work, but it doesn't?

UPDATE: I am sorry, the float/double thing was a typo made while preparing the question and is not the real problem.

Upvotes: 0

Views: 631

Answers (1)

Robert Crovella

Reputation: 151879

  1. The code you have posted is not consistent, as your templated kernel is called kernelSum but you are invoking something called block_sum.

  2. Furthermore, I don't believe your usage of the templated kernel function could possibly be correct as written:

    block_sum<double> <<<num_blocks,block_size,block_size * sizeof(float)>>>(d_input, d_partial_sums_and_total, num_elements);
               ^                                                     ^
               |  these types are required to match                  |
    

    The kernel template is being instantiated with type double. Therefore it is expecting enough shared memory to store block_size double quantities, based on this line:

    extern __shared__ T sdata[];
    

    But you are only passing half of the required storage:

    block_size * sizeof(float)
    

    I believe that's going to give you unexpected results.

  3. The reduction as written does expect that the block dimension is a power of 2, due to this loop (a small host-side illustration of what goes wrong otherwise follows this list):

    // contiguous range pattern
    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
    

    This is not likely to be an issue on the first kernel call, because you are probably choosing a power of two for the number of threads per block (block_size):

    block_sum<double> <<<num_blocks,block_size,...
    

    However, for the second kernel call, this will depend on whether num_blocks is a power of two, which depends on your grid calculations, which you haven't shown:

    block_sum<double> <<<1,num_blocks,...
    
  4. Finally, the first kernel launch will fail if num_blocks exceeds the maximum grid dimension for your device (65535 in the x-dimension on cc 2.x and older devices). This may happen for very large data sets but probably not for a size of 3000, and it depends on your grid calculations, which you haven't shown.
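
To make item 3 concrete, here is a small host-side sketch of my own (it is not part of the tutorial or the question's code) that emulates the tree-reduction loop with a non-power-of-two thread count of 6. The offset sequence is 3, 1, so the partial sum held by thread 2 is never folded into sdata[0] and the result is 12 instead of 21:

#include <cstdio>

int main() {
    // emulate one block of 6 "threads", each holding a partial sum
    double sdata[6] = {1, 2, 3, 4, 5, 6};    // true total is 21
    int dim = 6;                             // non-power-of-two "blockDim.x"
    for (int offset = dim / 2; offset > 0; offset >>= 1)
        for (int t = 0; t < offset; ++t)     // the threads with threadIdx.x < offset
            sdata[t] += sdata[t + offset];   // reads come from indices >= offset, so serial order is equivalent
    printf("reduced = %f, expected 21\n", sdata[0]);   // prints 12; sdata[2] (= 3 + 6) was dropped
    return 0;
}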

Item 3 above is a difficult requirement to satisfy on the fly for arbitrary vector sizes. Therefore I would suggest an alternate reduction strategy to handle arbitrarily sized vectors; for that, study the CUDA reduction sample code and presentation.
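
As a rough sketch of what such a strategy can look like (this is my own illustration in the spirit of the reduction sample, not code taken from it; the kernel name sum_reduce and the launch constants 64 and 256 are arbitrary choices), a grid-stride loop lets you fix the block and grid dimensions as powers of two up front, independent of the vector size:

template <typename T>
__global__ void sum_reduce(const T* __restrict__ in, T* __restrict__ out, const size_t n) {
    extern __shared__ T sdata[];

    // grid-stride loop: each thread serially accumulates as many
    // elements as necessary, so n can be any size
    T x = 0;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += (size_t)gridDim.x * blockDim.x)
        x += in[i];
    sdata[threadIdx.x] = x;
    __syncthreads();

    // blockDim.x is chosen as a power of two at launch, so the tree is exact
    for (unsigned offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset)
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        __syncthreads();
    }

    // thread 0 writes this block's partial sum
    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];
}

// usage sketch: both launch configurations are fixed powers of two,
// regardless of the input size n
// sum_reduce<double><<<64, 256, 256 * sizeof(double)>>>(d_in, d_partial, n);
// sum_reduce<double><<<1, 64, 64 * sizeof(double)>>>(d_partial, d_result, 64);

The point of that design is that the data size only affects how many iterations the grid-stride loop runs, never the launch geometry.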

Here's a complete program, mostly based on the code you have shown, that has the above issues addressed, and seems to work for me for a size of 3000:

#include <stdio.h>
#include <stdlib.h>
#define DSIZE 3000
#define nTPB 256



template<typename T>
__global__ void block_sum(const T* __restrict__ input, T* __restrict__ per_block_results, const size_t n) {
    extern __shared__ T sdata[];

    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load input into __shared__ memory
    T x = 0.0;
    if (tid < n) {
        x = input[tid];
    }
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern
    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}


int main(){

  double *d_input, *d_partial_sums_and_total, *h_input, *h_partial_sums_and_total;

  int num_elements=DSIZE;
  int block_size = nTPB;
  int num_blocks = (num_elements + block_size -1)/block_size;
  // bump num_blocks up to the next power of 2
  int done = 0;
  int test_val = 1;
  while (!done){
   if (test_val >= num_blocks){
     num_blocks = test_val;
     done = 1;}
   else test_val *= 2;
   if (test_val > 65535) {printf("blocks failure\n"); exit(1);}
   }


  h_input = (double *)malloc(num_elements * sizeof(double));
  h_partial_sums_and_total = (double *)malloc((num_blocks+1)*sizeof(double));

  cudaMalloc((void **)&d_input, num_elements * sizeof(double));
  cudaMalloc((void **)&d_partial_sums_and_total, (num_blocks+1)*sizeof(double));

  double h_result = 0.0;
  for (int i = 0; i < num_elements; i++) {
    h_input[i] = rand()/(double)RAND_MAX;
    h_result += h_input[i];}

  cudaMemcpy(d_input, h_input, num_elements*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemset(d_partial_sums_and_total, 0, (num_blocks+1)*sizeof(double));

// launch one kernel to compute, per-block, a partial sum
  block_sum<double> <<<num_blocks,block_size,block_size * sizeof(double)>>>(d_input, d_partial_sums_and_total, num_elements);

// launch a single block to compute the sum of the partial sums
  block_sum<double> <<<1,num_blocks,num_blocks * sizeof(double)>>>(d_partial_sums_and_total, d_partial_sums_and_total + num_blocks, num_blocks);

  cudaMemcpy(h_partial_sums_and_total, d_partial_sums_and_total, (num_blocks+1)*sizeof(double), cudaMemcpyDeviceToHost);

  printf("host result   = %lf\n", h_result);
  printf("device result = %lf\n", h_partial_sums_and_total[num_blocks]);
}

For brevity/readability, I have dispensed with error checking in the above code. When having difficulty with a CUDA code, you should always do proper CUDA error checking.
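
If it helps, here is one common pattern for doing that (the macro name cudaCheck and the helper variables in the usage comment are my own, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>

// abort with a message if a CUDA runtime call returns an error
#define cudaCheck(call)                                                    \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// usage:
//   cudaCheck(cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice));
//   block_sum<double><<<num_blocks, block_size, shmem>>>(d_input, d_partial_sums_and_total, num_elements);
//   cudaCheck(cudaGetLastError());        // catches launch/configuration errors
//   cudaCheck(cudaDeviceSynchronize());   // catches errors during kernel execution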

Also, in the future, you will make it easier for others to help you if you post a complete code to demonstrate what you are doing, as I have done above.

Upvotes: 4
