Reputation: 1205
I'm working on a Kepler K20c GPU and am writing a Monte Carlo simulation for particle movement. Since my code is embarrassingly parallel, I want to use as many particles as I can fit into my memory.
On a Kepler K20c GPU, I'm limited to a maximum of 65535 blocks and 1024 threads per block. I would like to know the best practice for launching CUDA kernels when my number of particles is larger than the maximum number of blocks times threads per block.
Consider therefore the simplified case where I perform the vector addition C = A + B with number of blocks NB and number of threads per block NTpB, and let A, B, C be of dimension N = k*NTpB*NB, i.e. the total number of threads multiplied by some factor k > 1. Now usually I would start a kernel via
vecAdd<<<NB, NTpB>>>(A, B, C, N);
where my kernel could look as follows.
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
// Get our global thread ID
int id = blockIdx.x*blockDim.x+threadIdx.x;
// Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}
Now the problem with this code is that I only compute C = A + B for the first NB*NTpB elements; the remaining (k-1)*NB*NTpB entries of C remain untouched.
A solution I came up with is to instead call
vecAdd<<<NB, NTpB>>>(A, B, C, N, k, NB, NTpB);
with
__global__ void vecAdd(double *a, double *b, double *c, int n, int k, int NB, int NTpB)
{
for (int i = 0; i < k; i++){ // this is new
// Get our global thread ID, shifted by i*NB*NTpB in each iteration
int id = blockIdx.x*blockDim.x + threadIdx.x
+ i*NB*NTpB; // this is new
// Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}
}
But here I'm not sure whether this is doing the correct thing, and I also assume it is terrible in terms of efficiency, since consecutive loop iterations jump around in memory.
Are there any references dealing with this problem, or some suggestions on how to handle this in a better way?
Thank you very much.
Upvotes: 0
Views: 153
Reputation: 905
As Jez points out, GPUs of compute capability 3.0 or higher are no longer constrained by the 65535 limit. If we read Table 12, Technical Specifications per Compute Capability, in the "Maximum x-dimension of a grid of thread blocks" row, the limit for your card (CC 3.5) is 2^31 - 1.
Be aware that you should compile with -gencode arch=compute_35,code=sm_35,
or nvcc will fall back to compiling for CC 2.0 and then your code will break.
Whenever you are forced to work on a dataset larger than the usual NB*NTpB, you can use the "grid-stride loop" presented here by Mark Harris.
Here is an example:
__global__ void vecAdd(const double * __restrict__ a,
const double * __restrict__ b,
double * __restrict__ c,
int n)
{
for (int id = blockIdx.x * blockDim.x + threadIdx.x;
id < n;
id += blockDim.x * gridDim.x) {
c[id] = a[id] + b[id];
}
}
For 2D (or higher-dimensional) kernels, even the 65535 per-dimension limit is enough to cover the largest dataset that fits in GPU memory.
Upvotes: 2