Reputation: 1205
I'm working on a Kepler K20c GPU and am writing a Monte Carlo simulation for particle movement. Since my code is embarrassingly parallel, I want to use as many particles as I can fit into my memory.
On a Kepler K20c GPU, I'm limited to a maximum of 65535 blocks and 1024 threads per block. I would like to know the best practice for launching CUDA kernels when my number of particles is larger than the maximum number of blocks times threads per block.
Consider therefore the simplified case where I perform the vector addition C = A + B with number of blocks NB and number of threads per block NTpB, and let A, B, C be of dimension N = k*NTpB*NB, i.e. the total number of threads multiplied by some factor k > 1. Now usually I would start a kernel via
vecAdd<<<NB, NTpB>>>(A, B, C, N);
where my kernel could look as follows.
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
// Get our global thread ID
int id = blockIdx.x*blockDim.x+threadIdx.x;
// Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}
Now the problem with this code is that I only compute C = A + B for the first NB*NTpB elements; the remaining (k-1)*NB*NTpB entries of C remain untouched.
A solution I came up with is to instead call
vecAdd<<<NB, NTpB>>>(A, B, C, N, k, NB, NTpB);
with
__global__ void vecAdd(double *a, double *b, double *c, int n, int k, int NB, int NTpB)
{
for (int i = 0; i < k; i++){ // this is new
// Get our global thread ID, shifted by i*NB*NTpB in each iteration
int id = blockIdx.x*blockDim.x + threadIdx.x
+ i*NB*NTpB; // this is new
// Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}
}
But here I'm not sure whether this is doing the correct thing, and I also assume it is terrible in terms of efficiency, since consecutive loop iterations jump around in memory.
Are there any references dealing with this problem, or some suggestions on how to handle this in a better way?
Thank you very much.
Upvotes: 0
Views: 153
Reputation: 905
As Jez points out, GPUs of compute capability 3.0 or higher are no longer constrained by the 65535 limit. If we read Table 12, Technical Specifications per Compute Capability, in the "Maximum x-dimension of a grid of thread blocks" row, the limit for your card (CC 3.5) is 2^31 - 1.
Be aware that you should compile with -gencode arch=compute_35,code=sm_35,
or nvcc will fall back to compiling for CC 2.0 and then your code will break.
Whenever you are forced to work on a dataset larger than the usual NB*NTpB, you can use the "grid-stride loop" presented here by Mark Harris.
Here is an example:
__global__ void vecAdd(const double * __restrict__ a,
const double * __restrict__ b,
double * __restrict__ c,
int n)
{
for (int id = blockIdx.x * blockDim.x + threadIdx.x;
id < n;
id += blockDim.x * gridDim.x) {
c[id] = a[id] + b[id];
}
}
For 2D (or higher-dimensional) kernels, even the 65535 per-dimension limit is enough to cover the largest dataset that fits in GPU memory.
Upvotes: 2