Reputation: 95
I have a CUDA card with compute capability 3.5. If I launch a kernel with <<<2000,512>>>, how many iterations occur within the kernel? I thought it was 2000*512, but my testing doesn't confirm this. I also want to confirm that the way I'm calculating the variable is correct.
The situation is this: within the kernel I am incrementing a number passed in through global memory, based on the thread number:
int thr = blockDim.x * blockIdx.x + threadIdx.x;
worknumber = globalnumber + thr;
So, when I return to the CPU, I want to know exactly how many increments there were, so that I don't repeat or skip numbers when I call the kernel again to process my next set of numbers.
Edit:
__global__ void allin(uint64_t *lkey, const unsigned char *d_patfile)
{
    uint64_t kkey;
    int tmp;
    int thr = blockDim.x * blockIdx.x + threadIdx.x;
    kkey = *lkey + thr;
    if (thr > tmp) {
        tmp = thr;
        printf("%u \n", thr);
    }
}
Upvotes: 2
Views: 5329
Reputation: 152164
If you launch a kernel with the configuration <<<X,Y>>>, and you have not violated any rules of CUDA usage, then the number of threads launched will, in fact, be X*Y (or a suitable modification of that if we are talking about 2- or 3-dimensional threadblocks and/or grids, i.e. X.x*X.y*X.z*Y.x*Y.y*Y.z).
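As a small illustration of that product, here is a host-side sketch that computes the thread count for an arbitrary launch configuration. The helper name totalThreads is hypothetical (it is not part of the CUDA API), and the result only describes a launch that actually succeeded.

```cuda
#include <cuda_runtime.h>  // for dim3

// Hypothetical helper: number of threads created by <<<grid, block>>>.
// The arithmetic is widened to 64 bits because the product of six
// 32-bit dimensions can exceed the range of unsigned int.
unsigned long long totalThreads(dim3 grid, dim3 block) {
    unsigned long long blocks   = (unsigned long long)grid.x * grid.y * grid.z;
    unsigned long long perBlock = (unsigned long long)block.x * block.y * block.z;
    return blocks * perBlock;
}
```

For the configuration in the question, totalThreads(dim3(2000), dim3(512)) evaluates to 2000*512 = 1024000.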
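printf from a CUDA kernel has various limitations. In particular, its output goes through a fixed-size buffer, so output can be lost when a large grid prints more than the buffer holds. Therefore, generating a large amount of printf output from a CUDA kernel is generally unwise and probably not useful for validating the number of threads launched in a large grid.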
If you want to keep track of the number of threads that actually get launched, you could use a global variable and have each thread atomically update it. Something like this:
$ cat t848.cu
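#include <stdio.h>

// Device-side counter, incremented once by every thread that runs.
__device__ unsigned long long totThr = 0;

__global__ void mykernel(){
    atomicAdd(&totThr, 1ULL);
}

int main(){
    mykernel<<<2000,512>>>();
    unsigned long long total;
    // Blocking copy: returns only after the kernel has finished.
    cudaMemcpyFromSymbol(&total, totThr, sizeof(unsigned long long));
    // %llu is the matching format specifier for unsigned long long.
    printf("Total threads counted: %llu\n", total);
}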
$ nvcc -o t848 t848.cu
$ cuda-memcheck ./t848
========= CUDA-MEMCHECK
Total threads counted: 1024000
========= ERROR SUMMARY: 0 errors
$
Note that atomic operations may be relatively slow, so I wouldn't recommend using this kind of counter routinely in performance-sensitive code. But if you want to convince yourself of the number of threads launched, it should give the correct answer.
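For the bookkeeping described in the question, a device-side counter is probably unnecessary: if every launch succeeds, each one covers blocks*threads consecutive values, so the host can advance the base by that product. The following is only a sketch under that assumption. The names process and runBatches are hypothetical, the base is passed by value rather than through a pointer as in the question, and error checking is omitted.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread derives its own work item from the base.
__global__ void process(uint64_t base) {
    // Widen before multiplying so the index cannot overflow 32 bits.
    uint64_t thr = (uint64_t)blockDim.x * blockIdx.x + threadIdx.x;
    uint64_t worknumber = base + thr;
    (void)worknumber;  // the real work on worknumber would go here
}

// Runs `launches` batches and returns the first value not yet processed.
uint64_t runBatches(uint64_t base, int launches, unsigned blocks, unsigned threads) {
    for (int i = 0; i < launches; ++i) {
        process<<<blocks, threads>>>(base);
        cudaDeviceSynchronize();             // wait for this batch to finish
        base += (uint64_t)blocks * threads;  // next batch starts where this one ended
    }
    return base;
}
```

With <<<2000,512>>>, three launches starting from 0 leave the base at 3*1024000 = 3072000, so no value is repeated or skipped between calls.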
Upvotes: 6