John Styles

Reputation: 95

How do I calculate the number of CUDA threads being launched?

I have a CUDA card with compute capability 3.5. If I have a call such as <<<2000,512>>>, what is the number of iterations that occur within the kernel? I thought it was (2000*512), but my testing isn't confirming this. I also want to confirm that the way I'm calculating the thread variable is correct.

The situation is, within the kernel I am incrementing a passed global memory number based on the thread number :

  int thr = blockDim.x * blockIdx.x + threadIdx.x;  // globally unique thread index
  worknumber = globalnumber + thr;                  // each thread works on its own number

So, when control returns to the CPU, I want to know exactly how many increments there were, so I can keep track and not repeat or skip numbers when I call the kernel again to process my next set of numbers.
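To make the bookkeeping concrete, this is roughly what I have in mind on the host side (the loop and the names base, blocks, threadsPerBlock, and moreWorkToDo are just for illustration, not my actual code):

    // Illustrative host loop: after each launch, advance the base number by the
    // total number of threads launched (blocks * threads per block), so the next
    // launch starts where the previous one left off.
    const int blocks = 2000;
    const int threadsPerBlock = 512;
    uint64_t base = 0;
    while (moreWorkToDo) {
        // copy 'base' into the device memory pointed to by d_lkey, then launch
        allin<<<blocks, threadsPerBlock>>>(d_lkey, d_patfile);
        cudaDeviceSynchronize();
        base += (uint64_t)blocks * threadsPerBlock;   // 2000 * 512 increments per launch
    }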

Edit :

__global__ void allin(uint64_t *lkey, const unsigned char *d_patfile)
{
    uint64_t kkey;
    int tmp = 0;
    int thr = blockDim.x * blockIdx.x + threadIdx.x;
    kkey = *lkey + thr;

    if (thr > tmp) {
        tmp = thr;
        printf("%d \n", thr);
    }
}

Upvotes: 2

Views: 5329

Answers (1)

Robert Crovella

Reputation: 152164

If you launch a kernel with the configuration <<<X,Y>>>, and you have not violated any rules of CUDA usage, then the number of threads launched will, in fact, be X*Y (or a suitable modification of that if we are talking about 2 or 3 dimensional threadblocks and/or grids, i.e. X.x*X.y*X.z*Y.x*Y.y*Y.z ).
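For example, just as a sanity check of that arithmetic (the dimensions here are made up, not taken from your code):

    dim3 grid(100, 20);            // X.x * X.y * X.z = 100 * 20 * 1 = 2000 blocks
    dim3 block(16, 32);            // Y.x * Y.y * Y.z = 16 * 32 * 1  = 512 threads per block
    mykernel<<<grid, block>>>();   // 2000 * 512 = 1,024,000 threads total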

printf from a CUDA kernel has various limitations. Therefore, generating a large amount of printf output from a CUDA kernel is generally unwise and probably not useful for validating the number of threads launched in a large grid.

If you want to keep track of the number of threads that actually get launched, you could use a global variable and have each thread atomically update it. Something like this:

$ cat t848.cu
#include <stdio.h>

__device__ unsigned long long totThr = 0;   // device-global thread counter

__global__ void mykernel(){

  atomicAdd(&totThr, 1);                    // each launched thread counts itself once
}

int main(){

  mykernel<<<2000,512>>>();
  unsigned long long total;
  cudaMemcpyFromSymbol(&total, totThr, sizeof(unsigned long long));
  printf("Total threads counted: %llu\n", total);
}
}
$ nvcc -o t848 t848.cu
$ cuda-memcheck ./t848
========= CUDA-MEMCHECK
Total threads counted: 1024000
========= ERROR SUMMARY: 0 errors
$

Note that atomic operations may be relatively slow. I wouldn't recommend making regular use of such code for performance reasons. But if you want to convince yourself of the number of threads launched, it should give the correct answer.

Upvotes: 6
