Maximum number of threads for a CUDA kernel on Tesla M2050

Question

I am testing the maximum number of threads for a simple kernel, and I find that the total number of threads cannot exceed 4096. The code is as follows:

#include <stdio.h>
#define N 100

__global__ void test(){
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    double *p;
    size_t size=N*sizeof(double);
    cudaMalloc(&p, size);
    test<<<64,128>>>();
    //test<<<64,128>>>();
    cudaFree(p);
    return 0;
}

My test environment: CUDA 4.2.9 on Tesla M2050. The code is compiled with

 nvcc -arch=sm_20 test.cu

While checking the output, I found that some combinations are missing. Running the command

./a.out|wc -l

I always get 4096. When I check the limits for compute capability 2.0, I can only find that the maximum number of blocks for the x, y, z dimensions is (1024,1024,512) and that the maximum number of threads per block is 1024. The kernel launches (either <<<64,128>>> or <<<128,64>>>) are well within those limits. Any idea?

NB: The CUDA memory operations are there to block the code so that the output from the kernel will be shown.

talonmies · Accepted Answer

You are abusing kernel printf, and using it to judge how many threads you can run is a completely nonsensical idea. The runtime has a limited buffer size for printf output, and you are simply overflowing it with output when you run enough threads. There is an API to query and set the printf buffer size, using cudaDeviceGetLimit and cudaDeviceSetLimit (thanks to Robert Crovella for the link to the printf documentation in comments).
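For illustration, here is a minimal sketch of how the buffer could be queried and enlarged before the launch. The 16 MiB size and the `checkCuda` helper are assumptions made for this example, not values or code from the question; the runtime calls themselves (`cudaDeviceGetLimit`, `cudaDeviceSetLimit` with `cudaLimitPrintfFifoSize`) are the ones mentioned above. Note that the limit has to be set before any kernel that uses printf is launched, and that the buffer is circular, so older output is overwritten once it fills up.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): report a failed runtime call.
static bool checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void test() {
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    // Query the current size of the device-side printf buffer.
    size_t fifoBytes = 0;
    if (!checkCuda(cudaDeviceGetLimit(&fifoBytes, cudaLimitPrintfFifoSize),
                   "cudaDeviceGetLimit")) return 1;
    fprintf(stderr, "printf buffer before: %lu bytes\n", (unsigned long)fifoBytes);

    // Assumption: 16 MiB is an arbitrary illustrative size, not a computed minimum.
    // This must happen before the first kernel that calls printf is launched.
    const size_t newBytes = 16 * 1024 * 1024;
    if (!checkCuda(cudaDeviceSetLimit(cudaLimitPrintfFifoSize, newBytes),
                   "cudaDeviceSetLimit")) return 1;

    test<<<64, 128>>>();
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;

    // Synchronizing flushes the printf buffer to stdout; no dummy
    // cudaMalloc/cudaFree pair is needed for that.
    if (!checkCuda(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;
    return 0;
}
```

Compiled the same way as in the question, `./a.out | wc -l` should then report 8192 lines (64 blocks of 128 threads), provided the enlarged buffer really is big enough for all the output.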

You can find the maximum number of threads a given kernel can run by looking here in the documentation.
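The same limits can also be read at run time instead of from the documentation table. The sketch below assumes device 0 and reuses the `test` kernel from the question; `maxThreads1D` is a hypothetical helper added only for this example. `cudaGetDeviceProperties` reports the device-wide limits, while `cudaFuncGetAttributes` reports the per-kernel block size limit, which can be lower than the device limit if the kernel uses many registers or a lot of shared memory.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void test() {
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

// Hypothetical helper: upper bound on the thread count of a 1D launch,
// i.e. (max blocks in x) * (max threads per block).
static unsigned long long maxThreads1D(int maxGridX, int maxThreadsPerBlock) {
    return (unsigned long long)maxGridX * (unsigned long long)maxThreadsPerBlock;
}

int main(void) {
    // Device-wide limits. Assumption: the GPU of interest is device 0.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("max threads in a 1D launch: %llu\n",
           maxThreads1D(prop.maxGridSize[0], prop.maxThreadsPerBlock));

    // Per-kernel limit, which depends on the resources this kernel uses.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, test) != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }
    printf("max threads per block for test(): %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```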
