Empirically determining how many threads are in a warp

Question

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?

Robert Crovella · Accepted Answer

Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.

In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:

#include 
#define LOOPS 100000

__device__ volatile int test2 = 0;
__device__ int test3 = 32767;

__global__ void kernel(){

  for (int i = 0; i < LOOPS; i++){
    unsigned long time = clock64();
//    while (clock64() < (time + (threadIdx.x * 1000)));
    int start = test2;
    atomicAdd((int *)&test2, 1);
    int end = test2;
    int diff = end - start;
    atomicMin(&test3, diff);
    }
}

int main() {

   kernel<<<1, 1024>>>();
   int result;
   cudaMemcpyFromSymbol(&result, test3, sizeof(int));
   printf("result = %d threads
", result);
   return 0;
}

I compile with:

nvcc -O3 -arch=sm_20 -o t331 t331.cu

I call it "statistical" because it requres a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.

We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1

*my test case is CUDA 5, Quadro5000, RHEL 5.5

Empirically determining how many threads are in a warp

Answers (2)

Related Questions