Roger Dahl

Reputation: 15744

Empirically determining how many threads are in a warp

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?

Upvotes: 1

Views: 284

Answers (2)

Greg Smith

Reputation: 11549

Here are several easy solutions. There are other solutions that use warp-synchronous programming; however, many of those will not work across all devices.

SOLUTION 1: Launch one or more blocks with the maximum threads per block, have each thread read the special registers %smid and %warpid along with blockIdx, and write the values to memory. Group the data by those three variables to find the warp size. This is even easier if you limit the launch to a single block; then you only need %warpid.
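A minimal sketch of what SOLUTION 1 could look like for a single-block launch (this code is mine, not part of the answer; the kernel and variable names are invented for illustration):

#include <cstdio>

// Each thread records the SM it runs on (%smid) and its hardware warp slot
// (%warpid). With a single block, all threads that report the same %warpid
// belong to one warp, so the size of that group is the warp size.
__global__ void record_warp_ids(unsigned *smid, unsigned *warpid)
{
    unsigned s, w;
    asm volatile("mov.u32 %0, %%smid;"   : "=r"(s));
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(w));
    smid[threadIdx.x]   = s;
    warpid[threadIdx.x] = w;
}

int main()
{
    const int N = 1024;                    // one block with max threads per block
    unsigned *d_smid, *d_warpid, h_smid[N], h_warpid[N];
    cudaMalloc((void **)&d_smid,   N * sizeof(unsigned));
    cudaMalloc((void **)&d_warpid, N * sizeof(unsigned));
    record_warp_ids<<<1, N>>>(d_smid, d_warpid);
    cudaMemcpy(h_smid,   d_smid,   N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_warpid, d_warpid, N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    // count the threads that share thread 0's (smid, warpid) pair
    int count = 0;
    for (int i = 0; i < N; i++)
        if (h_smid[i] == h_smid[0] && h_warpid[i] == h_warpid[0]) count++;
    printf("estimated warp size = %d\n", count);
    return 0;
}

With more than one block, the grouping would additionally need blockIdx, as described above.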

SOLUTION 2: Launch one block with the maximum threads per block and read the special register %clock. This requires the following assumptions, which can be shown to be true on CC 1.0-3.5 devices:

  • %clock is defined as an unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
  • all threads in a warp read the same value for %clock
  • due to warp launch latency and instruction fetch, warps on the same SM but on different warp schedulers cannot issue the first instruction of a warp on the same cycle

On CC 1.0 - 3.5 devices (this may change in the future), all threads in the block that read the same %clock value belong to the same warp.
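Under those assumptions, a sketch of SOLUTION 2 might look like this (again my own illustration, not code from the answer): every thread stores the %clock value it observes at the start of the kernel, and the host counts how many threads recorded the same value as thread 0.

#include <cstdio>

// clock() reads the per-SM cycle counter %clock. Per the assumptions above,
// all threads of one warp observe the same value, while two warps cannot
// issue their first instruction on the same cycle, so the number of threads
// that share a clock value is the warp size.
__global__ void record_clock(unsigned *out)
{
    out[threadIdx.x] = (unsigned)clock();
}

int main()
{
    const int N = 1024;                    // one block with max threads per block
    unsigned *d_out, h_out[N];
    cudaMalloc((void **)&d_out, N * sizeof(unsigned));
    record_clock<<<1, N>>>(d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    int count = 0;
    for (int i = 0; i < N; i++)
        if (h_out[i] == h_out[0]) count++;   // threads sharing thread 0's clock value
    printf("estimated warp size = %d\n", count);
    return 0;
}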

SOLUTION 3: Use the Nsight VSE or cuda-gdb debugger. The warp state views show sufficient information to determine the warp size. It is also possible to single-step and watch the PC address change for each thread.

SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of 1 block with an increasing thread count per launch. Determine the thread count that causes warps_launched to go from 1 to 2.
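With nvprof, for example, such a sweep could be driven by something like the following for each thread count (warps_launched is a hardware event on these devices; the exact invocation and the ./app placeholder are my own illustration):

nvprof --events warps_launched ./app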

Upvotes: 2

Robert Crovella

Reputation: 152279

Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.

In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct (which has some warp-synchronous implications), but struggled with that and came up with this approach instead:

#include <stdio.h>
#define LOOPS 100000

__device__ volatile int test2 = 0;   // shared counter that every thread increments
__device__ int test3 = 32767;        // running minimum of the observed increment gap

__global__ void kernel(){

  for (int i = 0; i < LOOPS; i++){
    unsigned long long time = clock64();
//    while (clock64() < (time + (threadIdx.x * 1000)));
    int start = test2;               // read the counter before this thread's increment
    atomicAdd((int *)&test2, 1);
    int end = test2;                 // read it again after the increment
    int diff = end - start;          // increments (from any thread) that landed in between
    atomicMin(&test3, diff);         // keep the smallest gap observed by any thread
  }
}

int main() {

   kernel<<<1, 1024>>>();
   int result;
   cudaMemcpyFromSymbol(&result, test3, sizeof(int));   // copy the minimum back to the host
   printf("result = %d threads\n", result);
   return 0;
}

I compile with:

nvcc -O3 -arch=sm_20 -o t331 t331.cu

I call it "statistical" because it requires a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.

We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1.

*my test case is CUDA 5, Quadro5000, RHEL 5.5

Upvotes: 2
