Reputation: 15744
Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Upvotes: 1
Views: 284
Reputation: 11549
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx and write values to memory. Group data by the three variables to find the warp size. This is even easier if you limit the launch to a single block then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
All threads in the block that have the same clock time on CC1.0 - 3.5 devices (may change in the future) will have the same clock time.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of of 1 block with increasing thread count per launch. Determine when the thread count causing warps_launched to go from 1 to 2.
Upvotes: 2
Reputation: 152279
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000
__device__ volatile int test2 = 0;
__device__ int test3 = 32767;
__global__ void kernel(){
for (int i = 0; i < LOOPS; i++){
unsigned long time = clock64();
// while (clock64() < (time + (threadIdx.x * 1000)));
int start = test2;
atomicAdd((int *)&test2, 1);
int end = test2;
int diff = end - start;
atomicMin(&test3, diff);
}
}
int main() {
kernel<<<1, 1024>>>();
int result;
cudaMemcpyFromSymbol(&result, test3, sizeof(int));
printf("result = %d threads\n", result);
return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requres a large number of iterations (LOOPS
) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS
= 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Upvotes: 2