Reputation: 9216
I implemented some CUDA code. It runs fine, but the algorithm inherently produces strong thread divergence. This is expected.
I will try to reduce the divergence later, but for the moment I would be happy just to be able to measure it.
Is there an easy way (preferably a runtime API call or a CLI tool) to check how many of my initially scheduled warps and/or threads are still active?
Upvotes: 0
Views: 701
Reputation: 9216
I found a solution that gives me pretty nice results. Calling the following function from selected lines of a kernel (adapted with a suitable filter condition) prints the number of active threads of the current warp:
__device__ void printConvergentThreadCount(int line) // Pass __LINE__
{
    const int count = __popc(__activemask()); // Count the lanes currently active in this warp
    const int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId == 0) // Filter
    {
        printf("Line %i: %i\n", line, count);
    }
}
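For illustration, it can be called from a kernel like this (a minimal sketch; the kernel and its branch condition are made up purely to force divergence):

__global__ void divergentKernel(int* out) // Hypothetical kernel, for illustration only
{
    const int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    printConvergentThreadCount(__LINE__); // Full warp still active here (typically prints 32)
    if (threadId % 2 == 0) // Forces divergence: only even lanes take this branch
    {
        printConvergentThreadCount(__LINE__); // Only half of the warp active (typically prints 16)
        out[threadId] = threadId;
    }
}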
Still, this doesn't give me numbers while the kernels are running; the printf output only shows up once a kernel has completed.
Upvotes: 1
Reputation: 1340
Besides the solutions given in the comments, you can use Nsight Compute
to profile your kernels. You can try its CLI and then see the results in its GUI, e.g.:
ncu --export output --force-overwrite --target-processes application-only \
--replay-mode kernel --kernel-regex-base function --launch-skip-before-match 0 \
--section InstructionStats \
--section Occupancy \
--section SchedulerStats \
--section SourceCounters \
--section WarpStateStats \
--sampling-interval auto \
--sampling-max-passes 5 \
--profile-from-start 1 --cache-control all --clock-control base \
--apply-rules yes --import-source no --check-exit-code yes \
your-application [arguments]
Then, in its GUI, you can see some useful information. For example, the Source Counters section reports per-source-line statistics such as branch efficiency and divergent branches.
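If you prefer plain numbers from the CLI instead of the full report, you can also query individual metrics with --metrics. As far as I know, the old nvprof metrics branch_efficiency and warp_execution_efficiency correspond to the following Nsight Compute metric names, but please double-check them against your ncu version:

ncu --metrics smsp__sass_average_branch_targets_threads_uniform.pct,smsp__thread_inst_executed_per_inst_executed.ratio \
    your-application [arguments]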
Upvotes: 2