silver surfer

Reputation: 258

Active warps in CUDA programming

I am trying to do performance analysis for my code using Nsight IDE.

I have taken a simple example of matrix addition.

I am calling my kernel like this:

VecAdd<<<1, BLOCK_SIZE*BLOCK_SIZE>>>(dA, dB, dC, BLOCK_SIZE*BLOCK_SIZE);

Here BLOCK_SIZE is 16.

__global__ void VecAdd(float *dA, float *dB, float *dC, int N)
{
    int i = threadIdx.x;   // one thread per element, single block
    if (i < N)
        dC[i] = dA[i] + dB[i];
}

While doing occupancy analysis, I am getting an "active warps achieved" value of 0.97.

I am not able to figure out why.

I have attached a report. Can somebody please explain why this is happening?

[screenshot of the Nsight occupancy report]

Upvotes: 0

Views: 838

Answers (1)

Greg Smith

Reputation: 11519

Achieved occupancy is a percentage: (active_warps / elapsed_cycles) / MAX_WARPS_PER_SM * 100.
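Plugging in the value from the question (assuming MAX_WARPS_PER_SM = 64, a Kepler-class limit consistent with the last paragraph below):

    achieved occupancy = (active_warps / elapsed_cycles) / MAX_WARPS_PER_SM * 100
                       = 0.97 / 64 * 100
                       ≈ 1.5%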

Your kernel launch is 1 block of 8 warps (16 * 16 = 256 threads, and 256 / 32 threads per warp = 8 warps). The achieved occupancy statistic shows you have an average of 1 warp active, which is very low. The obvious question is why this is not 8.

Since you did not provide the full source, I'll assume you modified the VecAdd CUDA SDK sample, which does 5 constant reads, 2 32-bit global loads, 1 32-bit global store, and some basic math for indexing and address calculation. This takes about 300 cycles per warp, assuming all memory operations hit in L2. That is likely, as you probably copied the arrays from host to device just prior to the launch. The kernel duration itself is likely 2-3 µs. 8 warps * 300 cycles / 2500 cycles = ~1 active warp per cycle on 1 SM.
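Spelling that estimate out (the ~1 GHz SM clock used to convert the duration into the 2500-cycle figure is an assumption for illustration):

    8 warps * 300 cycles   = 2400 active warp-cycles
    2.5 µs at ~1 GHz       ≈ 2500 elapsed cycles
    2400 / 2500            ≈ 0.96 average active warps

which matches the 0.97 value reported in the question.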

The launch overhead, work-distribution overhead, and the time each warp waits for its store to clear the write data buffer do not count as time that the 8 warps were active. If you increase the work per warp, the value will rise toward 8, which is the maximum achievable given the number of threads launched. If you also increase the grid size to saturate the device, you should be able to get close to 64 average active warps per SM.
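As a minimal sketch of both suggestions (the grid-stride loop and the N = 1 << 20 element count are illustrative assumptions, not code from the question):

__global__ void VecAdd(float *dA, float *dB, float *dC, int N)
{
    // Grid-stride loop: each thread handles several elements,
    // so every warp stays active across many iterations.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += blockDim.x * gridDim.x)
        dC[i] = dA[i] + dB[i];
}

// Host side: launch enough blocks to cover every SM.
int N = 1 << 20;
int threadsPerBlock = 256;   // 8 warps per block
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);

With many resident blocks per SM, the average active warp count approaches the hardware limit rather than the 8 warps of a single block.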

Upvotes: 1
