Reputation: 3311
I have two kernels:
template <typename T>
__global__ void bpcKernel(T* finalOutputPtr, const T* heatMapPtr, const T* peaksPtrA, const T* peaksPtrB, const unsigned int* bodyPartPairsPtr, const unsigned int* mapIdxPtr, const int POSE_MAX_PEOPLE, const int TOTAL_BODY_PARTS, const int heatmapWidth, const int heatmapHeight)
{
    const auto i = (blockIdx.x * blockDim.x) + threadIdx.x;
    const auto j = (blockIdx.y * blockDim.y) + threadIdx.y;
    const auto k = (blockIdx.z * blockDim.z) + threadIdx.z;
    const T* bodyPartA = peaksPtrA + (bodyPartPairsPtr[i*2]*POSE_MAX_PEOPLE*3 + j*3);
    const T* bodyPartB = peaksPtrB + (bodyPartPairsPtr[i*2 + 1]*POSE_MAX_PEOPLE*3 + k*3);
    finalOutputPtr[i*POSE_MAX_PEOPLE*POSE_MAX_PEOPLE + j*POSE_MAX_PEOPLE + k] = -1;
    if (bodyPartA[2] >= 0.05 && bodyPartB[2] >= 0.05) {
        //finalOutputPtr[i*POSE_MAX_PEOPLE*POSE_MAX_PEOPLE + j*POSE_MAX_PEOPLE + k] = -1;
    }
}
This one evaluates the if condition, but every thread writes to finalOutputPtr unconditionally.
template <typename T>
__global__ void bpcKernel(T* finalOutputPtr, const T* heatMapPtr, const T* peaksPtrA, const T* peaksPtrB, const unsigned int* bodyPartPairsPtr, const unsigned int* mapIdxPtr, const int POSE_MAX_PEOPLE, const int TOTAL_BODY_PARTS, const int heatmapWidth, const int heatmapHeight)
{
    const auto i = (blockIdx.x * blockDim.x) + threadIdx.x;
    const auto j = (blockIdx.y * blockDim.y) + threadIdx.y;
    const auto k = (blockIdx.z * blockDim.z) + threadIdx.z;
    const T* bodyPartA = peaksPtrA + (bodyPartPairsPtr[i*2]*POSE_MAX_PEOPLE*3 + j*3);
    const T* bodyPartB = peaksPtrB + (bodyPartPairsPtr[i*2 + 1]*POSE_MAX_PEOPLE*3 + k*3);
    //finalOutputPtr[i*POSE_MAX_PEOPLE*POSE_MAX_PEOPLE + j*POSE_MAX_PEOPLE + k] = -1;
    if (bodyPartA[2] >= 0.05 && bodyPartB[2] >= 0.05) {
        finalOutputPtr[i*POSE_MAX_PEOPLE*POSE_MAX_PEOPLE + j*POSE_MAX_PEOPLE + k] = -1;
    }
}
This kernel performs the same operation, but only writes when both conditions are satisfied.
Yet for some reason, the second kernel takes about 6 ms longer to compute; it's almost 4 times slower. Why is this the case?
Upvotes: 0
Views: 338
Reputation: 15951
Although the difference in code may seem minor, the two kernels you have here perform some very different computations if you think about it. The first kernel just uniformly fills a buffer with -1; the compiler can and will simply optimize away the loads from bodyPartPairsPtr (and the dependent loads through bodyPartA and bodyPartB), since no observable behavior depends on their results. The second kernel loads two unsigned int values from memory, uses them as offsets to load two further values, and depending on those it will or will not write a -1 to the buffer. So while the first kernel performs just a single, potentially perfectly coalesced store, the second kernel performs four loads and a dependent store. And that is ignoring details such as the two additional constant memory loads it needs to fetch the extra kernel parameters, which are not used in the first kernel. From that perspective, it should be no surprise that the second kernel is slower; it simply produces a lot more memory transfer.
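To illustrate, here is a rough sketch of what the first kernel likely boils down to after dead-code elimination (bpcKernelAsCompiled is a hypothetical name for illustration; this is not actual compiler output):

template <typename T>
__global__ void bpcKernelAsCompiled(T* finalOutputPtr, const int POSE_MAX_PEOPLE)
{
    const auto i = (blockIdx.x * blockDim.x) + threadIdx.x;
    const auto j = (blockIdx.y * blockDim.y) + threadIdx.y;
    const auto k = (blockIdx.z * blockDim.z) + threadIdx.z;
    // The empty if body makes all the loads dead, so only the
    // unconditional store survives: one store per thread, zero loads.
    finalOutputPtr[i*POSE_MAX_PEOPLE*POSE_MAX_PEOPLE + j*POSE_MAX_PEOPLE + k] = -1;
}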
As always with performance questions, there is only one way to find the definitive answer: profiling. But if you go ahead and profile your kernels (e.g., with Nsight Compute's memory workload analysis), I would expect you to find them limited by memory transfer. And most likely you will see roughly a 4× difference in memory transfer between the two, which would explain your results…
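If you just want to double-check the raw timings on the host side before reaching for a profiler, a minimal sketch using CUDA events (grid, block, and the argument list are placeholders for your actual launch; wrap this around each kernel in turn):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
bpcKernel<<<grid, block>>>(/* your actual arguments */);
cudaEventRecord(stop);

cudaEventSynchronize(stop);               // wait for the kernel to finish
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);  // elapsed GPU time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);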
Upvotes: 2