Why is this statement in a CUDA kernel slow?

Question

I am doing some computer vision work using CUDA. The following code takes about 20 seconds to complete.

__global__ void nlmcuda_kernel(float* fpOMul, /*other input args*/) {

    float fpODenoised[75];

    /* Do awesome stuff to compute fpODenoised */

    // Inside nested loops (this is the statement that is the bottleneck in the code):
    fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = fpODenoised[ii * iwl + iindex];

}

If I replace that statement with

fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = 2.0f;

the code takes only a couple of seconds to complete.

Why is the specified statement slow, and how can I make it run fast?

Robert Crovella · Accepted Answer

When you make that change, the compiler can see that all of your awesome fpODenoised code is no longer needed and can optimize it away. The statement you modified is not the direct cause of the performance difference; the work that produces the value being stored is. You can verify this by comparing the PTX or SASS code generated in each case.
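To see the effect in isolation, here is a minimal sketch, not the original kernel. The names `expensive`, `store_computed`, `store_constant`, and `time_variant` are hypothetical stand-ins invented for illustration, and the loop inside `expensive` only plays the role of the "awesome stuff". The two kernels differ only in the value they store, which mirrors the change made in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the expensive per-thread computation.
__device__ float expensive(int i) {
    float acc = 0.0f;
    for (int k = 1; k <= 10000; ++k) {
        acc += sinf(i * 0.001f * k) / k;
    }
    return acc;
}

// Variant A: the computed value is written to global memory,
// so the compiler has to keep the computation.
__global__ void store_computed(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = expensive(i);
        out[i] = v;
    }
}

// Variant B: the computed value is never observed, so an optimizing
// compiler is free to remove the call to expensive() entirely.
__global__ void store_constant(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = expensive(i);  // result unused: a candidate for removal
        (void)v;
        out[i] = 2.0f;
    }
}

// Launches one variant, returns the elapsed time in milliseconds,
// and copies out[0] back to the host through first_value.
float time_variant(bool computed, int n, float* first_value) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    if (computed) {
        store_computed<<<blocks, threads>>>(d_out, n);
    } else {
        store_constant<<<blocks, threads>>>(d_out, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(first_value, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}
```

Assuming the file is saved as dce_demo.cu (a hypothetical name), `nvcc -ptx dce_demo.cu -o dce_demo.ptx` emits the PTX for both kernels, and `cuobjdump -sass` on the compiled object or executable shows the SASS. In an optimized build you would expect the body of `store_constant` to be much shorter than that of `store_computed`, though the exact output depends on the compiler version and flags. Building with `-G` disables device code optimizations, so the difference may disappear in a debug build.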
