trudiQ

Reputation: 23

CUDA, low performance when storing data in shared memory

In order to speed up my project, I want to store a value that is generated inside a kernel into shared memory. However, I found that storing that value takes a very long time. If I remove THIS LINE (see code below), storing the value becomes very fast (a 100x speed-up!).

extern __shared__ int sh_try[];

__global__ void xxxKernel (...)
{
  float v, e0, e1;
  float t;
  int count(0);
  for (...)
  {
     v = fetchTexture();
     e0 = fetchTexture();
     e1 = fetchTexture();
     t = someDeviceFunction(v, e0, e1);
     if (t > 0.0f && t < 1.0f)   // <========== THIS LINE
       count++;
  }
  sh_try[threadIdx.x] = count;   // store the per-thread result in shared memory
}

main()
{
  // sth..
  // START TIMING

  xxxKernel<<<gridDim.x, BlockDim.x, BlockDim.x*sizeof(int)>>> (...);

  cudaDeviceSynchronize();

  // END TIMING
  // sth...
}
 

In order to figure out this problem, I simplified my code so that it just stores the data into shared memory and stops. Since shared memory is the most efficient memory apart from registers, I wonder whether this high latency is normal or whether I've done something wrong. Please give me some advice! Thank you guys in advance!

trudi

Update: When I replace shared memory with global memory, it takes almost the same amount of time, 33ms without THIS LINE, 297ms with it. Is it normal that storing data to global memory takes the same amount of time as storing to shared memory? Is that also a 'compiler optimization'?

I have also looked at other, similar questions on Stack Overflow: there is a huge time gap between storing data into shared memory and not storing it, which may be caused by compiler optimization. Since it is pointless to compute data but never store it, the compiler simply removes that "dead" code.

I am not sure whether the same explanation applies in my case, since the line that changes the game is a condition: when I comment out THIS LINE, the variable count increases in every iteration; when I leave it in, count only increases when t is meaningful.

Any ideas? Please...

Upvotes: 0

Views: 163

Answers (1)

Robert Crovella

Reputation: 151849

Frequently, when large performance changes are seen as a result of relatively small code changes (such as adding or deleting a line of code in a kernel), the performance changes are not due to the actual performance impact of that line of code, but are due to the compiler making different optimization decisions, which can result in wholesale additions or deletions of machine code in your kernels.

A relatively easy way to help confirm this is to look at the generated machine code. For example, if the size of the generated machine code changes substantially due to the addition or deletion of a single line of source code, it may be the case that the compiler made an optimization decision that drastically affected the code.
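As a hypothetical illustration of this effect (the kernel names and arithmetic below are made up for demonstration, not taken from the question), compare two kernels that do the same work, where only one of them stores its result:

```cuda
// Sketch only: illustrative kernels, not the asker's actual code.
__global__ void computeAndStore(const float *in, int *out, int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        float t = in[i] * 0.5f;          // some computation
        if (t > 0.0f && t < 1.0f)
            count++;
    }
    out[threadIdx.x] = count;            // result is used: the loop must run
}

__global__ void computeOnly(const float *in, int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        float t = in[i] * 0.5f;          // identical computation
        if (t > 0.0f && t < 1.0f)
            count++;
    }
    // count is never stored, so the compiler is free to delete the
    // entire loop as dead code, making this kernel look much "faster"
}
```

The timing difference between two such kernels reflects how much code the compiler eliminated, not the cost of the single store.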

Although it's not machine code, for these purposes a reasonable proxy is to look at the generated PTX code, which is an intermediate code that the compiler creates.

You can generate PTX by simply adding the -ptx switch to your compile command:

nvcc -ptx mycode.cu

This will generate a file called mycode.ptx which you can inspect. Naturally, if your regular compile command requires extra switches (e.g., -I/path/to/include/files), then this command may require those same switches. The nvcc manual provides more information on code generation options, and there is a PTX manual to help you learn about PTX, but you may be able to get a rough idea just based on the size of the generated PTX (e.g., the number of lines in the .ptx file).
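For example, assuming your source file is named mycode.cu, one rough way to compare the two variants is to generate PTX for each and count lines (file names here are illustrative):

```shell
# Build PTX with THIS LINE present in the source
nvcc -ptx mycode.cu -o with_line.ptx

# ...comment out THIS LINE in the source, then build PTX again
nvcc -ptx mycode.cu -o without_line.ptx

# A large difference in size suggests the compiler removed code wholesale
wc -l with_line.ptx without_line.ptx
```

If the version without the line produces dramatically fewer lines of PTX, that supports the dead-code-elimination explanation.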

Upvotes: 2
