CUDA: Is shared memory always helpful?

Question

When I read the programming guide, I got the impression that shared memory always improves performance, but that does not seem to be the case. I have two functions:

const int Ntimes=1;

__global__ void testgl(float *A, float *C, int numElements){

    int ti = threadIdx.x;
    int b0 = blockDim.x*blockIdx.x;

    if (b0+ti < numElements){
        for(int i=0;i< ...

    ...

    testgl<<<...>>>(d_A, d_C, numElements);

    testsh<<<...>>>(d_B, d_D, numElements);

    // Copy the device results to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_D, d_D, size, cudaMemcpyDeviceToHost);


    // Free device global memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cudaFree(d_D);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    free(h_D);

    // Reset the device and exit
    cudaDeviceReset();

    return 0;
}

If Ntimes is set to 1, testgl takes 49 us and testsh takes 97 us. If Ntimes is set to 100, testgl takes 9.7 ms and testsh takes 8.9 ms.

I do not know why the Ntimes=100 run takes more than 100 times longer.

So it seems that shared memory helps only when we do a lot of work on the device. Is that right?

The card used here is a GTX 680.

Thanks in advance.

Sagar Masuti · Accepted Answer

"shared memory will always improve the performance"

That's not true. It depends on the algorithm. If the kernel's global memory accesses are perfectly coalesced and each location is accessed just once, shared memory may not help. But if you are implementing, say, a matrix multiplication, where tiles of the inputs are reused while the partial sums are accumulated, it will be useful.
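As a rough illustration of the matrix multiplication case, here is a minimal sketch of a tiled kernel. The names matmulTiled and matmulOnDevice and the tile width TILE are hypothetical (they are not from the question), the matrices are assumed to be square and row-major, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // illustrative tile width (assumption, not from the question)

// C = A * B for square n x n row-major matrices.
// Each block stages one TILE x TILE tile of A and of B in shared memory,
// so every staged element is reused TILE times instead of being re-read
// from global memory.
__global__ void matmulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // partial sum kept in a register

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done before the tile is overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: copies inputs over, launches the kernel,
// and returns the product. Error checking omitted.
std::vector<float> matmulOnDevice(const std::vector<float> &A,
                                  const std::vector<float> &B, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulTiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C((size_t)n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Here each global element is loaded once per tile and then read TILE times from shared memory, which is the kind of reuse a single-pass, perfectly coalesced kernel does not have.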

It also helps when the kernel accesses the same memory location more than once, because shared memory is on-chip and its latency is roughly 100 times lower than that of global memory.
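A simple case of that kind of reuse is a 1D stencil, where every input element is needed by several neighbouring threads. The following is only a sketch under my own assumptions: stencil1d, stencilOnDevice, RADIUS and BLOCK are hypothetical names and values, the kernel must be launched with exactly BLOCK threads per block, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define RADIUS 3    // illustrative stencil radius (assumption)
#define BLOCK  256  // threads per block; the launch must use this value

// out[i] = sum of in[i-RADIUS .. i+RADIUS], treating out-of-range inputs as 0.
// Each input element is read from global memory once per block and then
// reused from shared memory by up to 2*RADIUS+1 threads.
__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        // The first RADIUS threads also load the left and right halos.
        int left = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        tile[l + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];
        out[g] = sum;
    }
}

// Hypothetical host helper. Error checking omitted.
std::vector<float> stencilOnDevice(const std::vector<float> &in) {
    int n = (int)in.size();
    size_t bytes = (size_t)n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    stencil1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dIn, dOut, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Whether this actually beats a plain global-memory version depends on the card and on how well the caches already cover the reuse, so it is worth timing both.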

If your analysis shows that the kernel is bandwidth limited, that is a good point to consider whether shared memory could be used to improve performance. It is also a good idea to check the occupancy calculator to see whether the shared memory usage is going to reduce the occupancy.
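Besides the occupancy calculator, the runtime API can report the same thing programmatically through cudaOccupancyMaxActiveBlocksPerMultiprocessor. The sketch below is only an illustration: dynSharedKernel and occupancyFor are hypothetical names, the kernel body is a placeholder, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel that uses dynamically sized shared memory.
__global__ void dynSharedKernel(float *data, int n) {
    extern __shared__ float buf[];
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g < n) buf[threadIdx.x] = data[g];
    __syncthreads();
    if (g < n) data[g] = buf[threadIdx.x] * 2.0f;
}

// Hypothetical helper: returns the theoretical occupancy (active warps per SM
// divided by the maximum warps per SM) for a given block size and a given
// amount of dynamic shared memory per block.
float occupancyFor(int blockSize, size_t sharedBytes) {
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, dynSharedKernel, blockSize, sharedBytes);

    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (float)(blocksPerSM * warpsPerBlock) / (float)maxWarpsPerSM;
}
```

Comparing, for example, occupancyFor(256, 1024) with occupancyFor(256, 32 * 1024) gives a quick idea of how much a larger shared memory footprint per block would cost in resident warps on the card at hand.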

"shared memory helps only when we want to do a lot of things in device?"

Partially, yes. Shared memory helps when we want to do a lot of work on the device.

In your case, since the kernel above accesses global memory more than once, shared memory should help. It would be useful if you could provide a complete reproducer so the code can be analyzed, and also to know the details of the card you are running on.
