Hua Ye

Reputation: 41

cuda : Is shared memory always helpful?

When I read the programming guide, I got the impression that shared memory will always improve performance, but it seems that is not the case. I have two kernels:

const int Ntimes=1;

// Global-memory version: the loop reads and writes A[] in global memory.
__global__ void testgl(float *A, float *C, int numElements){

    int ti = threadIdx.x;
    int b0 = blockDim.x*blockIdx.x;

    if (b0+ti < numElements){
        for(int i=0;i<Ntimes;i++){
            A[b0+ti]=A[b0+ti]*A[b0+ti]*10-2*A[b0+ti]+1;
        }
        C[b0+ti] = A[b0+ti]*A[b0+ti];
    }
}


// Shared-memory version: A[] is first staged into the on-chip array a[].
__global__ void testsh(float *A, float *C, int numElements){

    int ti = threadIdx.x;
    int b0 = blockDim.x*blockIdx.x;

    __shared__ float a[1024];   // one element per thread (blockDim.x is 1024 here)

    if (b0+ti < numElements){
        a[ti]=A[b0+ti];
    }

    __syncthreads();

    if (b0+ti < numElements){
        for(int i=0;i<Ntimes;i++){
            a[ti]=a[ti]*a[ti]*10-2*a[ti]+1;
        }
        C[b0+ti] = a[ti]*a[ti];
    }
}

int main(void){

    int numElements = 500000;
    size_t size = numElements * sizeof(float);

    // Allocate the host input
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);

    // Allocate the host output
    float *h_C = (float *)malloc(size);
    float *h_D = (float *)malloc(size);


    // Initialize the host input
    for (int i = 0; i < numElements; i++){
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = h_A[i];
    }

    // Allocate the device input
    float *d_A = NULL; cudaMalloc((void **)&d_A, size);
    float *d_B = NULL; cudaMalloc((void **)&d_B, size);
    float *d_C = NULL; cudaMalloc((void **)&d_C, size);
    float *d_D = NULL; cudaMalloc((void **)&d_D, size);


    //Copy to Device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);


    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 1024;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    
    testgl<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);

    testsh<<<blocksPerGrid, threadsPerBlock>>>(d_B, d_D, numElements);

    // Copy the device result to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_D, d_D, size, cudaMemcpyDeviceToHost);


    // Free device global memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cudaFree(d_D);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    free(h_D);

    // Reset the device and exit
    cudaDeviceReset();

    return 0;
}

If Ntimes is set to 1, testgl costs 49 us and testsh costs 97 us. If Ntimes is set to 100, testgl costs 9.7 ms and testsh costs 8.9 ms.

I do not know why increasing Ntimes by a factor of 100 makes the kernels more than 100 times slower.

So it seems shared memory helps only when we want to do a lot of work on the device. Is that right?

The card used here is GTX680.
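For reference, one way such per-kernel timings can be taken is with CUDA events; a minimal sketch (not necessarily how the numbers above were measured) that would slot into the main() above, around the existing testgl launch:

```cuda
    // Sketch: timing a kernel launch with CUDA events (requires nvcc).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    testgl<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);              // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("testgl: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
```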

Thanks in advance.

Upvotes: 3

Views: 347

Answers (1)

Sagar Masuti

Reputation: 1301

shared memory will always improve the performance

That's not true. It depends on the algorithm. If the kernel has perfectly coalesced memory accesses and touches each global-memory location only once, shared memory may not help. But if you are implementing, say, matrix multiplication, where partial sums need to be held and the same inputs are reused, it will be useful.
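To illustrate the matrix-multiplication case (this sketch is mine, not from the original answer), here each element loaded into shared memory is read TILE times by different threads, which is where the benefit comes from; n is assumed to be a multiple of TILE:

```cuda
#define TILE 16

// C = A * B for square n x n matrices, n a multiple of TILE.
// Launch with dim3 grid(n/TILE, n/TILE), dim3 block(TILE, TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n){
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;                  // partial sum held in a register

    for (int t = 0; t < n / TILE; ++t){
        // Each thread stages one element of A and one of B into shared memory...
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // ...and each staged element is then read TILE times,
        // amortizing its single global-memory load.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = sum;
}
```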

It also helps if you access the same memory location more than once in the kernel, since shared memory is on-chip and its latency is roughly 100 times lower than global memory's.

If your analysis shows the kernel is bandwidth limited, that is a good point to consider whether there is scope for using shared memory to increase performance. It is also a good strategy to check the occupancy calculator to see whether the shared-memory usage will affect occupancy.
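The occupancy impact can also be queried programmatically; a sketch (my addition, assuming CUDA 6.5 or later, where this runtime API exists) using the testsh kernel from the question:

```cuda
    int numBlocks = 0;
    int threadsPerBlock = 1024;
    size_t dynamicSmem = 0;  // testsh uses only statically declared shared memory

    // How many blocks of testsh can be resident per SM, given its
    // register and shared-memory footprint?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, testsh,
                                                  threadsPerBlock, dynamicSmem);
    printf("resident blocks per SM: %d\n", numBlocks);
```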

shared memory helps only when we want to do a lot of work on the device?

Partially, yes. Shared memory helps when we want to do a lot of work on the device.

In your case, since the above kernel accesses global memory more than once, shared memory should help. It would be helpful if you could provide a complete reproducer to analyze, and also to know the details of the card you are running on.

Upvotes: 2
