Reputation: 41
When I read the programming guide, I got the impression that shared memory always improves performance, but that does not seem to be the case. I have two kernels:
#include <cstdlib>
#include <cuda_runtime.h>

const int Ntimes=1;

// Version 1: every update reads and writes global memory directly.
__global__ void testgl(float *A, float *C, int numElements){
    int ti = threadIdx.x;
    int b0 = blockDim.x*blockIdx.x;
    if (b0+ti < numElements){
        for(int i=0;i<Ntimes;i++){
            A[b0+ti]=A[b0+ti]*A[b0+ti]*10-2*A[b0+ti]+1;
        }
        C[b0+ti] = A[b0+ti]*A[b0+ti];
    }
}

// Version 2: each thread stages its element in shared memory and updates it there.
__global__ void testsh(float *A, float *C, int numElements){
    int ti = threadIdx.x;
    int b0 = blockDim.x*blockIdx.x;
    __shared__ float a[1024];
    if (b0+ti < numElements){
        a[ti]=A[b0+ti];
    }
    __syncthreads();
    if (b0+ti < numElements){
        for(int i=0;i<Ntimes;i++){
            a[ti]=a[ti]*a[ti]*10-2*a[ti]+1;
        }
        C[b0+ti] = a[ti]*a[ti];
    }
}

int main(void){
    int numElements = 500000;
    size_t size = numElements * sizeof(float);
    // Allocate the host input
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    // Allocate the host output
    float *h_C = (float *)malloc(size);
    float *h_D = (float *)malloc(size);
    // Initialize the host input (both kernels get identical data)
    for (int i = 0; i < numElements; i++){
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = h_A[i];
    }
    // Allocate the device input and output
    float *d_A = NULL; cudaMalloc((void **)&d_A, size);
    float *d_B = NULL; cudaMalloc((void **)&d_B, size);
    float *d_C = NULL; cudaMalloc((void **)&d_C, size);
    float *d_D = NULL; cudaMalloc((void **)&d_D, size);
    // Copy to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Launch both test kernels with the same configuration
    int threadsPerBlock = 1024;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    testgl<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
    testsh<<<blocksPerGrid, threadsPerBlock>>>(d_B, d_D, numElements);
    // Copy the device results to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_D, d_D, size, cudaMemcpyDeviceToHost);
    // Free device global memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cudaFree(d_D);
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    free(h_D);
    // Reset the device and exit
    cudaDeviceReset();
    return 0;
}
If Ntimes is set to 1, testgl takes 49 us and testsh takes 97 us.
If Ntimes is set to 100, testgl takes 9.7 ms and testsh takes 8.9 ms.
I do not know why testgl takes more than 100 times longer (roughly 200 times) when Ntimes only increases by a factor of 100.
So it seems that shared memory helps only when we want to do a lot of work on the device. Is that right?
The card used here is a GTX 680.
Thanks in advance.
Upvotes: 3
Views: 347
Reputation: 1301
"shared memory will always improve the performance"
That's not true. It depends on the algorithm. If the kernel's global memory accesses are perfectly coalesced and each location is accessed only once, shared memory may not help. But if you are implementing, say, a matrix multiplication, where partial sums have to be held while the inputs are reused, it will be useful.
It also helps if you access the same memory location more than once in the kernel, because shared memory is on-chip and its latency is roughly 100 times lower than that of uncached global memory.
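As an illustration of the reuse argument, here is a minimal sketch of a tiled matrix multiplication. It is not taken from the question; the kernel name matmulTiled, the tile width TILE = 16 and the matrix size N = 100 are all assumptions chosen for the example. In this variant shared memory holds the input tiles, while each thread keeps its partial sum in a register. Every element staged in shared memory is then read by TILE different threads instead of being fetched from global memory TILE times.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile width, chosen for illustration

// C = A * B for square N x N row-major matrices.
__global__ void matmulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;  // per-thread partial sum, kept in a register

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile from global memory
        // (zero-padded at the matrix edge).
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tiles must be complete before anyone reads them
        // Each staged element is now reused by TILE threads of the block.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // do not overwrite tiles that are still being read
    }
    if (row < N && col < N) C[row * N + col] = sum;
}

int main() {
    const int N = 100;  // deliberately not a multiple of TILE
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = rand() / (float)RAND_MAX;
        hB[i] = rand() / (float)RAND_MAX;
    }

    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    matmulTiled<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // Compare against a straightforward CPU reference.
    float maxErr = 0.0f;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k) ref += hA[r * N + k] * hB[k * N + c];
            maxErr = fmaxf(maxErr, fabsf(ref - hC[r * N + c]));
        }
    printf("max abs error vs CPU reference: %g\n", maxErr);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

By contrast, the kernels in the question have each thread touch only its own element, so there is no sharing between threads for shared memory to exploit.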
If your analysis shows that the kernel is bandwidth-limited, that is a good point to consider whether shared memory could improve performance. It is also a good idea to check the occupancy calculator to see whether the shared memory usage is going to affect occupancy.
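Besides the spreadsheet-style occupancy calculator, the runtime API can report the same information programmatically. The following is a minimal sketch, assuming CUDA 6.5 or later (where cudaOccupancyMaxActiveBlocksPerMultiprocessor is available) and device 0. The kernel is the testsh kernel from the question, with Ntimes fixed at 1, and the list of block sizes is an arbitrary choice for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int Ntimes = 1;

// Same kernel as in the question: 4 KB of static shared memory per block.
__global__ void testsh(float *A, float *C, int numElements) {
    int ti = threadIdx.x;
    int b0 = blockDim.x * blockIdx.x;
    __shared__ float a[1024];
    if (b0 + ti < numElements) {
        a[ti] = A[b0 + ti];
    }
    __syncthreads();
    if (b0 + ti < numElements) {
        for (int i = 0; i < Ntimes; i++) {
            a[ti] = a[ti] * a[ti] * 10 - 2 * a[ti] + 1;
        }
        C[b0 + ti] = a[ti] * a[ti];
    }
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumption: device 0 is the card of interest

    const int blockSizes[] = {128, 256, 512, 1024};  // illustrative choices
    for (int i = 0; i < 4; ++i) {
        int blockSize = blockSizes[i];
        int blocksPerSM = 0;
        // Static shared memory is accounted for automatically;
        // the last argument is the dynamic shared memory per block (none here).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, testsh, blockSize, 0);
        float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
        printf("block size %4d: %d resident blocks per SM, occupancy %.2f\n",
               blockSize, blocksPerSM, occupancy);
    }
    return 0;
}
```

If the reported occupancy drops when the shared array grows, shared memory has become the limiting resource for that launch configuration.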
"shared memory helps only when we want to do a lot of things in device?"
Partially, yes. Shared memory helps when you want to do a lot of work on the device.
In your kernel above, you access global memory more than once, so shared memory should help. It would be helpful if you could provide a complete reproducer so the code can be analyzed. It would also help to know the details of the card you are running on.
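For a reproducer, it also helps to show how the kernel times were measured. One common approach is CUDA events; the sketch below is an assumption about how this could be done, not the method used in the question. It times the testgl kernel from the question after one untimed warm-up launch, so that one-time startup costs are not counted. The input is simply zero-filled, since only the timing matters here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int Ntimes = 1;

// Same kernel as in the question.
__global__ void testgl(float *A, float *C, int numElements) {
    int ti = threadIdx.x;
    int b0 = blockDim.x * blockIdx.x;
    if (b0 + ti < numElements) {
        for (int i = 0; i < Ntimes; i++) {
            A[b0 + ti] = A[b0 + ti] * A[b0 + ti] * 10 - 2 * A[b0 + ti] + 1;
        }
        C[b0 + ti] = A[b0 + ti] * A[b0 + ti];
    }
}

int main() {
    int numElements = 500000;
    size_t size = numElements * sizeof(float);
    float *d_A = NULL, *d_C = NULL;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemset(d_A, 0, size);  // placeholder input, values are irrelevant for timing

    int threadsPerBlock = 1024;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    // Untimed warm-up launch.
    testgl<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    testgl<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    } else {
        printf("testgl: %.3f ms\n", ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A);
    cudaFree(d_C);
    return 0;
}
```

Averaging over several timed launches would give a more stable number than a single run.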
Upvotes: 2