Nasser Kurd

Why is CUDA shared memory slower than global memory in tiled matrix multiplication?

I have tiled matrix multiplication code, with and without shared memory. Below is the version that uses only global memory:

__global__
void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // Each thread accumulates the dot product of one row of M and
        // one column of N, reading both directly from global memory.
        for (int k = 0; k < Width; ++k)
        {
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        }

        P[Row * Width + Col] = Pvalue;
    }
}

Below is the version that uses shared memory:

__global__
void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // blockWidth must be a compile-time constant, e.g. #define blockWidth 16
    __shared__ float Mds[blockWidth][blockWidth];
    __shared__ float Nds[blockWidth][blockWidth];

    int tx = threadIdx.x; int ty = threadIdx.y;
    int bx = blockIdx.x;  int by = blockIdx.y;

    int row = by * blockWidth + ty;
    int col = bx * blockWidth + tx;
    float pvalue = 0;

    // Loop over the tiles of d_M and d_N needed to compute d_P[row][col].
    for (int m = 0; m < Width / blockWidth; ++m)
    {
        // Each thread loads one element of the M tile and one element of the N tile.
        Mds[ty][tx] = d_M[row * Width + m * blockWidth + tx];
        Nds[ty][tx] = d_N[(m * blockWidth + ty) * Width + col];
        __syncthreads();

        // Partial dot product using the tiles held in shared memory.
        for (int k = 0; k < blockWidth; ++k)
        {
            pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();
    }
    d_P[row * Width + col] = pvalue;
}

As far as I know, using shared memory should be faster, but when comparing these two codes I found that the version without shared memory runs about 2 seconds faster for 1600x1600 matrices. Is there any explanation for this speed difference, or is something wrong with my code?

My teacher uses the book "Programming Massively Parallel Processors" as the main text resource; these two codes come from it.

Edit:

Launch configuration for the kernel:

int NumBlocks = ceil(Width / (float)blockWidth);  // blockWidth = 16, Width = 1600
dim3 dimGrid(NumBlocks, NumBlocks, 1);
dim3 dimBlock(blockWidth, blockWidth, 1);

clock_t startGpuCalculation = clock();
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
cudaThreadSynchronize();  // wait for the kernel to finish before stopping the clock
clock_t endGpuCalculation = clock();
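
For reference, a minimal sketch of timing the same launch with CUDA events instead of clock() (it assumes the same d_M, d_N, d_P, dimGrid, dimBlock and Width as above):

// Sketch: event-based timing of the kernel launch above.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
cudaEventRecord(stop);

cudaEventSynchronize(stop);             // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);

Events measure GPU time only, so the result is less sensitive to host-side noise than wrapping the launch in clock().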

Upvotes: 1

Views: 1231

Answers (1)

Nasser Kurd

I was running the project in Debug mode (VS 2017 & CUDA 9). When I ran it in Release mode, shared memory was much faster than global memory. My bad. (Debug builds typically compile the device code with -G, which disables optimizations, so Debug-mode timings are not representative.)

Upvotes: 3
