Reducing Shared Memory Bank Conflicts

Question

Nvprof reported that there are about 200 milion shared_ld_bank_conflict and some shared_st_bank_conflict in my sgemm kernel. I tried the padding trick __shared__ float smem[SIZE + OFFSET];, it reduced store bank conflicts to 0, but load bank conflicts are still there. I don't know how to further improve it.

__global__ void sgemm(
  const float* __restrict__ A,
  const float* __restrict__ B,
  float* __restrict__ C,
  int M, int N, int K
){
  int tid = threadIdx.x;
  int gStartx = blockIdx.x * 128;
  int gStarty = blockIdx.y * 128;

  int dx = tid % 8;
  int dy = tid / 8;
  int vx = tid % 16;
  int vy = tid / 16;

  __shared__ volatile float aSM[8][128+4];
  __shared__ volatile float bSM[8][128+4];
  float aBuffer1[4];
  float bBuffer1[4];
  float aBuffer2[4];
  float bBuffer2[4];

  float cCache[8][8];
#pragma unroll
  for (int i=0; i<8; i++) 
#pragma unroll
    for (int j=0; j<8; j++)
      cCache[i][j] = 0.f;

//load first two tiles
#pragma unroll
  for (int i=0; i<4; i++){
    aBuffer1[i] = A[(gStarty + dy + i*32)*K + (dx)];
    bBuffer1[i] = B[(gStartx + dy + i*32)*K + (dx)];
  }
  int nIt = (K + 8 - 1) / 8;
#pragma unroll
  for (int itr=0; itr


A (2048x2048) matrix is row major, B (2048x2048) is column major, each block has 256 threads, each block calculates 128x128 portion of C, and each thread calculates 8x8x8. the gpu is Tesla P100.

Reducing Shared Memory Bank Conflicts

Answers (1)

Related Questions