I timed a fairly naive BLAS-like matrix multiplication (DGEMM) function:
// Computes C = alpha * A * B + beta * C for row-major matrices with leading
// dimensions lda, ldb, ldc.
void dgemm_naive(const int M, const int N, const int K, const double alpha,
                 const double *A, const int lda, const double *B, const int ldb,
                 const double beta, double *C, const int ldc) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      double cij = beta * C[ldc * i + j];
      double abij{0.0};
      for (int k = 0; k < K; ++k) {
        abij += A[lda * i + k] * B[ldb * k + j];
      }
      C[ldc * i + j] = cij + alpha * abij;
    }
  }
}
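As a quick sanity check of the kernel (my own illustration, separate from the timed run below; it assumes the same row-major convention), a 2x2 example produces the expected 19 22 / 43 50:

#include <cstdio>

void dgemm_naive(const int M, const int N, const int K, const double alpha,
                 const double *A, const int lda, const double *B, const int ldb,
                 const double beta, double *C, const int ldc);

int main() {
  // Row-major 2x2 inputs; C = 1.0 * A * B + 0.0 * C.
  const double A[] = {1.0, 2.0, 3.0, 4.0};
  const double B[] = {5.0, 6.0, 7.0, 8.0};
  double C[] = {0.0, 0.0, 0.0, 0.0};
  dgemm_naive(2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
  printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]); // expected: 19 22 / 43 50
  return 0;
}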
With the following boilerplate and timer:
#include <chrono>
#include <cstdio>
#include <new>

void dgemm_naive(const int M, const int N, const int K, const double alpha,
                 const double *A, const int lda, const double *B, const int ldb,
                 const double beta, double *C, const int ldc);

template <typename T> inline T min(T x, T y) {
  return (((x) < (y)) ? (x) : (y));
}

void run_test(const int m, const int n, const int k, double alpha, double beta,
              void (*dgemm)(const int, const int, const int, const double,
                            const double *, const int, const double *,
                            const int, const double, double *, const int),
              const int niter) {
  double *A = new (std::nothrow) double[m * k];
  double *B = new (std::nothrow) double[k * n];
  double *C = new (std::nothrow) double[m * n];
  if (A == nullptr || B == nullptr ||
      C == nullptr) { // handle case where new returned null
    // Do error handling here
    printf("Could not allocate memory\n");
    return;
  }
  for (int i = 0; i < (m * k); i++) {
    A[i] = i + 1;
  }
  for (int i = 0; i < (k * n); i++) {
    B[i] = -i - 1;
  }
  for (int i = 0; i < (m * n); i++) {
    C[i] = 0.0;
  }
  auto t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < niter; i++) {
    dgemm(m, n, k, alpha, A, k, B, n, beta, C, n);
  }
  auto t2 = std::chrono::high_resolution_clock::now();
  auto ms_int = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
  // // To make sure computations don't get optimized out.
  // printf("C[0, 0] = %f\n", C[0]);
  // printf("C[M-1, N-1] = %f\n", C[(m-1)*n+(n-1)]);
  delete[] A;
  delete[] B;
  delete[] C;
  printf("Time: %ld ms\n", static_cast<long>(ms_int.count()));
}

int main() {
  int m = 1024, k = 1024, n = 1024;
  double alpha = 1.0, beta = 0.0;
  run_test(m, n, k, alpha, beta, dgemm_naive, 1);
  return 0;
}
When compiling with clang (18.1.3) on my AMD Ryzen 7 3800XT, I get the following timings:
-O0 Time: 30120 ms
-O1 Time: 54488 ms
I get similar results with gcc (13.3.0) and on an Intel laptop (an i7-8550U; similar relative difference, though slower overall). I would of course expect a better time with -O1.
Both Cachegrind and perf suggest that -O1 introduces an issue with cache misses. From studying the assembly and LLVM IR on Godbolt, I gather that the order of operations is not changed: mostly, the loop-invariant parts of the index calculations get hoisted out of the loops. At -O1, the only loads that remain are the three reads of C, A, and B, and the loop order stays the same (as is expected at -O1).
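For context on the access pattern I am describing: in the i-j-k order above, the innermost loop strides through B by ldb doubles, so every iteration of the k loop touches a different cache line of B. A loop-interchanged i-k-j variant (my own sketch for comparison, not something the compiler generates here) would make the innermost loop unit-stride over both B and C:

// i-k-j loop order (sketch, same row-major convention as dgemm_naive above).
// The innermost loop now reads B and updates C with stride 1.
void dgemm_ikj(const int M, const int N, const int K, const double alpha,
               const double *A, const int lda, const double *B, const int ldb,
               const double beta, double *C, const int ldc) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      C[ldc * i + j] *= beta; // apply beta once per output element
    }
    for (int k = 0; k < K; ++k) {
      const double aik = alpha * A[lda * i + k]; // reused across a whole row of B
      for (int j = 0; j < N; ++j) {
        C[ldc * i + j] += aik * B[ldb * k + j];
      }
    }
  }
}

But again, both the -O0 and the -O1 build execute the original i-j-k order, so the access pattern itself is identical between the two measurements.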
Of course there is a lot more (mainly parallelization, vectorization, and tiling) that can still be done to improve the code, and the issue presented here may very well vanish as the code is refined; a tiled variant is sketched below for reference. However, I am interested in understanding what happens here, since I am trying to improve my performance engineering skills.
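Something like the following is what I have in mind for the tiled variant (a sketch only; the block size BS = 64 is an arbitrary, untuned guess):

#include <algorithm>

// Cache-blocked (tiled) sketch: each BS x BS block of A, B and C is reused
// while it is likely still resident in cache.
void dgemm_blocked(const int M, const int N, const int K, const double alpha,
                   const double *A, const int lda, const double *B, const int ldb,
                   const double beta, double *C, const int ldc) {
  constexpr int BS = 64;
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      C[ldc * i + j] *= beta; // apply beta once up front
    }
  }
  for (int ii = 0; ii < M; ii += BS) {
    for (int kk = 0; kk < K; kk += BS) {
      for (int jj = 0; jj < N; jj += BS) {
        const int iend = std::min(ii + BS, M);
        const int kend = std::min(kk + BS, K);
        const int jend = std::min(jj + BS, N);
        for (int i = ii; i < iend; ++i) {
          for (int k = kk; k < kend; ++k) {
            const double aik = alpha * A[lda * i + k];
            for (int j = jj; j < jend; ++j) {
              C[ldc * i + j] += aik * B[ldb * k + j];
            }
          }
        }
      }
    }
  }
}

It has the same signature as dgemm_naive, so run_test can call it through the same function pointer.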
So I am interested in answers to: why is the -O1 build slower than the -O0 build here, and where do the additional cache misses come from?
Edit: the count of the chrono duration happens to be a long for me but not for someone in the comments, hence the static_cast<long> in the printf above. Luckily, it is clear from the wall-clock time that the timing is actually correct and not an overflow ;)