I timed a fairly naive BLAS-like matrix multiplication (DGEMM) function:
// Computes C = alpha * A * B + beta * C for row-major matrices with leading
// dimensions lda, ldb, ldc.
void dgemm_naive(const int M, const int N, const int K, const double alpha,
                 const double *A, const int lda, const double *B, const int ldb,
                 const double beta, double *C, const int ldc) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      double cij = beta * C[ldc * i + j];
      double abij{0.0};
      for (int k = 0; k < K; ++k) {
        abij += A[lda * i + k] * B[ldb * k + j];
      }
      C[ldc * i + j] = cij + alpha * abij;
    }
  }
}
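As a quick sanity check of the kernel (my own illustration, separate from the timed run below; it assumes the same row-major convention), a 2x2 example produces the expected 19 22 / 43 50:

#include <cstdio>

void dgemm_naive(const int M, const int N, const int K, const double alpha,
                 const double *A, const int lda, const double *B, const int ldb,
                 const double beta, double *C, const int ldc);

int main() {
  // Row-major 2x2 inputs; C = 1.0 * A * B + 0.0 * C.
  const double A[] = {1.0, 2.0, 3.0, 4.0};
  const double B[] = {5.0, 6.0, 7.0, 8.0};
  double C[] = {0.0, 0.0, 0.0, 0.0};
  dgemm_naive(2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
  printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]); // expected: 19 22 / 43 50
  return 0;
}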
With the following boilerplate and timer:
#include <chrono>
#include <cstdio>
#include <new>

void dgemm_naive(const int M, const int N, const int K, const double alpha,
                 const double *A, const int lda, const double *B, const int ldb,
                 const double beta, double *C, const int ldc);

template <typename T> inline T min(T x, T y) {
  return (((x) < (y)) ? (x) : (y));
}

void run_test(const int m, const int n, const int k, double alpha, double beta,
              void (*dgemm)(const int, const int, const int, const double,
                            const double *, const int, const double *,
                            const int, const double, double *, const int),
              const int niter) {
  double *A = new (std::nothrow) double[m * k];
  double *B = new (std::nothrow) double[k * n];
  double *C = new (std::nothrow) double[m * n];
  if (A == nullptr || B == nullptr ||
      C == nullptr) { // handle case where new returned null
    // Do error handling here
    printf("Could not allocate memory\n");
    return;
  }
  for (int i = 0; i < (m * k); i++) {
    A[i] = i + 1;
  }
  for (int i = 0; i < (k * n); i++) {
    B[i] = -i - 1;
  }
  for (int i = 0; i < (m * n); i++) {
    C[i] = 0.0;
  }
  auto t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < niter; i++) {
    dgemm(m, n, k, alpha, A, k, B, n, beta, C, n);
  }
  auto t2 = std::chrono::high_resolution_clock::now();
  auto ms_int = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
  // // To make sure computations don't get optimized out.
  // printf("C[0, 0] = %f\n", C[0]);
  // printf("C[M-1, N-1] = %f\n", C[(m-1)*n+(n-1)]);
  delete[] A;
  delete[] B;
  delete[] C;
  printf("Time: %ld ms\n", static_cast<long>(ms_int.count()));
}

int main() {
  int m = 1024, k = 1024, n = 1024;
  double alpha = 1.0, beta = 0.0;
  run_test(m, n, k, alpha, beta, dgemm_naive, 1);
  return 0;
}
When compiling with clang (18.1.3) on my AMD Ryzen 7 3800XT, I get the following timings:
-O0 Time: 30120 ms
-O1 Time: 54488 ms
I get similar results with gcc (13.3.0) and on an Intel laptop (an i7-8550U; similar relative difference, though slower overall). I would of course expect a better time with -O1.
Both Cachegrind and perf suggest that -O1 introduces an issue with cache misses. From studying the assembly and LLVM IR on Godbolt, I gather that the order of operations is not changed: mostly, the loop-invariant parts of the index calculations get hoisted out of the loops. At -O1, the only loads that remain are the three reads of C, A, and B, and the loop order stays the same (as is expected at -O1).
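For context on the access pattern I am describing: in the i-j-k order above, the innermost loop strides through B by ldb doubles, so every iteration of the k loop touches a different cache line of B. A loop-interchanged i-k-j variant (my own sketch for comparison, not something the compiler generates here) would make the innermost loop unit-stride over both B and C:

// i-k-j loop order (sketch, same row-major convention as dgemm_naive above).
// The innermost loop now reads B and updates C with stride 1.
void dgemm_ikj(const int M, const int N, const int K, const double alpha,
               const double *A, const int lda, const double *B, const int ldb,
               const double beta, double *C, const int ldc) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      C[ldc * i + j] *= beta; // apply beta once per output element
    }
    for (int k = 0; k < K; ++k) {
      const double aik = alpha * A[lda * i + k]; // reused across a whole row of B
      for (int j = 0; j < N; ++j) {
        C[ldc * i + j] += aik * B[ldb * k + j];
      }
    }
  }
}

But again, both the -O0 and the -O1 build execute the original i-j-k order, so the access pattern itself is identical between the two measurements.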
Of course there is a lot more (mainly parallelization, vectorization, and tiling) that can still be done to improve the code, and the issue presented here may very well vanish as the code is refined; a tiled variant is sketched below for reference. However, I am interested in understanding what happens here, since I am trying to improve my performance engineering skills.
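Something like the following is what I have in mind for the tiled variant (a sketch only; the block size BS = 64 is an arbitrary, untuned guess):

#include <algorithm>

// Cache-blocked (tiled) sketch: each BS x BS block of A, B and C is reused
// while it is likely still resident in cache.
void dgemm_blocked(const int M, const int N, const int K, const double alpha,
                   const double *A, const int lda, const double *B, const int ldb,
                   const double beta, double *C, const int ldc) {
  constexpr int BS = 64;
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      C[ldc * i + j] *= beta; // apply beta once up front
    }
  }
  for (int ii = 0; ii < M; ii += BS) {
    for (int kk = 0; kk < K; kk += BS) {
      for (int jj = 0; jj < N; jj += BS) {
        const int iend = std::min(ii + BS, M);
        const int kend = std::min(kk + BS, K);
        const int jend = std::min(jj + BS, N);
        for (int i = ii; i < iend; ++i) {
          for (int k = kk; k < kend; ++k) {
            const double aik = alpha * A[lda * i + k];
            for (int j = jj; j < jend; ++j) {
              C[ldc * i + j] += aik * B[ldb * k + j];
            }
          }
        }
      }
    }
  }
}

It has the same signature as dgemm_naive, so run_test can call it through the same function pointer.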
So I am interested in answers to: why is the -O1 build slower than the -O0 build here, and where do the additional cache misses come from?
Edit: the count of the chrono duration happens to be a long for me but not for someone in the comments, hence the static_cast<long> in the printf above. Luckily, it is clear from the wall-clock time that the timing is actually correct and not an overflow ;)