Reputation: 564
This question is about C++ optimisation techniques. I have a matrix-vector multiplication with large dimensions and would like to reduce the runtime. I know that there are specialised libraries for linear algebra, but I would actually like to learn a bit about the underlying processor peculiarities. So far, I am compiling with /O2 (Microsoft) and I got the compiler to confirm that the inner loop of the multiplication is vectorised.
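(Side note, in case someone wants to reproduce the vectorisation report: MSVC can print per-loop vectoriser diagnostics, e.g.

cl /O2 /Qvec-report:2 example.cpp

where example.cpp stands for the source file below.)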
The example code is:
#include <stdio.h>
#include <ctime>
#include <iostream>
#define VEC_LENGTH 64
#define ITERATIONS 4000000
void gen_vector_matrix_multiplication(double *vec_result, double *vec_a, double *matrix_B, unsigned int cols_B, unsigned int rows_B)
{
    // initialise result vector
    for (unsigned int i = 0; i < rows_B; i++)
    {
        vec_result[i] = 0;
    }
    // perform multiplication, column by column (matrix_B is stored column-major,
    // so the inner loop walks contiguous memory)
    for (unsigned int j = 0; j < cols_B; j++)
    {
        const double entry = vec_a[j];
        const int col = j*rows_B;
        for (unsigned int i = 0; i < rows_B; i++)
        {
            vec_result[i] += entry * matrix_B[i + col];
        }
    }
}
int main()
{
    double *vec_a = new double[VEC_LENGTH];
    double *vec_result = new double[VEC_LENGTH];
    double *matrix_B = new double[VEC_LENGTH*VEC_LENGTH];

    // initialise inputs; reading them uninitialised would be undefined behaviour,
    // and stray NaNs or denormals could distort the timing
    for (unsigned int i = 0; i < VEC_LENGTH; i++)
    {
        vec_a[i] = 1.0;
    }
    for (unsigned int i = 0; i < VEC_LENGTH*VEC_LENGTH; i++)
    {
        matrix_B[i] = 1.0;
    }

    // start clock
    clock_t begin = clock();
    // this outer loop is just for test purposes so that the timing becomes meaningful
    for (unsigned int i = 0; i < ITERATIONS; i++)
    {
        gen_vector_matrix_multiplication(vec_result, vec_a, matrix_B, VEC_LENGTH, VEC_LENGTH);
    }
    // stop clock and report the time per matrix element
    double elapsed_time = static_cast<double>(clock() - begin) / CLOCKS_PER_SEC;
    std::cout << elapsed_time/(VEC_LENGTH*VEC_LENGTH) << std::endl;

    delete[] vec_a;
    delete[] vec_result;
    delete[] matrix_B;
    return 0;
}
The multiplication is done several times to get a reliable estimate of the runtime. I have measured the runtime for a number of different vector lengths (in this example there is only one value, N, which is the length of the vector and at the same time defines the size of the NxN matrix) and normalised the measured runtime to the number of elements.
You can see that for small enough N, the runtime per operation is constant. However, above N=512 the runtime jumps up. The difference between the blue and red data points is the load on the processor: if the example program is running pretty much alone, the runtime is given by the blue dots, and when the other cores are busy, the time is represented by the red dots.
I now have a couple of questions related to this:

1. Is the jump in runtime between N=512 and N=1024 related to the size of the L3 cache of my processor (Ivy Bridge i5-3570), which should be 6MB? 512*512*8 byte equals roughly 2MB and 1024*1024*8 byte is roughly 8MB, so beyond that point the matrix does not fit into the cache any more, and fetching data from RAM is the reason for the longer execution time?

2. What exactly is going on for N>1024?

I am curious to hear your thoughts. Thanks!
Upvotes: 4
Views: 838
Reputation: 11910
For normalization, I chose

elapsed_time/(VEC_LENGTH*VEC_LENGTH*ITERATIONS)

and it started at 6 nanoseconds and ended at 7 nanoseconds from N=64 to N=8192, with ITERATIONS=20. The only thing that stays cached in all cases is vec_a, so for large matrices the matrix elements have to be read from memory.
Memory bandwidth is around 20 GB/s, which means more than 2 G doubles per second (20 GB/s divided by 8 bytes per double is 2.5 G doubles/s). The core frequency is 3.7 GHz, so at most 3.7 G multiplications per second. The core can consume 3.7 G doubles per second, but memory feeds only about 2 G per second, so memory is the limiting factor.

Of course this counts only the 64-bit floating-point operation. There is also the address calculation i + col that must be done before the multiplication, so the two execute serially. Two instructions at 3.7 GHz mean nearly 1.8 G iterations per second effectively, which is close to the 2 G the memory delivers. So even when the cache does its job, the CPU core lacks compute power for this serial code.
The same thing happened when I unrolled the loop by 4: this cut the time in half! Now it is 3.4 nanoseconds per operation, but for all N values, because there are still 2 instructions (1 integer and 1 floating point) per loaded element that the CPU must execute. A sketch of the unrolling is shown below.
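For reference, a minimal sketch of what a 4-way unroll of the inner loop could look like, replacing the inner i loop of the original function (this assumes rows_B is a multiple of 4, which holds for the power-of-two sizes tested here; it is an illustration, not the exact code I benchmarked):

for (unsigned int i = 0; i < rows_B; i += 4)
{
    vec_result[i]     += entry * matrix_B[i     + col];
    vec_result[i + 1] += entry * matrix_B[i + 1 + col];
    vec_result[i + 2] += entry * matrix_B[i + 2 + col];
    vec_result[i + 3] += entry * matrix_B[i + 3 + col];
}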
Edit: Using all cores would saturate memory bandwidth and make the effect of the L3 cache more discernible; a sketch of one way to do that is below.
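A hypothetical sketch of a multi-core version using OpenMP (compile with -fopenmp or /openmp; this is not code from the benchmark above, just one straightforward way to split the work). Each thread owns a contiguous block of result rows, so no synchronisation is needed and each thread still streams through contiguous memory:

#include <omp.h>

void gen_vector_matrix_multiplication_mt(double *vec_result, const double *vec_a,
                                         const double *matrix_B,
                                         unsigned int cols_B, unsigned int rows_B)
{
    #pragma omp parallel
    {
        // each thread handles the contiguous row block [lo, hi),
        // so threads never write to the same elements of vec_result
        const unsigned int nthreads = omp_get_num_threads();
        const unsigned int tid = omp_get_thread_num();
        const unsigned int lo = rows_B * tid / nthreads;
        const unsigned int hi = rows_B * (tid + 1) / nthreads;

        for (unsigned int i = lo; i < hi; i++)
            vec_result[i] = 0;

        for (unsigned int j = 0; j < cols_B; j++)
        {
            const double entry = vec_a[j];
            const unsigned int col = j * rows_B;
            for (unsigned int i = lo; i < hi; i++)
                vec_result[i] += entry * matrix_B[i + col];
        }
    }
}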
Upvotes: 1
Reputation: 6194
An important aspect of optimizing such code is taking care of aliasing and vectorization, and your post suggests that you already took care of the latter. Quite often the compiler needs a bit of help. With GCC 5.3.0 the runtime drops considerably using the loop below. The __restrict__ qualifier tells the compiler that no aliasing is possible, and #pragma GCC ivdep tells GCC that it is safe to vectorize the loop. Furthermore, the compiler flags are extremely important as well. I compiled the code using g++ -O3 -march=native -mtune=native matrix_example.cxx.
void gen_vector_matrix_multiplication(double* const __restrict__ vec_result,
                                      const double* const __restrict__ vec_a,
                                      const double* const __restrict__ matrix_B,
                                      const int cols_B,
                                      const int rows_B)
{
    // initialise result vector
    #pragma GCC ivdep
    for (int i = 0; i < rows_B; i++)
        vec_result[i] = 0;

    // perform multiplication
    for (int j = 0; j < cols_B; j++)
    {
        const double entry = vec_a[j];
        const int col = j*rows_B;
        #pragma GCC ivdep
        for (int i = 0; i < rows_B; i++)
        {
            vec_result[i] += entry * matrix_B[i + col];
        }
    }
}
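Since the question uses the Microsoft compiler: MSVC has rough equivalents, the __restrict keyword on the pointer parameters and #pragma loop(ivdep) in front of a loop. A hedged sketch of the same idea there (I have not benchmarked this variant; treat it as a starting point):

void gen_vector_matrix_multiplication(double* const __restrict vec_result,
                                      const double* const __restrict vec_a,
                                      const double* const __restrict matrix_B,
                                      const int cols_B,
                                      const int rows_B)
{
    // initialise result vector
    #pragma loop(ivdep)
    for (int i = 0; i < rows_B; i++)
        vec_result[i] = 0;

    // perform multiplication
    for (int j = 0; j < cols_B; j++)
    {
        const double entry = vec_a[j];
        const int col = j*rows_B;
        #pragma loop(ivdep)
        for (int i = 0; i < rows_B; i++)
            vec_result[i] += entry * matrix_B[i + col];
    }
}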
Upvotes: 2