Reputation: 564
This question is about C++ optimisation techniques. I have a matrix-vector multiplication with large dimensions and would like to reduce the runtime. I know that there are specialised libraries for linear algebra, but I would actually like to learn a bit about the underlying processor peculiarities. So far, I am compiling with /O2 (Microsoft) and I got the compiler to confirm that the inner loop of the multiplication is vectorised.
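(Side note, in case someone wants to reproduce the vectorisation report: MSVC can print per-loop vectoriser diagnostics, e.g.

cl /O2 /Qvec-report:2 example.cpp

where example.cpp stands for the source file below.)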
The example code is:
#include <stdio.h>
#include <ctime>
#include <iostream>
#define VEC_LENGTH 64
#define ITERATIONS 4000000
void gen_vector_matrix_multiplication(double *vec_result, double *vec_a, double *matrix_B, unsigned int cols_B, unsigned int rows_B)
{
    // initialise result vector
    for (unsigned int i = 0; i < rows_B; i++)
    {
        vec_result[i] = 0;
    }
    // perform multiplication, column by column (matrix_B is stored column-major,
    // so the inner loop walks contiguous memory)
    for (unsigned int j = 0; j < cols_B; j++)
    {
        const double entry = vec_a[j];
        const int col = j*rows_B;
        for (unsigned int i = 0; i < rows_B; i++)
        {
            vec_result[i] += entry * matrix_B[i + col];
        }
    }
}
int main()
{
    double *vec_a = new double[VEC_LENGTH];
    double *vec_result = new double[VEC_LENGTH];
    double *matrix_B = new double[VEC_LENGTH*VEC_LENGTH];

    // initialise inputs; reading them uninitialised would be undefined behaviour,
    // and stray NaNs or denormals could distort the timing
    for (unsigned int i = 0; i < VEC_LENGTH; i++)
    {
        vec_a[i] = 1.0;
    }
    for (unsigned int i = 0; i < VEC_LENGTH*VEC_LENGTH; i++)
    {
        matrix_B[i] = 1.0;
    }

    // start clock
    clock_t begin = clock();
    // this outer loop is just for test purposes so that the timing becomes meaningful
    for (unsigned int i = 0; i < ITERATIONS; i++)
    {
        gen_vector_matrix_multiplication(vec_result, vec_a, matrix_B, VEC_LENGTH, VEC_LENGTH);
    }
    // stop clock and report the time per matrix element
    double elapsed_time = static_cast<double>(clock() - begin) / CLOCKS_PER_SEC;
    std::cout << elapsed_time/(VEC_LENGTH*VEC_LENGTH) << std::endl;

    delete[] vec_a;
    delete[] vec_result;
    delete[] matrix_B;
    return 0;
}
The multiplication is done several times to get a reliable estimate of the runtime. I have measured the runtime for a number of different vector lengths (in this example there is only one value, N, which is the length of the vector and at the same time defines the size of the NxN matrix) and normalised the measured runtime to the number of elements.
You can see that for small enough N, the runtime per operation is constant. However, above N=512 the runtime jumps up. The difference between the blue and red data points is the load on the processor: if the example program is running pretty much alone, the runtime is given by the blue dots, and when the other cores are busy, the time is represented by the red dots.
I now have a couple of questions related to this:

1. Is the jump in runtime between N=512 and N=1024 related to the size of the L3 cache of my processor (Ivy Bridge i5-3570), which should be 6MB? 512*512*8 byte equals roughly 2MB and 1024*1024*8 byte is roughly 8MB, so beyond that point the matrix does not fit into the cache any more, and fetching data from RAM is the reason for the longer execution time?

2. What exactly is going on for N>1024?

I am curious to hear your thoughts. Thanks!
Upvotes: 4
Views: 838
Reputation: 11910
For normalization, I chose

elapsed_time/(VEC_LENGTH*VEC_LENGTH*ITERATIONS)

and it started at 6 nanoseconds and ended at 7 nanoseconds from N=64 to N=8192, with ITERATIONS=20. The only thing that stays cached in all cases is vec_a, so for large matrices the matrix elements have to be read from memory.
Memory bandwidth is around 20 GB/s, which means more than 2 G doubles per second (20 GB/s divided by 8 bytes per double is 2.5 G doubles/s). The core frequency is 3.7 GHz, so at most 3.7 G multiplications per second. The core can consume 3.7 G doubles per second, but memory feeds only about 2 G per second, so memory is the limiting factor.

Of course this counts only the 64-bit floating-point operation. There is also the address calculation i + col that must be done before the multiplication, so the two execute serially. Two instructions at 3.7 GHz mean nearly 1.8 G iterations per second effectively, which is close to the 2 G the memory delivers. So even when the cache does its job, the CPU core lacks compute power for this serial code.
The same thing happened when I unrolled the loop by 4: this cut the time in half! Now it is 3.4 nanoseconds per operation, but for all N values, because there are still 2 instructions (1 integer and 1 floating point) per loaded element that the CPU must execute. A sketch of the unrolling is shown below.
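For reference, a minimal sketch of what a 4-way unroll of the inner loop could look like, replacing the inner i loop of the original function (this assumes rows_B is a multiple of 4, which holds for the power-of-two sizes tested here; it is an illustration, not the exact code I benchmarked):

for (unsigned int i = 0; i < rows_B; i += 4)
{
    vec_result[i]     += entry * matrix_B[i     + col];
    vec_result[i + 1] += entry * matrix_B[i + 1 + col];
    vec_result[i + 2] += entry * matrix_B[i + 2 + col];
    vec_result[i + 3] += entry * matrix_B[i + 3 + col];
}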
Edit: Using all cores would saturate memory bandwidth and make the effect of the L3 cache more discernible; a sketch of one way to do that is below.
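A hypothetical sketch of a multi-core version using OpenMP (compile with -fopenmp or /openmp; this is not code from the benchmark above, just one straightforward way to split the work). Each thread owns a contiguous block of result rows, so no synchronisation is needed and each thread still streams through contiguous memory:

#include <omp.h>

void gen_vector_matrix_multiplication_mt(double *vec_result, const double *vec_a,
                                         const double *matrix_B,
                                         unsigned int cols_B, unsigned int rows_B)
{
    #pragma omp parallel
    {
        // each thread handles the contiguous row block [lo, hi),
        // so threads never write to the same elements of vec_result
        const unsigned int nthreads = omp_get_num_threads();
        const unsigned int tid = omp_get_thread_num();
        const unsigned int lo = rows_B * tid / nthreads;
        const unsigned int hi = rows_B * (tid + 1) / nthreads;

        for (unsigned int i = lo; i < hi; i++)
            vec_result[i] = 0;

        for (unsigned int j = 0; j < cols_B; j++)
        {
            const double entry = vec_a[j];
            const unsigned int col = j * rows_B;
            for (unsigned int i = lo; i < hi; i++)
                vec_result[i] += entry * matrix_B[i + col];
        }
    }
}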
Upvotes: 1
Reputation: 6194
An important aspect of optimizing such code is taking care of aliasing and vectorization, and your post suggests that you already took care of the latter. Quite often the compiler needs a bit of help. With GCC 5.3.0 the runtime drops considerably using the loop below. The __restrict__ qualifier tells the compiler that no aliasing is possible, and #pragma GCC ivdep tells GCC that it is safe to vectorize the loop. Furthermore, the compiler flags are extremely important as well. I compiled the code using g++ -O3 -march=native -mtune=native matrix_example.cxx.
void gen_vector_matrix_multiplication(double* const __restrict__ vec_result,
                                      const double* const __restrict__ vec_a,
                                      const double* const __restrict__ matrix_B,
                                      const int cols_B,
                                      const int rows_B)
{
    // initialise result vector
    #pragma GCC ivdep
    for (int i = 0; i < rows_B; i++)
        vec_result[i] = 0;

    // perform multiplication
    for (int j = 0; j < cols_B; j++)
    {
        const double entry = vec_a[j];
        const int col = j*rows_B;
        #pragma GCC ivdep
        for (int i = 0; i < rows_B; i++)
        {
            vec_result[i] += entry * matrix_B[i + col];
        }
    }
}
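Since the question uses the Microsoft compiler: MSVC has rough equivalents, the __restrict keyword on the pointer parameters and #pragma loop(ivdep) in front of a loop. A hedged sketch of the same idea there (I have not benchmarked this variant; treat it as a starting point):

void gen_vector_matrix_multiplication(double* const __restrict vec_result,
                                      const double* const __restrict vec_a,
                                      const double* const __restrict matrix_B,
                                      const int cols_B,
                                      const int rows_B)
{
    // initialise result vector
    #pragma loop(ivdep)
    for (int i = 0; i < rows_B; i++)
        vec_result[i] = 0;

    // perform multiplication
    for (int j = 0; j < cols_B; j++)
    {
        const double entry = vec_a[j];
        const int col = j*rows_B;
        #pragma loop(ivdep)
        for (int i = 0; i < rows_B; i++)
            vec_result[i] += entry * matrix_B[i + col];
    }
}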
Upvotes: 2