user3331951
user3331951

Reputation: 29

SpeedUp Matrix Multiplication

I have to do a matrix boolean multiplication of a matrix with itself in a C++ program and I want to optimize it. The matrix is symmetric so I think to do a row by row multiplication to reduce cache misses. I allocated space for matrix in this way:

matrix=new bool*[dimension];
for (i=0; i<dimension; i++) {
    matrix[i]=new bool[dimension];
}

And the multiplication is the following:

for (m=0; m<dimension; m++) {
    for (n=0; n<dimension; n++) {
         for (k=0; k<dimension; k++) {
                temp=mat[m][k] && mat[n][k];
                B[m][n]= B[m][n] || temp;
...

I did some test of computation time with this version and with another version whit a row by column multiplication like this

for (m=0; m<dimension; m++) {
    for (n=0; n<dimension; n++) {
         for (k=0; k<dimension; k++) {
                temp=mat[m][k] && mat[k][n];
                B[m][n]= B[m][n] || temp;
...

I did tests on a 1000x1000 matrix The result showed that the second version ( row by column ) is faster the previous one. Could you show me why? Shouldn't The misses in the first algorithm be less ?

Upvotes: 1

Views: 377

Answers (2)

user1225999
user1225999

Reputation: 920

In the first multiplication approach the rows of the boolean matrices are stored consecutively in memory and also accessed consecutively so that prefetching works flawlessly. In the second approach the cacheline fetched when accessing the element (n,0) can already be evicted when accessing (n+1,0). Whether this actually happens depends on the architecture and its cache hierarchy properties you run your code on. On my machine the first approach is indeed faster for large enough matrices.

As for speeding up the computations: Do not use logical operators since they are evaluated lazy and thus branch misprediction can occur. The inner loop can be exited early as soon as B[m][n] becomes true. Instead of using booleans you might want to consider using the bits of say integers. That way you can combine 32 or 64 elements in your inner loop at once and possibly use vectorization. If your matrices are rather sparse then you might want to consider switching to sparse matrix data structures. Also changing the order of the loops can help as well as introducing blocking. However, any performance optimization is specific to an architecture and class of input matrices.

Upvotes: 1

Yotam
Yotam

Reputation: 888

Speeding suggestion. In the inner loop:

Bmn = false;
for (k=0; k<dimension; k++) {
   if ((Bmn = mat[m][k] && mat[k][n])) {
       k = dimension; // exit for-k loop
   }
} 
B[m][n]= Bmn

Upvotes: 1

Related Questions