JAN
JAN

Reputation: 21865

Loop unrolling for multiplying two matrices NxN?

I'm trying to figure out a good Loop unrolling for multiplying two matrices .

For example if we wanted to Sum a NxN matrix :

void SumMatrix(int *M, int n, int *result) 
{ 
  int  i,j; 

  *result = 0; 
  for (i=0; i<n; i++) 
    for (j=0; j<n; j++) 
      *result += M[j][i]; 
}

We can do this :

void SumMatrix(int *M, int n, int *result) 
{ 
    int  i; 
    int  size = n*n; 
    int  last = size%8; 
    int  acc1 = 0; 
    int  acc2 = 0; 
    int  *pEnd = M+size-last; 

    for (; M<pEnd; M+=8) 
    { 
      acc1 += (*M + *(M+1)) + (*(M+2) + *(M+3));
      acc2 += (*(M+4) + *(M+5)) + (*(M+6) + *(M+7));
    } 

    /* adding the last entries */ 
    while (last--)  
        acc1 += *(M++); 

    *result = acc1+acc2;        
}

But I've tried to find a (GOOD) way to multiply 2 matrices , however found none at the moment .

Remark : this is no homework task , I have an exam today and just thought about this question , I think it could be a fine question for an exam , don't you ?

I'd appreciate any help

Regards

Ron

Upvotes: 0

Views: 4296

Answers (2)

parallelgeek
parallelgeek

Reputation: 448

Look how complex the ATLAS project is, which provides an optimized version of a BLAS library (based primarily on matrix multiplication). It should consider not only thread-level parallelism, but the memory hierarchy (not only unrolling, but cache tiling and register tiling, software pipelining and so forth). It's usually written by hands or optimized by the "auto-tuner", like ATLAS. In case you want to unravel thread-level parallelism, you'd better to use a "tiled algorithm" and spread the resulting tiles computation between your threads.

Upvotes: 0

Vanwaril
Vanwaril

Reputation: 7548

Most compilers will do the unrolling for you (you might need to turn on a flag, or set it to an optimization level - I believe -funroll-loops does it for gcc).

Also, with your question, the fact that it is a 2D matrix doesn't matter, since you are adding all the numbers up. If you are limited to a single process/thread, adding the numbers up sequentially will be the fastest because that has optimal caching performance. You might get some benefit out of SSE or vector instructions; again, today's compilers can do these for you with such a simple problem.

Upvotes: 2

Related Questions