user1472972
user1472972

Reputation: 105

OpenMP matrix multiplication nested loops

This is matrix multiplication code in two versions: one with the outer i loop parallelized and one with the inner j loop parallelized. In both versions the values in the C array are correct (I have tested with small matrix sizes), and there is no performance improvement of one over the other.

Can anyone please tell me what the difference between these two versions is? Will the array C be correct in both versions regardless of the size of the matrix? Thanks in advance

void mat_multiply ( void )
{
    int t;
    int i, j, k;    
    #pragma omp parallel for private(k) // parallelize i loop
    for(i = 0; i < dimension; i++)
    {
        for(j = 0; j < dimension; j++) 
        {
            for(k = 0; k < dimension; k++)
            {
                C[dimension*i+j] += A[dimension*i+k] *  B[dimension*k+j];       
            }
        }
    }
 }

 void mat_multiply ( void )
 {
     int t;
     int i, j, k;   

     for(i = 0; i < dimension; i++)
     {
         #pragma omp parallel for private(k) // parallelize j loop
         for(j = 0; j < dimension; j++) 
         {
             for(k = 0; k < dimension; k++)
             {
                 C[dimension*i+j] += A[dimension*i+k] *  B[dimension*k+j];      
             }
         }
     }
 }

Upvotes: 3

Views: 5170

Answers (1)

dreamcrash
dreamcrash

Reputation: 51393

At first glance, the first version seems to have lower thread-creation overhead, since it creates the threads only once, while the second version appears to create them dimension times.

But according to this

One may be worried about the creation of new threads within the inner loop. Worry not, the libgomp in GCC is smart enough to actually create the threads only once. Once the team has done its work, the threads are returned into a "dock", waiting for new work to do.

In other words, the number of times the clone system call is executed is exactly equal to the maximum number of concurrent threads. The parallel directive is not the same as a combination of pthread_create and pthread_join.

In the first version, you should also guarantee that the variable j is private. Since j is declared outside the parallel region, it is shared by default, so the threads race on it and C can end up wrong for larger matrices (k has the same problem, which is why you already marked it private).

Instead of having two approaches, you can have a single one in which the nested loops are parallelized. Since OpenMP 3.0, the parallelization of nested loops can be handled by the collapse clause of the for directive, namely:

void mat_multiply ( void )
{
    #pragma omp parallel for collapse(2)
    for(int i = 0; i < dimension; i++)
        for(int j = 0; j < dimension; j++)
            for(int k = 0; k < dimension; k++)
                C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
}

Btw: have a look at a block (tiled) approach; you can see an example here (starting at slide 62).

Upvotes: 5
