Reputation: 187

OpenMP for matrix multiplication

I am new to OpenMP and am trying hard to learn it. I wrote an example in C++ in Visual Studio 2012 that implements matrix multiplication. I was hoping someone with OpenMP experience could take a look at this code and help me get the best possible speed / parallelization out of it:

#include <iostream>
#include <cstdio>   // printf
#include <stdlib.h> // rand, system
#include <omp.h>
#include <random>
using namespace std;

#define NUM_THREADS 4

// Program Variables
double**        A;
double**        B;
double**        C;
double          t_Start;
double          t_Stop;
int             Am;
int             An;
int             Bm;
int             Bn;

// Program Functions
void            Get_Matrix();
void            Mat_Mult_Serial();
void            Mat_Mult_Parallel();
void            Delete_Matrix();


int main()
{
    printf("Matrix Multiplication Program\n\n");
    cout << "Enter Size of Matrix A: ";
    cin >> Am >> An;
    cout << "Enter Size of Matrix B: ";
    cin >> Bm >> Bn;

    Get_Matrix();
    Mat_Mult_Serial();
    Mat_Mult_Parallel();


    system("pause");
    return 0;

}


void Get_Matrix()
{
    A = new double*[Am];
    B = new double*[Bm];
    C = new double*[Am];
    for ( int i=0; i<Am; i++ ){A[i] = new double[An];}
    for ( int i=0; i<Bm; i++ ){B[i] = new double[Bn];}
    for ( int i=0; i<Am; i++ ){C[i] = new double[Bn](); } // zero-initialize: C is accumulated into with +=

    for ( int i=0; i<Am; i++ )
    {
         for ( int j=0; j<An; j++ )
         {
             A[i][j]= rand() % 10 + 1;
         }
    }

    for ( int i=0; i<Bm; i++ )
    {
        for ( int j=0; j<Bn; j++ )
        {
            B[i][j]= rand() % 10 + 1;
        }
    }
    printf("Matrix Create Complete.\n");
}


void Mat_Mult_Serial()
{
    t_Start = omp_get_wtime();
    for ( int i=0; i<Am; i++ )
    {
        for ( int j=0; j<Bn; j++ )
        {
            double temp = 0;
            for ( int k=0; k<An; k++ )
            {
                temp += A[i][k]*B[k][j];
            }
        }
    }
    t_Stop = omp_get_wtime() - t_Start;
    cout << "Serial Multiplication Time: " << t_Stop << " seconds" << endl;
}


void Mat_Mult_Parallel()
{
    int i,j,k;
    t_Start = omp_get_wtime();

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for private(i,j,k) schedule(dynamic)
    for ( i=0; i<Am; i++ )
    {
        for ( j=0; j<Bn; j++ )
        {
            //double temp = 0;
            for ( k=0; k<An; k++ )
            {
                C[i][j] += A[i][k]*B[k][j];
            }
        }
    }

    t_Stop = omp_get_wtime() - t_Start;
    cout << "Parallel Multiplication Time: " << t_Stop << " seconds." << endl;
}


void Delete_Matrix()
{
    for ( int i=0; i<Am; i++ ){ delete [] A[i]; }
    for ( int i=0; i<Bm; i++ ){ delete [] B[i]; }
    for ( int i=0; i<Am; i++ ){ delete [] C[i]; }

    delete [] A;
    delete [] B;
    delete [] C;
}

Upvotes: 1

Answers (2)

coincoin

Reputation: 4685

My examples are based on a matrix class I created for teaching parallel programming. If you are interested, feel free to contact me. There are several ways to speed up your matrix multiplication:

Storage

Use a one-dimensional array in row-major order, so that the elements are contiguous in memory and faster to access.
You can access A(i,j) with A[i * An + j].
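As a rough illustration of the storage idea above, here is a minimal sketch of a row-major matrix backed by a single std::vector. The name FlatMatrix and its members are my own placeholders, not the matrix class mentioned in this answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not from the original post): a dense matrix
// stored as one contiguous, zero-initialized block in row-major order.
struct FlatMatrix {
    int rows;
    int cols;
    std::vector<double> data;  // rows * cols elements

    FlatMatrix(int r, int c)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, 0.0) {}

    // Element (i, j) lives at offset i * cols + j.
    double& operator()(int i, int j) {
        assert(i >= 0 && i < rows && j >= 0 && j < cols);
        return data[static_cast<std::size_t>(i) * cols + j];
    }

    double operator()(int i, int j) const {
        assert(i >= 0 && i < rows && j >= 0 && j < cols);
        return data[static_cast<std::size_t>(i) * cols + j];
    }
};
```

Compared with the double** layout in the question, this needs one allocation instead of one per row, and consecutive elements of a row sit next to each other in memory.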

Use loop invariant optimization

for (int i = 0; i < m; i ++)
    for (int j = 0; j < p; j ++)
    {
        Scalar sigma = C(i, j);
        for (int k = 0; k < n; k ++)
            sigma += (*this)(i, k) * B(k, j);
        C(i, j) = sigma;
    }

This avoids re-reading and re-writing C(i, j) on every iteration of the innermost loop.

Change loop order "for j <-> for k"

for (int i = 0; i < m; i ++)
    for (int k = 0; k < n; k ++)
    {
        Scalar Aik = (*this)(i, k);
        for (int j = 0; j < p; j ++)
            C(i, j) += Aik * B(k, j);
    }

This improves spatial data locality: the innermost loop now walks along row i of C and row k of B, which are contiguous in row-major storage.

Use loop blocking/tiling

for(int ii = 0; ii < m; ii += block_size)
    for(int jj = 0; jj < p; jj += block_size)
        for(int kk = 0; kk < n; kk += block_size)
            #pragma omp parallel for // I think this is the best place for this case
            for(int i = ii; i < ii + block_size; i ++)
                for(int k = kk; k < kk + block_size; k ++)
                {
                    Scalar Aik = (*this)(i, k);
                    for(int j = jj; j < jj + block_size; j ++)
                        C(i, j) +=  Aik * B(k, j);
                }

This gives better temporal data locality. The optimal block_size depends on your architecture and matrix size. As written, the loops assume that m, n and p are multiples of block_size.
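The tiled loops above index up to ii + block_size, so they only work when the matrix dimensions are multiples of block_size. Here is a minimal serial sketch that clamps each tile with std::min. The function name multiply_tiled and the flat std::vector layout are assumptions of mine for illustration, and the OpenMP pragma is left out to keep the focus on the tile bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free function (not the matrix class from this answer):
// C (m x p) += A (m x n) * B (n x p), all stored row-major in flat vectors.
void multiply_tiled(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int m, int n, int p, int block_size)
{
    assert(block_size > 0);
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(B.size() == static_cast<std::size_t>(n) * p);
    assert(C.size() == static_cast<std::size_t>(m) * p);

    for (int ii = 0; ii < m; ii += block_size) {
        const int i_end = std::min(ii + block_size, m);  // clamp last row tile
        for (int jj = 0; jj < p; jj += block_size) {
            const int j_end = std::min(jj + block_size, p);
            for (int kk = 0; kk < n; kk += block_size) {
                const int k_end = std::min(kk + block_size, n);
                for (int i = ii; i < i_end; ++i) {
                    const std::size_t a_row = static_cast<std::size_t>(i) * n;
                    const std::size_t c_row = static_cast<std::size_t>(i) * p;
                    for (int k = kk; k < k_end; ++k) {
                        const double Aik = A[a_row + k];
                        const std::size_t b_row = static_cast<std::size_t>(k) * p;
                        for (int j = jj; j < j_end; ++j)
                            C[c_row + j] += Aik * B[b_row + j];
                    }
                }
            }
        }
    }
}
```

Note that C must be zeroed by the caller before the call, since the function accumulates into it.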

Then parallelize!

Generally, the #pragma omp parallel for should go on the outermost loop. Parallelizing the two outermost loops may give better results. It depends on the architecture you use, the matrix size... You have to test! Since matrix multiplication has a static workload, I would use a static schedule.
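To make the advice above concrete, here is a minimal sketch with the pragma on the outermost loop and a static schedule. The function name multiply_parallel and the flat row-major std::vector layout are assumptions for illustration. It needs OpenMP enabled at compile time (/openmp in Visual Studio, -fopenmp with GCC or Clang); without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free function: C (m x p) = A (m x n) * B (n x p),
// all stored row-major in flat vectors.
void multiply_parallel(const std::vector<double>& A, const std::vector<double>& B,
                       std::vector<double>& C, int m, int n, int p)
{
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(B.size() == static_cast<std::size_t>(n) * p);
    assert(C.size() == static_cast<std::size_t>(m) * p);

    // Iteration i writes only row i of C, so rows can be handed out to
    // threads without any synchronization. Every row costs the same, hence
    // schedule(static). Variables declared inside the loop body are private
    // to each thread automatically.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        const std::size_t a_row = static_cast<std::size_t>(i) * n;
        const std::size_t c_row = static_cast<std::size_t>(i) * p;
        for (int j = 0; j < p; ++j) {
            double sigma = 0.0;  // accumulate in a local, store once
            for (int k = 0; k < n; ++k)
                sigma += A[a_row + k] * B[static_cast<std::size_t>(k) * p + j];
            C[c_row + j] = sigma;
        }
    }
}
```

Because sigma starts at zero and is stored with a plain assignment, C does not need to be initialized before the call.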

Moar optimization!

You can apply loop nest optimization. You can vectorize your code. You can take a look at how BLAS implementations do it.

Upvotes: 3

Catherine

Reputation: 1

I am very new to OpenMP and this code is very instructive. However, I found an error in the serial version that gives it an unfair speed advantage over the parallel version.

Instead of writing C[i][j] += A[i][k]*B[k][j]; as you do in the parallel version, you have written temp += A[i][k]*B[k][j]; in the serial version. This is much faster (but doesn't help you compute the C matrix). So you're not comparing apples to apples, which makes the parallel code seem slower by comparison. When I fixed this line and ran it on my laptop (which allows 2 threads), the parallel version was almost twice as fast. Not bad!
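For reference, here is a sketch of what the corrected serial loop could look like. It takes the matrices as parameters instead of using the globals from the question, so the name mat_mult_serial and its signature are my own assumptions. It also resets each C[i][j] before accumulating, since the question's code does not initialize C.

```cpp
#include <cassert>

// Hypothetical parameterized variant of Mat_Mult_Serial from the question:
// C (Am x Bn) = A (Am x An) * B (An x Bn), using the same double** layout.
void mat_mult_serial(double** A, double** B, double** C, int Am, int An, int Bn)
{
    assert(A != 0 && B != 0 && C != 0);
    for (int i = 0; i < Am; i++)
    {
        for (int j = 0; j < Bn; j++)
        {
            C[i][j] = 0.0;  // start from a known value before accumulating
            for (int k = 0; k < An; k++)
            {
                // Same statement as in the parallel version, so both
                // loops do the same memory traffic.
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

With this version both timings measure the same work, and the result is actually stored in C.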

Upvotes: 0