Reputation: 187
I am new to OpenMP and am trying desperately to learn. I have tried to write an example code in C++ in visual studio 2012 to implement matrix multiplication. I was hoping someone with OpenMP experience could take a look at this code and help me to obtain the ultimate speed / parallelization for this:
#include <iostream>
#include <stdlib.h>
#include <omp.h>
#include <random>
using namespace std;
#define NUM_THREADS 4
// Program Variables
double** A;
double** B;
double** C;
double t_Start;
double t_Stop;
int Am;
int An;
int Bm;
int Bn;
// Program Functions
void Get_Matrix();
void Mat_Mult_Serial();
void Mat_Mult_Parallel();
void Delete_Matrix();
int main()
{
printf("Matrix Multiplication Program\n\n");
cout << "Enter Size of Matrix A: ";
cin >> Am >> An;
cout << "Enter Size of Matrix B: ";
cin >> Bm >> Bn;
Get_Matrix();
Mat_Mult_Serial();
Mat_Mult_Parallel();
system("pause");
return 0;
}
void Get_Matrix()
{
A = new double*[Am];
B = new double*[Bm];
C = new double*[Am];
for ( int i=0; i<Am; i++ ){A[i] = new double[An];}
for ( int i=0; i<Bm; i++ ){B[i] = new double[Bn];}
for ( int i=0; i<Am; i++ ){C[i] = new double[Bn]; }
for ( int i=0; i<Am; i++ )
{
for ( int j=0; j<An; j++ )
{
A[i][j]= rand() % 10 + 1;
}
}
for ( int i=0; i<Bm; i++ )
{
for ( int j=0; j<Bn; j++ )
{
B[i][j]= rand() % 10 + 1;
}
}
printf("Matrix Create Complete.\n");
}
void Mat_Mult_Serial()
{
t_Start = omp_get_wtime();
for ( int i=0; i<Am; i++ )
{
for ( int j=0; j<Bn; j++ )
{
double temp = 0;
for ( int k=0; k<An; k++ )
{
temp += A[i][k]*B[k][j];
}
}
}
t_Stop = omp_get_wtime() - t_Start;
cout << "Serial Multiplication Time: " << t_Stop << " seconds" << endl;
}
void Mat_Mult_Parallel()
{
int i,j,k;
t_Start = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(i,j,k) schedule(dynamic)
for ( i=0; i<Am; i++ )
{
for ( j=0; j<Bn; j++ )
{
//double temp = 0;
for ( k=0; k<An; k++ )
{
C[i][j] += A[i][k]*B[k][j];
}
}
}
t_Stop = omp_get_wtime() - t_Start;
cout << "Parallel Multiplication Time: " << t_Stop << " seconds." << endl;
}
void Delete_Matrix()
{
for ( int i=0; i<Am; i++ ){ delete [] A[i]; }
for ( int i=0; i<Bm; i++ ){ delete [] B[i]; }
for ( int i=0; i<Am; i++ ){ delete [] C[i]; }
delete [] A;
delete [] B;
delete [] B;
}
Upvotes: 1
Views: 7402
Reputation: 4685
My examples are based on a matrix class I created for parallel teaching. If you are interested feel free to contact me. There are several ways to speedup your matrix multiplication :
Use a one dimension array in row major order for accessing the element in a faster way.
You can access to A(i,j) with A[i * An + j]
for (int i = 0; i < m; i ++)
for (int j = 0; j < p; j ++)
{
Scalar sigma = C(i, j);
for (int k = 0; k < n; k ++)
sigma += (*this)(i, k) * B(k, j);
C(i, j) = sigma;
}
This prevents to recompute C(i,j) several times in the most inner loop.
for (int i = 0; i < m; i ++)
for (int k = 0; k < n; k ++)
{
Aik = (*this)(i, k);
for (int j = 0; j < p; j ++)
C(i, j) += Aik * B(k, j);
}
This allows to play with spatial data locality
for(int ii = 0; ii < m; ii += block_size)
for(int jj = 0; jj < p; jj += block_size)
for(int kk = 0; kk < n; kk += block_size)
#pragma omp parallel for // I think this is the best place for this case
for(int i = ii; i < ii + block_size; i ++)
for(int k = kk; k < kk + block_size; k ++)
{
Scalar Aik = (*this)(i, k);
for(int j = jj; j < jj + block_size; j ++)
C(i, j) += Aik * B(k, j);
}
This can use better temporal data locality. The optimal block_size depends on your architecture and matrix size.
Generally, the #pragma omp parallel for should be done a the most outter loop. Maybe using two parallel loop at the two first outter loops can give better results. It depends then on the architecture you use, the matrix size... You have to test ! Since the matrix multiplication has a static workload I would use a static schedule.
You can do loop nest optimization. You can vectorize your code. You can take look at how BLAS do it.
Upvotes: 3
Reputation: 1
I am very new to OpenMP and this code is very instructive. However I found an error in the serial version that gives it an unfair speed advantage over the parallel version.
Instead of writing C[i][j] += A[i][k]*B[k][j];
as you do in the parallel version, you have written temp += A[i][k]*B[k][j];
in the serial version. This is much faster (but doesn't help you compute the C matrix). So you're not comparing apples to apples, which makes the parallel code seem slower by comparison. When I fixed this line and ran it on my laptop (which allows 2 threads), the parallel version was almost twice as fast. Not bad!
Upvotes: 0