Reputation: 23
I'm a new OpenMP programmer and I've run into a problem multiplying two matrices. Below is my parallel code, but it is not as fast as I expected. For example, I give it two 3000 x 3000 matrices, my Domain is 2 (so the random numbers are 0 or 1), and the parallel version is slower than the sequential one.
clock_t tStart = clock();
cout << (char)169 << " parallel " << (char)170 << endl;

int a, b, c, Domain;
cin >> a >> b >> c >> Domain;
srand(time(0));

int **arr1;
int **arr2;
int **arrRet;

arr1 = new int*[a];
#pragma omp for schedule (dynamic)
for (int i = 0; i < a; i++)
    arr1[i] = new int[b];

arr2 = new int*[b];
#pragma omp for schedule (dynamic)
for (int i = 0; i < b; i++)
    arr2[i] = new int[c];

arrRet = new int*[a];
#pragma omp for schedule (dynamic)
for (int i = 0; i < a; i++)
    arrRet[i] = new int[c];

#pragma omp for schedule (dynamic)
for (int i = 0; i < a; i++)
{
    #pragma omp for schedule (dynamic)
    for (int j = 0; j < b; j++)
    {
        arr1[i][j] = rand() % Domain;
    }
}
//cout<<"\n\n\n";
#pragma omp for schedule (dynamic)
for (int i = 0; i < b; i++)
{
    #pragma omp for schedule (dynamic)
    for (int j = 0; j < c; j++)
    {
        arr2[i][j] = rand() % Domain;
    }
}
//cout<<"\n\n\n";
#pragma omp for schedule (dynamic)
for (int i = 0; i < a; i++)
    #pragma omp for schedule (dynamic)
    for (int j2 = 0; j2 < c; j2++)
    {
        int sum = 0;
        #pragma omp parallel for shared(sum) reduction(+:sum)
        for (int j = 0; j < b; j++)
        {
            sum += arr1[i][j] * arr2[j][j2];
        }
        arrRet[i][j2] = sum;
    }

printf("Time taken : %.4fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
Upvotes: 0
Views: 1224
Reputation: 3096
There are many highly optimized linear algebra libraries that are free to use. I strongly suggest using one of them whenever possible.
Your performance degradation may be produced by many reasons. The following list details some of the most common causes:
Use of schedule(dynamic) when the amount of work per iteration is completely balanced. Omitting the clause typically selects a static schedule, which is more appropriate for this type of parallelization.
Excessive pressure on the memory allocator. You don't actually need to reserve a separate memory region for each row of a matrix. Since the matrix size does not change in your program, you can perfectly well use a single allocation per matrix. This also improves data locality, as consecutive rows are contiguous in memory. You can then access each element as A[ i * b + j ], where b is the number of columns.
int *A = (int *) malloc( a * b * sizeof(int) );
In your code, you seem to have missed a parallel region. As a consequence, all of the omp for constructs, with the exception of the last one, are not executed by multiple threads.
Merge your omp for constructs in nested loops using collapse(2), as in the following example:
// inside an enclosing "parallel" region
#pragma omp for collapse(2)
for ( int i = 0; i < a; i++ ) {
    for ( int j = 0; j < b; j++ ) {
        // your parallel code
    }
}
Upvotes: 1