Reputation: 70
I have a C++ program that multiplies two matrices, and I have to use OpenMP. This is what I have so far: https://pastebin.com/wn0AXFBG
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
int n = 1;
int Matrix1[1000][100];
int Matrix2[100][2];
int Matrix3[1000][2];
int sum = 0;
ofstream fr("rez.txt");
double t1 = omp_get_wtime();
omp_set_num_threads(n);
#pragma omp parallel for collapse(2) num_threads(n)
for ( int i = 0; i < 10; i++) {
for ( int j = 0; j < 10; j++) {
Matrix1[i][j] = i * j;
}
}
#pragma omp simd
for (int i = 0; i < 100; i++) {
for (int j = 0; j < 2; j++) {
int t = rand() % 100;
if (t < 50) Matrix2[i][j] = -1;
if (t >= 50) Matrix2[i][j] = 1;
}
}
#pragma omp parallel for collapse(3) num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
for (int cj = 0; cj < 2; cj++) {
for (int i = 0; i < 100; i++) {
if(i==0) Matrix3[ci][cj] = 0;
Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
}
}
}
double t2 = omp_get_wtime();
double time = t2 - t1;
fr << time;
return 0;
}
The problem is that I get the same execution time whether I use 1 thread or 8. Pictures of timing added.
I have to show that the time is reduced by a factor of close to 8. I am using the Intel C++ compiler with OpenMP turned on. Please advise.
Upvotes: 1
Views: 791
Reputation: 26
First of all, I think there is a small bug in your program: when you initialize the entries of matrix 1 with Matrix1[i][j] = i * j, the loops over i and j do not go up to 1000 and 100 respectively (both stop at 10), so most of Matrix1 is never initialized.
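As a minimal sketch of the corrected initialization, the loop bounds below follow the matrix dimensions instead of a hard-coded 10. The helper name make_matrix1 and the use of std::vector in place of the original stack arrays are assumptions made for illustration only.

```cpp
#include <vector>

// Hypothetical helper (not in the original program): builds a rows x cols
// matrix whose entry (i, j) is i * j, covering every row and column.
std::vector<std::vector<int>> make_matrix1(int rows, int cols) {
    std::vector<std::vector<int>> m(rows, std::vector<int>(cols));
    // Each (i, j) pair writes to a distinct element, so the two loops
    // can be collapsed and shared among threads without a race.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            m[i][j] = i * j;
        }
    }
    return m;
}
```

Calling make_matrix1(1000, 100) would fill the whole of matrix 1 rather than only its top-left 10 x 10 corner.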
Also, I am not sure whether your computer actually has 8 logical cores.
If it has fewer, the runtime will still create 8 threads, and each logical core will have to context-switch between several of them, which hurts performance and increases execution time. So check how many logical cores are actually available, and pass a value less than or equal to that number to num_threads().
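One way to check this from inside the program is omp_get_num_procs(), which reports how many processors the OpenMP runtime can use. The sketch below caps a requested thread count at that number. The helper name clamp_thread_count is an assumption, and the fallback to a single thread when OpenMP is not enabled is a choice made only to keep the example compilable either way.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns a thread count that is at least 1 and no
// larger than the number of processors available to the OpenMP runtime.
int clamp_thread_count(int requested) {
#ifdef _OPENMP
    int available = omp_get_num_procs();
#else
    int available = 1;  // assumption: without OpenMP the code runs on one thread
#endif
    if (requested < 1) return 1;
    return requested > available ? available : requested;
}
```

The result could then be passed to num_threads() or omp_set_num_threads() in place of a hard-coded 8.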
Now, coming to the question: the collapse clause fuses all the loops into one and divides the iterations of that fused loop among p threads. I am not sure how it handles race conditions, but if you parallelize the innermost loop without fusing all 3 loops, there is a race condition, because each thread will try to update Matrix3[ci][cj] concurrently, and some synchronization mechanism, such as atomic or a reduction clause, is needed to ensure correctness.
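To illustrate the reduction option, here is a minimal sketch that computes a single entry of the product with the innermost loop parallelized. The helper name product_entry and the std::vector matrices are assumptions. With only 100 iterations this is unlikely to be faster than the serial loop; it is shown only to make the synchronization concrete.

```cpp
#include <vector>

// Hypothetical helper: computes C[ci][cj] = sum over i of A[ci][i] * B[i][cj].
int product_entry(const std::vector<std::vector<int>>& A,
                  const std::vector<std::vector<int>>& B,
                  int ci, int cj) {
    const int inner = static_cast<int>(B.size());
    int sum = 0;
    // reduction(+:sum) gives every thread a private copy of sum and adds
    // the copies together at the end, so no two threads write to the
    // same variable at the same time.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < inner; i++) {
        sum += A[ci][i] * B[i][cj];
    }
    return sum;
}
```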
I am pretty sure that you can parallelize the outer loop without any race condition, since each value of ci writes only to its own row of Matrix3, and get a speedup close to the number of processors you use (again, as long as that number is less than or equal to the number of logical cores). I would suggest changing that segment of your code as below.
// You can also use this function to set number of threads:
// omp_set_num_threads(n);
#pragma omp parallel for num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
for (int cj = 0; cj < 2; cj++) {
for (int i = 0; i < 100; i++) {
if(i==0) Matrix3[ci][cj] = 0;
Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
}
}
}
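The same idea can be written as a self-contained function, sketched here under a few assumptions: the name multiply, the Matrix alias, and the threads parameter are hypothetical, std::vector replaces the fixed-size arrays, and a local accumulator replaces the if(i==0) reset. Note that for matrices as small as 1000 x 100 times 100 x 2, the cost of starting the threads may be comparable to the work itself, so the measured speedup could stay well below the thread count.

```cpp
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical helper: returns A * B, parallelizing only the outer loop.
// threads must be at least 1.
Matrix multiply(const Matrix& A, const Matrix& B, int threads) {
    const int rows = static_cast<int>(A.size());
    const int inner = static_cast<int>(B.size());
    const int cols = inner > 0 ? static_cast<int>(B[0].size()) : 0;
    Matrix C(rows, std::vector<int>(cols, 0));
    // Each iteration of ci writes only to row ci of C, so threads never
    // touch the same element and no synchronization is needed.
    #pragma omp parallel for num_threads(threads)
    for (int ci = 0; ci < rows; ci++) {
        for (int cj = 0; cj < cols; cj++) {
            int sum = 0;  // private to the thread running this iteration
            for (int i = 0; i < inner; i++) {
                sum += A[ci][i] * B[i][cj];
            }
            C[ci][cj] = sum;
        }
    }
    return C;
}
```

Timing only the call to multiply with omp_get_wtime(), rather than the whole program including the serial rand() initialization, should give a clearer picture of how the multiplication itself scales.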
Upvotes: 1