DOSMarter

Reputation: 1523

OpenMP parallel performance

I'm trying to compute a distance matrix in parallel using OpenMP: for each point I calculate the distance to every other point, so the best algorithm I've come up with so far costs O(n^2). However, the running time of my OpenMP version, using 10 threads on an 8-processor machine, is no better than the serial approach. This is my first time using OpenMP, so I wonder whether there is a mistake in my implementation. If there is a mistake in my approach, or a better (faster) approach, please let me know. The following is my code, where "dat" is a vector that contains the data points.

map<int, map<int, double> > dist;   // construct the distance matrix

int c = count(dat.at(0).begin(), dat.at(0).end(), delm) + 1;

#pragma omp parallel for shared(c, dist)
for (int p = 0; p < dat.size(); p++)
{
    for (int j = p + 1; j < dat.size(); j++)
    {
        double ecl = 0;

        string line1 = dat.at(p);
        string line2 = dat.at(j);

        for (int i = 0; i < c; i++)
        {
            double num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
            line1 = line1.substr(line1.find_first_of(delm) + 1);

            double num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
            line2 = line2.substr(line2.find_first_of(delm) + 1);

            ecl += (num1 - num2) * (num1 - num2);
        }

        ecl = sqrt(ecl);

#pragma omp critical
        {
            dist[p][j] = ecl;
            dist[j][p] = ecl;
        }
    }
}

Upvotes: 1

Views: 804

Answers (2)

Jason

Reputation: 304

As already pointed out, a critical section slows things down because only one thread at a time is allowed to execute it. There is no need for a critical section here: provided the matrix is allocated to its full size up front (so no thread has to grow the container while others are writing), each thread writes to mutually exclusive elements, and reading data that is never modified obviously doesn't need protection.

My suspicion as to the slowness of the code comes down to uneven work distribution over the threads. By default I believe OpenMP divides the iterations equally among threads (static scheduling). As an example, consider 8 threads and 8 points:

- thread 0 will get 7 distance calculations
- thread 1 will get 6 distance calculations
- ...
- thread 7 will get 0 distance calculations

Even with more iterations, a similar imbalance still exists. If you need to convince yourself, make a thread-private counter to track how many distance calculations are actually done by each thread.
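For illustration, here is a minimal sketch of such a counter (the work vector is a hypothetical addition, not part of the original code); each thread only increments its own slot, so the counts need no synchronization:

#include <omp.h>     // omp_get_max_threads, omp_get_thread_num
#include <vector>
#include <cstdio>

// ... inside the function that builds the matrix ...
std::vector<long> work(omp_get_max_threads(), 0);   // one counter per thread

#pragma omp parallel for
for (int p = 0; p < (int)dat.size(); p++)
{
    for (int j = p + 1; j < (int)dat.size(); j++)
    {
        // ... existing distance computation ...
        work[omp_get_thread_num()]++;   // each thread touches only its own slot
    }
}

for (size_t t = 0; t < work.size(); ++t)
    printf("thread %zu did %ld distance calculations\n", t, work[t]);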

With work-sharing constructs like parallel for, you can specify various work distribution strategies. In your case it is probably best to go with

#pragma omp parallel for schedule(guided)

With guided scheduling, when a thread requests a chunk of iterations of the for loop, it gets (roughly) the number of iterations not yet handed out to any thread divided by the number of threads. So initially the chunks are big, and later they get smaller. It's a form of automatic load balancing, though mind you there is some (probably small) overhead in handing out iterations to the threads dynamically.

To avoid the first thread getting an unfairly large amount of work, your loop structure should be changed so that lower iterations have fewer calculations, e.g. change the inner for loop to

for (j = 0; j < p; j++)
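Putting the two suggestions together, a minimal sketch of the restructured loop might look like the following (keeping the original string parsing, and assuming dist has been pre-sized, e.g. as a vector of vectors, so no critical section is needed):

#pragma omp parallel for schedule(guided) shared(c, dist)
for (int p = 0; p < (int)dat.size(); p++)
{
    // Lower values of p now do fewer inner iterations, which pairs well with
    // guided scheduling: the large early chunks contain the cheap iterations.
    for (int j = 0; j < p; j++)
    {
        double ecl = 0;
        string line1 = dat.at(p);
        string line2 = dat.at(j);
        for (int i = 0; i < c; i++)
        {
            double num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
            line1 = line1.substr(line1.find_first_of(delm) + 1);
            double num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
            line2 = line2.substr(line2.find_first_of(delm) + 1);
            ecl += (num1 - num2) * (num1 - num2);
        }
        dist[p][j] = dist[j][p] = sqrt(ecl);   // distinct elements per thread, no lock needed
    }
}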

Another thing to consider is that when working with a lot of cores, memory can become the bottleneck. You have 8 processors fighting over probably 2 or maybe 3 channels of DRAM (separate memory sticks on the same channel still compete for bandwidth). The on-chip CPU cache is at best shared between all the processors, so you still have no more cache than the serial version of this program.

Upvotes: 2

ildjarn

Reputation: 62995

#pragma omp critical has the effect of serializing your loop, so getting rid of it should be your first goal. This should be a step in the right direction:

ptrdiff_t const c = count(dat[0].begin(), dat[0].end(), delm) + 1;
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));

#pragma omp parallel for
for (size_t p = 0; p < dat.size(); ++p)   // a work-shared loop needs a <, <=, >, or >= condition
{
  for (size_t j = p + 1; j != dat.size(); ++j)
  {
    double ecl = 0.0;
    string line1 = dat[p];
    string line2 = dat[j];
    for (ptrdiff_t i = 0; i != c; ++i)
    {
      double const num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
      double const num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());

      line1 = line1.substr(line1.find_first_of(delm) + 1);
      line2 = line2.substr(line2.find_first_of(delm) + 1);
      ecl += (num1 - num2) * (num1 - num2);
    }

    ecl = sqrt(ecl);
    dist[p][j] = ecl;
    dist[j][p] = ecl;
  }
}

There are a few other obvious things that could be done to make this faster overall, but fixing your parallelization is the most important thing.
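As one example of such an improvement (a sketch only, not part of the answer above): the strings are currently re-split for every pair, which is O(n²·c) string work; parsing each line into numbers once, before the pairwise loop, leaves only arithmetic inside it. This assumes the same delimiter handling as the original code:

// Parse each point once into a vector of doubles.
vector<vector<double> > pts(dat.size());
for (size_t p = 0; p < dat.size(); ++p)
{
  string line = dat[p];
  for (ptrdiff_t i = 0; i < c; ++i)
  {
    pts[p].push_back(atof(line.substr(0, line.find_first_of(delm)).c_str()));
    line = line.substr(line.find_first_of(delm) + 1);
  }
}

// The pairwise loop is now pure arithmetic, with no string manipulation.
#pragma omp parallel for
for (size_t p = 0; p < dat.size(); ++p)
  for (size_t j = p + 1; j < dat.size(); ++j)
  {
    double ecl = 0.0;
    for (ptrdiff_t i = 0; i < c; ++i)
      ecl += (pts[p][i] - pts[j][i]) * (pts[p][i] - pts[j][i]);
    dist[p][j] = dist[j][p] = sqrt(ecl);
  }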

Upvotes: 2
