Luc Nguyen

Reputation: 101

How can I speed up C++ code using OpenMP?

I am trying to parallelize the following C++ code using OpenMP:

int np = 1000000;
double kk = 1 / pow(2 * pi, 2);
for (int kes = 1; kes <= 100; kes++) {
  double E1 = 0;
  #pragma omp parallel for reduction(+: E1)
  for (int ies = 0; ies < np; ies++) {
    for (int jes = 0; jes < np; jes++) {
      if (ies != jes) {
        float distanes = sqrt(pow(xp[ies] - xp[jes], 2) + pow(yp[ies] - yp[jes], 2) + pow(zp[ies] - zp[jes], 2));
        float distan = kes * distanes;
        if (distan <= 5) {
          float gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];
          E1 = E1 + kk * gpspec * sin(kes * distanes) / (kes * distanes);
        }
      }
    }
  }
  Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1;
}

This code is parallelized. However, the calculation time is still terrible. How can I speed up this O(n^2) calculation? xp, yp, zp, gpx, gpy, and gpz are 1-dimensional vectors.

Upvotes: 2

Views: 686

Answers (2)

paddy

Reputation: 63451

There is no real answer to this question, but I'd like to distill some of the more important optimizations discussed in the comments. Let's focus on just the inner loops.

Primarily, you need to avoid excessive multiplications and function calls. And there are some tricks that aren't guaranteed to be optimized by compilers. For example, we know intuitively that pow(x, 2) just squares a value, but if your compiler doesn't optimize this, then it's much less efficient than simply x * x.

Further, it was identified that the work of the O(N^2) loop can actually be halved, because distances are symmetric. This is a big deal if you're calling expensive things like pow and sqrt. You can just scale the final result of E1 by 2 to compensate for halving the number of calculations.

And on the subject of sqrt, it was also identified that you don't need to do that before your distance test. Do it after, because the test sqrt(d) < 5 is the same as d < 25.

Let's go even further, beyond the comments. Notice that the < 5 test actually relies on a multiplication involving kes. If you precomputed a distance-squared value that also incorporates the kes scaling, then you have even fewer multiplications.

You can also remove the kk value from the E1 calculation. That doesn't need to happen in a loop... probably. By that, I mean you're likely to have floating point error in all these calculations. So every time you change something, your final result might be slightly different. I'm gonna do it anyway.

So... After that introduction, let's go!

for (int kes = 1; kes <= 100; kes++)
{
  double E1 = 0;
  const float distance_thresh = 5.0f / kes;
  const float distance_sq_thresh = distance_thresh * distance_thresh;

#pragma omp parallel for reduction(+: E1)
  for (int ies = 0; ies < np; ies++)
  {
    // Declare the loop variables here so each thread gets its own;
    // reusing variables shared across threads would be a data race.
    for (int jes = ies + 1; jes < np; jes++)
    {
      float dx = xp[ies] - xp[jes];
      float dy = yp[ies] - yp[jes];
      float dz = zp[ies] - zp[jes];

#if 0
      // From Tanveer Badar's suggestion, if distances are generally large.
      // Note: compare against the plain threshold, not the squared one.
      // This may be detrimental for large values of kes.
      if (fabs(dx) > distance_thresh ||
          fabs(dy) > distance_thresh ||
          fabs(dz) > distance_thresh)
      {
        continue;
      }
#endif

      float distance_sq = dx * dx + dy * dy + dz * dz;
      if (distance_sq <= distance_sq_thresh)
      {
        float distan = kes * sqrt(distance_sq);
        float gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];
        E1 = E1 + gpspec * sin(distan) / distan;
      }
    }
  }

  Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1 * kk * 2.0f;
}

Now, this I guarantee will be faster than your existing code by a significant amount.

If you want to go further, in each kes loop, you could pre-generate a large table of values for sin(distan) / distan for the possible range, and just index into it when you need it. Generally, trig operations and divisions are slow. So if you can work out your acceptable error tolerance and create a sufficiently large pre-calculated table, this might be a good optimization too.
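As a sketch of that idea: after the distance test, distan = kes * distance is always in (0, 5], so a table over that range covers every lookup. The table size below is an arbitrary assumption; size it against your error tolerance, and consider interpolating between entries if nearest-entry lookup is too coarse.

#include <cmath>
#include <vector>

const float MAX_X = 5.0f;          // distan never exceeds 5 after the test
const int TABLE_SIZE = 1 << 16;    // assumption: tune for error tolerance

std::vector<float> make_sinc_table()
{
    std::vector<float> table(TABLE_SIZE);
    for (int i = 0; i < TABLE_SIZE; i++)
    {
        float x = (i + 0.5f) * MAX_X / TABLE_SIZE;  // bin centre, never zero
        table[i] = std::sin(x) / x;
    }
    return table;
}

// Inside the inner loop, replace sin(distan) / distan with:
//   int idx = (int)(distan * (TABLE_SIZE / MAX_X));
//   if (idx >= TABLE_SIZE) idx = TABLE_SIZE - 1;
//   E1 = E1 + gpspec * sinc_table[idx];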


Update

You have posted an answer taking user dmuir's suggestion of running the kes loop as the inner loop to avoid repeating expensive calculations. However, in the process you also abandoned some of the principles that I laid out in my answer. I commented there, but let me just put them down in code for you.

First, pre-calculate the squared-distance thresholds:

const double max_distance = 5.0;
double distance_sq_thresh[101] = {0};
for (int kes = 1; kes <= 100; kes++)
{
    distance_sq_thresh[kes] = max_distance * max_distance / kes / kes;
}

Now, the main section:

const int np = 1000000;
double E1[101] = {0};

for (int ies = 0; ies < np; ies++){
    for (int jes = ies+1; jes < np; jes++){
        double dxp = xp[ies] - xp[jes];
        double dyp = yp[ies] - yp[jes];
        double dzp = zp[ies] - zp[jes];
        double distance_sq = dxp * dxp + dyp * dyp + dzp * dzp;

        // If the first kes iteration won't pass the test, none will
        if (distance_sq > distance_sq_thresh[1])
            continue;

        const double distance = sqrt(distance_sq);
        const double gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];

        // We can keep adding distance to 'distan' instead of multiplying in a loop
        double distan = 0.0;

        for (int kes = 1; kes <= 100; kes++){
            // We know that the threshold decreases as kes increases, so break early
            if (distance_sq > distance_sq_thresh[kes])
                break;

            distan += distance;  // distan is now kes * distance; adding before
                                 // the divide avoids sin(0)/0 on kes = 1
            E1[kes] = E1[kes] + gpspec * sin(distan) / distan;
        }
    }
}

And finally, apply the results. Since there are only 100 of them, there's not really any point in making this part parallel.

const double kk = 1.0 / pow(2.0 * pi, 2.0);
for (int kes = 1; kes <= 100; kes++){
    Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1[kes] * kk * 2.0f;
}

You can see that with this approach, a naive parallelization of the outer loop is no longer safe, because multiple threads might be writing to the same E1 element simultaneously. OpenMP 4.5 did add support for reductions over array sections in C and C++, so if your compiler is recent enough you can still use a reduction clause here.
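For instance, here is a minimal, self-contained sketch assuming OpenMP 4.5 array-section reductions are available (GCC 6 or later supports them); the inner accumulation is a stand-in for the real one:

#include <cstdio>
#include <omp.h>

int main()
{
    double E1[101] = {0};

    // reduction(+: E1[:101]) gives each thread a private copy of all
    // 101 elements and sums the copies when the loop ends.
    #pragma omp parallel for reduction(+: E1[:101])
    for (int ies = 0; ies < 1000; ies++)
    {
        for (int kes = 1; kes <= 100; kes++)
            E1[kes] += 1.0;  // stands in for gpspec * sin(distan) / distan
    }

    printf("E1[1] = %g\n", E1[1]);  // expect 1000
    return 0;
}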

Alternatively, you could allocate one of these arrays for each OpenMP thread and use the thread's ordinal to select the array. Then add the arrays together at the end. Essentially rolling your own array reduction.
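Here is a sketch of that hand-rolled reduction, with the expensive jes/kes loops elided; np and the kes range 1..100 are carried over from the code above, and the dynamic schedule is there because the triangular jes loop makes iterations uneven:

#include <vector>
#include <omp.h>

std::vector<std::vector<double>> E1_local;
double E1[101] = {0};

#pragma omp parallel
{
    #pragma omp single
    E1_local.assign(omp_get_num_threads(), std::vector<double>(101, 0.0));
    // The implicit barrier after 'single' means every thread waits
    // until the per-thread arrays exist.

    // Select this thread's array by its ordinal.
    std::vector<double> &mine = E1_local[omp_get_thread_num()];

    #pragma omp for schedule(dynamic)
    for (int ies = 0; ies < np; ies++)
    {
        // ... jes and kes loops as above, accumulating into mine[kes] ...
    }
}

// Fold the per-thread arrays into the final result.
for (const auto &local : E1_local)
    for (int kes = 1; kes <= 100; kes++)
        E1[kes] += local[kes];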

Or you could even go low-tech, by defining a local array for E inside the outer-most loop, knowing that it is local to the current OpenMP worker. Then, after the jes loop, you can acquire a lock on the main E1 array and update it from these local values. Since the jes loop is doing up to a million expensive calculations, and the lock-and-update is only doing 100 additions, I expect that most of the time no threads will be blocked on the lock.
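In code, that looks something like this (again a sketch, with the accumulation elided; E1 and np are as above):

#pragma omp parallel for schedule(dynamic)
for (int ies = 0; ies < np; ies++)
{
    double E_local[101] = {0};  // private to this iteration's thread

    for (int jes = ies + 1; jes < np; jes++)
    {
        // ... distance tests and kes loop as above, accumulating
        // into E_local[kes] instead of E1[kes] ...
    }

    // About 100 additions under the lock versus up to a million
    // expensive jes iterations outside it, so contention is rare.
    #pragma omp critical
    for (int kes = 1; kes <= 100; kes++)
        E1[kes] += E_local[kes];
}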

Upvotes: 5

Luc Nguyen

Reputation: 101

Based on suggestions by @paddy and @dmuir (make the kes loop the innermost loop and start jes from ies+1), I modified the code as below:

double kk = 1 / pow(2 * pi, 2);
int np = 1000000;
//#pragma omp parallel for reduction(+: E1)
for (int ies = 0; ies < np; ies++){
    for (int jes = ies+1; jes < np; jes++){

        double dxp = xp[ies] - xp[jes];
        double dyp = yp[ies] - yp[jes];
        double dzp = zp[ies] - zp[jes];
        double distance = sqrt( dxp * dxp + dyp * dyp + dzp * dzp );
        double gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];

        for (int kes = 1; kes <= 100; kes++){

            double distance_thresh = 5.0/kes;
            if (distance <= distance_thresh){

                double distan = kes * distance;
                E1[kes] = E1[kes] + gpspec * sin(distan) / distan;
            }
        }
    }
}

#pragma omp parallel for
for (int kes = 1; kes <= par_spec; kes++){

    Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1[kes] * kk * 2.0f;
}

It seems to be better. However, how can it be parallelized?

Upvotes: 0
