Nancy

Reputation: 11

Parallelize Nested Loops using OpenMP

I tried to parallelize the nested loops below using OpenMP, but I am not sure this is the correct way to do it. Here is the relevant part of the code (it is just generic code). With noofrecords set to 50k it takes a lot of time even after parallelization. Can someone suggest a better way to parallelize this code? I am only parallelizing the outer loop in the code below.

int ctd=0;
#pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd)
for(int j=0; j<noofrecords; j++)
{
    for(int k=0; k<noofrecords; k++)
    {
        for(int l=0; l<noofrecords; l++)
        {
            if(condition)
            {
                D[ctd].a1 = A[i].a1;
                ctd++;
            }
        }
    }
}

Upvotes: 1

Views: 608

Answers (2)

Z boson

Reputation: 33699

  1. Create a temporary array a1, of the same type as D.a1, with a number of elements equal to the maximum value you expect ctd to reach.
  2. Create a private temporary array a2 of the same type for each thread.
  3. Fill a2 in parallel, using a private counter ctd2 to count the number of elements written to a2.
  4. Copy each thread's a2 into a1 in thread order and add ctd2 to ctd.
  5. Write from a1 back to D.a1 in parallel.

Something like this

#include <stdlib.h>  /* malloc, free */
#include <string.h>  /* memcpy */
#include <omp.h>     /* omp_get_num_threads */

int ctd=0;
double *a1 = malloc(sizeof *a1 * N);                       //Step 1
#pragma omp parallel
{
  int ctd2 = 0;
  double *a2 = malloc(sizeof *a2 * N);                     //step 2

  #pragma omp for nowait
  for(int j=0; j<noofrecords; j++)
  for(int k=0; k<noofrecords; k++)
  for(int l=0; l<noofrecords; l++)
    if(condition) a2[ctd2++] = A[i].a1;                    //step 3

  #pragma omp for schedule(static) ordered
  for(int i=0; i<omp_get_num_threads(); i++)
    #pragma omp ordered
    memcpy(&a1[ctd], a2, sizeof *a1 * ctd2), ctd += ctd2;  //step 4

  #pragma omp for
  for(int j=0; j<ctd; j++) D[j].a1 = a1[j];                // step 5
  free(a2);
}
free(a1);

N should be the maximum size you expect ctd to reach. One memory inefficiency of this method is that each per-thread a2 is also allocated with size N, which may be much larger than necessary. A dynamic array such as std::vector in C++ or GArray in glib would be more memory efficient.
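To stay in plain C like the snippet above, a minimal sketch of such a growable per-thread buffer could look like the following; push_back, cap2 and the initial capacity are names and values introduced purely for illustration (error handling omitted):

#include <stdlib.h>

/* Illustrative helper: append v to a growable buffer, doubling its
   capacity whenever it is full (a plain-C stand-in for std::vector). */
static double *push_back(double *buf, int *len, int *cap, double v)
{
    if (*len == *cap) {
        *cap = *cap ? 2 * *cap : 1024;   /* arbitrary initial capacity */
        buf = realloc(buf, sizeof *buf * *cap);
    }
    buf[(*len)++] = v;
    return buf;
}

Each thread would then start with int ctd2 = 0, cap2 = 0; double *a2 = NULL; and replace a2[ctd2++] = A[i].a1; with a2 = push_back(a2, &ctd2, &cap2, A[i].a1); so that memory only grows to what each thread actually needs.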

Upvotes: 0

L.C.

Reputation: 1145

You can use the collapse clause. In your case you have 3 perfectly nested for loops, so it would be something like:

#pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd) collapse(3)

It will work as long as the loops are perfectly nested, i.e. all the work is in the innermost loop body (which is the case in the code you posted). If noofrecords is much bigger than your maximum thread count, though, the extra speedup from collapsing won't be impressive. And if it is still slow in parallel, the bottleneck is probably not processing power; more likely it is memory bandwidth (the RAM already working at 100%).

Also, I'm not so sure you really want that private(j,k,l)...
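For concreteness, here is a rough sketch of what that could look like. Note that the atomic capture on ctd is an addition here (the shared ctd++ in the posted code is a data race), the loop variables are declared in the for statements so no private clause is needed, and condition and A[i] are left as placeholders from the question:

int ctd = 0;
#pragma omp parallel for default(none) shared(A, B, C, D, ctd, noofrecords) collapse(3)
for (int j = 0; j < noofrecords; j++)
    for (int k = 0; k < noofrecords; k++)
        for (int l = 0; l < noofrecords; l++)
            if (condition)
            {
                int slot;
                #pragma omp atomic capture
                slot = ctd++;            /* reserve a unique index in D */
                D[slot].a1 = A[i].a1;
            }

One caveat: with the atomic counter the entries land in D in a non-deterministic order; if the original serial order matters, you need something like the per-thread buffers in the other answer.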

Upvotes: 2
