Reputation: 342
I am trying to understand how reduction works in OpenMP. I have this simple code that involves reduction.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int N = 100;
int M = 200;
int O = 300;

double r2() {
    return ((double) rand() / (double) RAND_MAX);
}

int main(void) {
    double S = 0;
    double *K = (double*) calloc(M * N, sizeof(double));

    #pragma omp parallel for collapse(2)
    {
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < N; n++) {
                #pragma omp for reduction(+:S)
                for (int o = 0; o < O; o++) {
                    S += r2() - 0.25;
                }
                K[m * N + n] = S;
            }
        }
    }
}
I get this error message:

test.cc:30:1: error: region cannot be closely nested inside 'parallel for' region;
perhaps you forget to enclose 'omp for' directive into a parallel region?
#pragma omp for reduction(+:S)
^
It compiles if I do
#pragma omp parallel for reduction(+:S)
Is this the right way to do a nested loop?
EDIT: I am making a change to the original question: I want the parallel and the sequential code to produce the same result.
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}
Upvotes: 1
Views: 574
Reputation: 51393
IMPORTANT TL;DR: rand is not thread-safe. From the rand man page:

The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call.

For multithreaded code use (for instance) rand_r instead.
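A minimal thread-safe replacement for r2() could look like the following sketch (the name r2_r and the seeding scheme are just one possible choice, not a fixed API):

#include <stdlib.h>   /* rand_r */

/* Thread-safe variant of r2(): the caller owns the seed, so no hidden
   shared state is modified. Name and seeding are illustrative. */
double r2_r(unsigned int *seed) {
    return ((double) rand_r(seed) / (double) RAND_MAX);
}

/* Usage inside a parallel region:
       unsigned int myseed = omp_get_thread_num();
       double r = r2_r(&myseed);                                        */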
I am trying to understand how reduction works in OpenMP.

For the sake of argument, let us assume that r2() always yields the same values.
When multiple threads concurrently modify the same variable, as in the following code:
double S = 0;
#pragma omp parallel
for (int o = 0; o < O; o++) {
    S += r2() - 0.25;
}
one has a race condition on the updates of the variable S. To solve it, one can use the OpenMP reduction clause, about which the OpenMP standard says:
The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.
In that case the code would look like the following:

#pragma omp parallel for reduction(+:S)
for (int o = 0; o < O; o++) {
    S += r2() - 0.25;
}
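For a quick sanity check of the clause, here is a minimal self-contained example (my own illustration, with a deterministic summand instead of r2(), so the parallel and the sequential sums must match exactly):

#include <stdio.h>

int main(void) {
    const int O = 300;
    double S_seq = 0, S_par = 0;

    for (int o = 0; o < O; o++)            /* sequential reference */
        S_seq += o;

    #pragma omp parallel for reduction(+:S_par)
    for (int o = 0; o < O; o++)            /* each thread sums into a private copy */
        S_par += o;

    printf("seq = %.0f, par = %.0f\n", S_seq, S_par);  /* both print 44850 */
    return 0;
}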
However, in your full code
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}
you first divide the iterations of the two outer loops among the threads with #pragma omp parallel for collapse(2), and then you try to divide the iterations of the innermost loop again with another worksharing construct, #pragma omp for. This is not allowed: a worksharing region may not be closely nested inside another worksharing region.
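As a side note on the variant you mention: #pragma omp parallel for reduction(+:S) compiles in that position because it starts a nested parallel region rather than a worksharing construct bound to the outer region, as the sketch below illustrates.

#pragma omp parallel for collapse(2)       /* outer parallel region */
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        /* Nested parallel region: legal, but by default it runs with a
           team of one thread, and S is still shared among (and raced on
           by) the threads of the outer region.                         */
        #pragma omp parallel for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}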
Is this the right way to do a nested loop?
You could do the following parallelization:
#pragma omp parallel for collapse(2) firstprivate(S)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}
There is no race condition because the variable S is private. Moreover, since the iterations of the two outermost loops are divided among the threads, each thread gets a unique pair of m and n iterations; consequently, each thread writes to a unique position of the array K in K[m * N + n].
But the problem is that a version that parallelizes the two outer loops will not yield the same results as its sequential counterpart. This is because
for (int o = 0; o < O; o++) {
    S += r2() - 0.25;
}
K[m * N + n] = S;
creates an implicit dependency across all the iterations of the three loops.
The value of S depends on the order in which the iterations over m, n, and o are executed. Therefore, if you divide the iterations of those loops among threads, the value of S for a given m and n will not be the same as in the sequential execution. Nonetheless, this can be solved by parallelizing only the innermost loop and reducing the variable S:
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp parallel for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += r2() - 0.25;
        }
        K[m * N + n] = S;
    }
}
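A design note on this version: it creates and destroys a parallel region M * N times. If that overhead becomes visible, the parallel region can be hoisted out of the outer loops; a sketch of that shape (the rand() caveat from the TL;DR still applies to r2()):

#pragma omp parallel
{
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            /* All threads run the outer loops and share the worksharing
               construct below; the implicit barrier at its end makes the
               reduced S visible before it is stored.                    */
            #pragma omp for reduction(+:S)
            for (int o = 0; o < O; o++) {
                S += r2() - 0.25;
            }
            #pragma omp single              /* one thread stores the value */
            K[m * N + n] = S;
        }
    }
}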
All of this matters, of course, only if you care about the exact values of S; one might argue that since you are using a function that yields random values, preserving the order of the updates of S is not paramount.
The versions with a thread-safe random number generator
Version 1
#pragma omp parallel
{
    unsigned int myseed = omp_get_thread_num();
    /* firstprivate(S) as in the version above; without it the updates of S race */
    #pragma omp for collapse(2) firstprivate(S)
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            for (int o = 0; o < O; o++) {
                double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
                S += r - 0.25;
            }
            K[m * N + n] = S;
        }
    }
}
Version 2
double *K = (double*) calloc(M * N, sizeof(double));

for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp parallel
        {
            unsigned int myseed = omp_get_thread_num();
            #pragma omp for reduction(+:S)
            for (int o = 0; o < O; o++) {
                double r = ((double) rand_r(&myseed) / (double) RAND_MAX);
                S += r - 0.25;
            }
        }
        K[m * N + n] = S;
    }
}
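For completeness, all of these snippets need OpenMP enabled at compile time; with GCC, for instance (the file name is just an example):

gcc -fopenmp test.c -o test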
EDIT: I am making a change to the original question: I want the parallel and the sequential code to produce the same result.
Instead of:
#pragma omp parallel for collapse(2)
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}
do:
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        #pragma omp parallel for reduction(+:S)
        for (int o = 0; o < O; o++) {
            S += o;
        }
        K[m * N + n] = S;
    }
}
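One can verify the equivalence by hand: each execution of the innermost loop adds 0 + 1 + ... + (O-1) = O*(O-1)/2 to S, and these integer-valued sums are exact in double, so the reduced parallel version must reproduce the sequential result exactly. A small self-contained check (my own snippet, with K omitted for brevity):

#include <stdio.h>

int main(void) {
    const int N = 100, M = 200, O = 300;
    double S = 0;

    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            #pragma omp parallel for reduction(+:S)
            for (int o = 0; o < O; o++) {
                S += o;
            }
        }
    }

    /* closed form: M * N * O * (O - 1) / 2 = 20000 * 44850 = 897000000 */
    printf("S = %.0f (expected 897000000)\n", S);
    return 0;
}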
Upvotes: 2