Chandra550

Reputation: 63

omp parallel for: How to make threads write to private arrays and merge all the arrays once all the threads have finished processing

I have a requirement to calculate z values and accumulate them into the arrays B and s2.

I tried to parallelize the processing using omp parallel for.

One problem I see is that if I don't put the B[i][j] += z and s2[i] += z statements in a critical section, I see a lot of NaN values being generated.

Just wondering if there is a way to write the z values to separate arrays (one array per thread) and merge them at the end.

Any help is greatly appreciated.

#pragma omp parallel
{
    double z;
    #pragma omp parallel for
    for(int t=1; t<n; t++) {
        double phi_i[N];
        double obs_j_seq_t[N];

        for(int i=0; i<N; i++) {
            for(int j=0; j<N; j++) {
                z=phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t]; 
                #pragma omp critical
                { 
                    B[i][j] += z;
                    s2[i] += z;
                }
            }
        }
    }
}

Upvotes: 3

Views: 1305

Answers (1)

Gilles

Reputation: 9499

Your code exposes a few issues, each of which is a potential killer for its performance and/or validity:

  • You start by using a #pragma omp parallel and then add a #pragma omp parallel for. That means you are trying to generate nested parallelism (a parallel region within another parallel region). This is, first, a bad idea and, second, disabled by default. Your second parallel directive is therefore ignored: the work in your loop never gets distributed and is executed in full by every thread spawned by your initial parallel directive. As a result, you have race conditions on the writing of the results into B and s2 by all the threads at once. Adding a critical section hides the symptom, but fundamentally the code is wrong (the small check after this list illustrates the point).
  • Even without this initial parallel directive, or with nested parallelism enabled, your code would still have been wrong for the following reasons:
    • Your z variable is shared across the threads of the second parallel region, and since it is modified by all of them, its value is undefined as soon as more than one thread runs in the region.
    • Even more fundamentally, you try to parallelize the loop over t, but the results are indexed over i. That means all threads compete to update the same indexes, leading once more to race conditions and invalid results. You could again use a critical directive to address that, but it would only make the code extremely slow. You are better off parallelizing the loop over i (and possibly swapping the loops over t and i to make the latter the outermost one).
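
As a quick illustration of the first point, here is a minimal standalone check (not part of the original code, and assuming nested parallelism is left at its default, i.e. disabled): every outer thread reports an inner team of size 1 and runs the whole loop by itself, which is exactly why the loop never gets distributed.

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        int outer = omp_get_thread_num();   // rank in the outer team
        #pragma omp parallel for            // ignored while nesting is disabled
        for (int t = 0; t < 3; t++) {
            // Each outer thread prints all three iterations with an inner
            // team size of 1: the loop is not distributed at all.
            printf("outer thread %d runs t=%d, inner team size %d\n",
                   outer, t, omp_get_num_threads());
        }
    }
    return 0;
}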

Your code could become something like this (not tested):

#pragma omp parallel for
for(int i=0; i<N; i++) {
    for(int t=1; t<n; t++) {
        double phi_i[N];       // I guess these need some initialization
        double obs_j_seq_t[N]; // Idem
        for(int j=0; j<N; j++) {
            double z=phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t]; 
            B[i][j] += z;
            s2[i] += z;
        }
    }
}
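
If you want literally what the question title asks for (one private array per thread, merged once at the end), you can also keep the loop over t as the parallel one. The sketch below is not tested either; it reuses the variables from the question (B, s2, trans, beta, c, n, N), assumes phi_i and obs_j_seq_t get the same initialization as in the original code, and needs <stdlib.h> for calloc/free:

#pragma omp parallel
{
    // One private, zero-initialized copy of the accumulators per thread.
    double (*B_loc)[N] = calloc(N, sizeof *B_loc);
    double *s2_loc     = calloc(N, sizeof *s2_loc);

    #pragma omp for
    for(int t=1; t<n; t++) {
        double phi_i[N];       // initialized as in the original code
        double obs_j_seq_t[N]; // idem
        for(int i=0; i<N; i++) {
            for(int j=0; j<N; j++) {
                double z = phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
                B_loc[i][j] += z;
                s2_loc[i]   += z;
            }
        }
    }

    // Merge this thread's partial sums into the shared arrays exactly once.
    #pragma omp critical
    {
        for(int i=0; i<N; i++) {
            s2[i] += s2_loc[i];
            for(int j=0; j<N; j++)
                B[i][j] += B_loc[i][j];
        }
    }

    free(B_loc);
    free(s2_loc);
}

With a compiler that supports OpenMP 4.5, the runtime can create and merge the private copies for you through an array-section reduction, e.g. reduction(+ : s2[0:N]) for s2, and similarly for B provided it is one contiguous block rather than an array of pointers.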

Upvotes: 2
