Reputation: 63
I have a requirement to calculate z
values, push them into arrays B
and s2
.
I tried to parallelize the processing using omp parallel for
.
One problem I see is, If I don't put B[i][j] += z
and s2[i] += z
statements in critical
section, I see lot of NaN
values being generated.
Just wondering if there is a way to write the z
values to separate arrays (one array per thread) and merge them at the end.
Any help is greatly appreciated.
#pragma omp parallel
{
double z;
#pragma omp parallel for
for(int t=1; t<n; t++) {
double phi_i[N];
double obs_j_seq_t[N];
for(int i=0; i<N; i++) {
for(int j=0; j<N; j++) {
z=phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
#pragma omp critical
{
B[i][j] += z;
s2[i] += z;
}
}
}
}
}
Upvotes: 3
Views: 1305
Reputation: 9499
Your code exposes a few issues, each being a potential killer for its performance and / or validity:
#pragma omp parallel
and then you add a #pragma omp parallel for
. That means that you are trying to generate nested parallelism (a parallel region within another parallel region). This is first, a bad idea and second, disabled by default. Therefore, your second parallel
directive is ignored and the work on your loop never gets distributed and is executed in full by all the threads you spawned with your initial parallel
directive. Therefore, you have race conditions on the writing of the results in B
and s2
by all the threads at once. You solve the issue by adding a critical
section, but fundamentally, the code is wrong.parallel
directive or with nested parallelism enabled, your code would have been wrong for the following reasons:
z
variable is shared
across the threads of the second parallel
region and since it is modified by all of them, its value is undefined as soon as more than one thread is spawned in the region.t
, but the solutions are indexed over i
. That means that all threads will compete for updating the same indexes, leading once more to race conditions and invalid results. You could again use a critical
directive to address that, but that would only make the code super slow. You'd better be parallelizing the loop over i
(while possibly swapping the loops over t
and i
to put the latter the outermost one).Your code could become something like this (not tested):
#pragma omp parallel for
for(int i=0; i<N; i++) {
for(int t=1; t<n; t++) {
double phi_i[N]; // I guess these need some initialization
double obs_j_seq_t[N]; // Idem
for(int j=0; j<N; j++) {
double z=phi_i[i]*trans[i*N + j]*obs_j_seq_t[j]*beta[t*N+j]/c[t];
B[i][j] += z;
s2[i] += z;
}
}
}
Upvotes: 2