CIVI89
CIVI89

Reputation: 91

OpenMP reduction synchronization error

I'm trying to parallelize a loop with interdependent cycles, I've tried with a reduction and the code work, but the result is wrong, I think that the reduction works for the sum but not for the update of the array in the right cycle, there is a way to obtain the right result parallelizing the loop?

#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < DATA_MAG; i++)
{
    sum += H[i];

    LUT[i] = sum * scale_factor;
}

Upvotes: 3

Views: 639

Answers (1)

Z boson
Z boson

Reputation: 33679

The reduction clause creates private copies of sum for each thread in the team as if the private clause had been used on sum. After the for loop the result of each private copy is combined with the original shared value of sum. Since the shared sum is only updated after the for loop you cannot rely on it inside of the for loop.

In this case you need to do a prefix sum. Unfortunately the parallel prefix-sum using threads is memory bandwidth bound for large DATA_MAG and it's dominated by the OpenMP overhead for small DATA_MAG. However, there may be some sweet spot in between small and large where you seen some benefit using threads.

But in your case DATA_MAG is only 256 which is very small and would not benefit from OpenMP anyway. What you can do is use SIMD. However, OpenMP's simd construction is not powerful enough for the prefix sum as far as I know. However, you can do this by hand suing intrinsics like this

#include <stdio.h>
#include <x86intrin.h>

#define N 256

inline __m128 scan_SSE(__m128 x) {
  x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
  x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
  return x;
}

void prefix_sum_SSE(float *a, float *s, int n, float scale_factor) {
  __m128 offset = _mm_setzero_ps();
  __m128 f = _mm_set1_ps(scale_factor);
  for (int i = 0; i < n; i+=4) {
    __m128 x = _mm_loadu_ps(&a[i]);
    __m128 out = scan_SSE(x);
    out = _mm_add_ps(out, offset);
    offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    out = _mm_mul_ps(out, f); 
    _mm_storeu_ps(&s[i], out);
  }
}

int main(void) {
  float H[N], LUT[N];
  for(int i=0; i<N; i++) H[i] = i;
  prefix_sum_SSE(H, LUT, N, 3.14159f);
  for(int i=0; i<N; i++) printf("%.1f ", LUT[i]); puts("");
  for(int i=0; i<N; i++) printf("%.1f ", 3.14159f*i*(i+1)/2); puts("");

}

See here for more details about SIMD pre-fix sum with SSE and AVX.

Upvotes: 2

Related Questions