Reputation: 91
I'm trying to parallelize a loop with interdependent cycles, I've tried with a reduction and the code work, but the result is wrong, I think that the reduction works for the sum but not for the update of the array in the right cycle, there is a way to obtain the right result parallelizing the loop?
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < DATA_MAG; i++)
{
sum += H[i];
LUT[i] = sum * scale_factor;
}
Upvotes: 3
Views: 639
Reputation: 33679
The reduction clause creates private copies of sum
for each thread in the team as if the private clause had been used on sum
. After the for loop the result of each private copy is combined with the original shared value of sum
. Since the shared sum
is only updated after the for loop you cannot rely on it inside of the for loop.
In this case you need to do a prefix sum. Unfortunately the parallel prefix-sum using threads is memory bandwidth bound for large DATA_MAG
and it's dominated by the OpenMP overhead for small DATA_MAG
. However, there may be some sweet spot in between small and large where you seen some benefit using threads.
But in your case DATA_MAG
is only 256 which is very small and would not benefit from OpenMP anyway. What you can do is use SIMD. However, OpenMP's simd
construction is not powerful enough for the prefix sum as far as I know. However, you can do this by hand suing intrinsics like this
#include <stdio.h>
#include <x86intrin.h>
#define N 256
inline __m128 scan_SSE(__m128 x) {
x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
return x;
}
void prefix_sum_SSE(float *a, float *s, int n, float scale_factor) {
__m128 offset = _mm_setzero_ps();
__m128 f = _mm_set1_ps(scale_factor);
for (int i = 0; i < n; i+=4) {
__m128 x = _mm_loadu_ps(&a[i]);
__m128 out = scan_SSE(x);
out = _mm_add_ps(out, offset);
offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
out = _mm_mul_ps(out, f);
_mm_storeu_ps(&s[i], out);
}
}
int main(void) {
float H[N], LUT[N];
for(int i=0; i<N; i++) H[i] = i;
prefix_sum_SSE(H, LUT, N, 3.14159f);
for(int i=0; i<N; i++) printf("%.1f ", LUT[i]); puts("");
for(int i=0; i<N; i++) printf("%.1f ", 3.14159f*i*(i+1)/2); puts("");
}
See here for more details about SIMD pre-fix sum with SSE and AVX.
Upvotes: 2