Reputation: 957
I am new to OpenMP and I am trying to parallelize the following code using OpenMP:
#pragma omp parallel for
for(int k=0;k<m;k++)
{
for(int j=n-1;j>=0;j--)
{
outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];
for(int i=0;i<j;i++)
{
inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
}
}
}
Parallelizing the outer loop is pretty straightforward, but to optimize it further I wanted to parallelize the innermost loop (the one iterating over i) as well. But when I try to do that like this:
#pragma omp parallel for
for(int i=0;i<j;i++)
{
inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
}
the compiler does not vectorize the inner loop (it reports "loop versioned for vectorization because of possible aliasing"), which makes it run slower. I compiled it using gcc -ffast-math -std=c++11 -fopenmp -O3 -msse2 -funroll-loops -g -fopt-info-vec prog.cpp
Thanks for any advice!
EDIT: I am using the __restrict keyword for the arrays.
EDIT2: Interestingly, when I keep only the pragma on the inner loop and remove it from the outer one, gcc vectorizes it. So the problem only happens when I try to parallelize both loops.
EDIT3: The compiler will vectorize the loop when I use #pragma omp parallel for simd. But it's still slower than without parallelizing the inner loop at all.
Upvotes: 8
Views: 763
Reputation: 957
Thanks everyone for the answers. I managed to vectorize the inner loop using #pragma omp parallel for simd, but the program was slower than without parallelizing the inner loop. I eventually found a slightly different algorithm to solve the problem, which is much faster.
Thanks for your help guys!
Upvotes: 1
Reputation: 724
Have you tried getting the inner loop vectorized first, and then adding the parallel part (which might result in slower performance, depending on cache misses)?
#pragma omp parallel for
for(int k=0;k<m;k++)
{
for(int j=n-1;j>=0;j--)
{
outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];
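const int Q1 = k*n;     // start of column k in inB2
const int Q2 = n*j;     // start of column j in inA2
const int Q3 = m*j + k; // index of the value just written to outX
// declare simd applies to functions; for a loop the directive is omp simd
#pragma omp simd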
for(int i=0;i<j;i++)
{
inB2[Q1+i] -= inA2[Q2+i] * outX[Q3];
}
}
}
It always takes me some time to get the #pragma clauses right (shared, private, etc.), and I did not test this.
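For reference, here is a fuller, self-contained sketch of the same idea: threads on the outer loop over k, and only a simd hint on the inner loop over i. The function name solve_upper and its signature are my own assumptions, not something from the question. The indexing is copied from the original code, and I have only checked it on a tiny example.

```cpp
#include <cassert>

// Hypothetical helper (name and signature are assumptions for illustration).
// Uses the same index layout as the question:
//   inA2: n*n values, element (i, j) at i + n*j
//   inB2: right-hand side k starts at offset k*n (overwritten during the solve)
//   outX: solution value for row j, right-hand side k at k + j*m
void solve_upper(const double* __restrict inA2, double* __restrict inB2,
                 double* __restrict outX, int n, int m)
{
    assert(n > 0 && m > 0);
    // Each k touches its own slice of inB2 and its own entries of outX,
    // so the outer iterations are independent and can run on separate threads.
    #pragma omp parallel for
    for (int k = 0; k < m; k++) {
        double* b = inB2 + k * n;
        for (int j = n - 1; j >= 0; j--) {
            const double x = b[j] / inA2[j * n + j];
            outX[k + j * m] = x;
            const double* a = inA2 + n * j;
            // Ask for vectorization only; no extra threads are created here.
            #pragma omp simd
            for (int i = 0; i < j; i++) {
                b[i] -= a[i] * x;
            }
        }
    }
}
```

Hoisting the value x and the base pointers a and b out of the inner loop leaves a plain b[i] -= a[i] * x body, which tends to be easier for the vectorizer to analyze. Whether this beats the original still depends on n, m and the machine.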
Upvotes: 1
Reputation: 8238
My guess is that after you parallelized the inner loop, your compiler lost track of inA2, inB2 and outX. By default, it assumes that the memory regions pointed to by any two pointers may overlap. In C, the C99 standard introduced the restrict keyword, which tells the compiler that the memory a pointer refers to is not accessed through any other pointer. C++ has no such keyword but, fortunately, g++ has an appropriate extension. So try adding __restrict to the declarations of the pointers touched by the loop. For example, replace
double* outX;
with
double* __restrict outX;
Upvotes: 1