Honza Dejdar

Reputation: 957

OpenMP parallelization inhibits vectorization

I am new to OpenMP and I am trying to parallelize the following code using OpenMP:

#pragma omp parallel for
for(int k=0; k<m; k++)
{
   for(int j=n-1; j>=0; j--)
   {
       outX[k + j*m] = inB2[j + n*k] / inA2[j*n + j];

       for(int i=0; i<j; i++)
       {
           inB2[k*n + i] -= inA2[i + n*j] * outX[k + m*j];
       }
   }
}

Parallelizing the outer loop is pretty straightforward, but to optimize further I also wanted to parallelize the innermost loop (the one iterating over i). But when I try to do that like this:

#pragma omp parallel for
for(int i=0; i<j; i++)
{
    inB2[k*n + i] -= inA2[i + n*j] * outX[k + m*j];
}

the compiler does not vectorize the inner loop ("loop versioned for vectorization because of possible aliasing"), which makes it run slower. I compiled it with:

gcc -ffast-math -std=c++11 -fopenmp -O3 -msse2 -funroll-loops -g -fopt-info-vec prog.cpp

Thanks for any advice!

EDIT: I am using the __restrict keyword for the arrays.

EDIT2: Interestingly, when I keep the pragma only on the inner loop and remove it from the outer one, gcc will vectorize it. So the problem only occurs when I try to parallelize both loops.

EDIT3: The compiler will vectorize the loop when I use #pragma omp parallel for simd, but it is still slower than not parallelizing the inner loop at all.
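For clarity, the EDIT3 variant has roughly this structure (a sketch, not my exact code): the outer loop keeps its own parallel for and the inner loop gets the combined construct.

#pragma omp parallel for
for(int k=0; k<m; k++)
{
   for(int j=n-1; j>=0; j--)
   {
       outX[k + j*m] = inB2[j + n*k] / inA2[j*n + j];

       // combined construct from EDIT3: gcc vectorizes this,
       // but it is still slower than leaving the inner loop alone
       #pragma omp parallel for simd
       for(int i=0; i<j; i++)
       {
           inB2[k*n + i] -= inA2[i + n*j] * outX[k + m*j];
       }
   }
}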

Upvotes: 8

Views: 763

Answers (3)

Honza Dejdar

Reputation: 957

Thanks everyone for the answers. I managed to vectorize the inner loop using #pragma omp parallel for simd, but the program was slower than without parallelizing it. I eventually found a slightly different algorithm to solve the problem, which is much faster. Thanks for your help, guys!

Upvotes: 1

Holmz

Reputation: 724

Have you tried vectorizing the inner loop first, and then adding the parallel part (which might result in slower performance depending on cache misses)?

#pragma omp parallel for
for(int k=0; k<m; k++)
{
   for(int j=n-1; j>=0; j--)
   {
       outX[k + j*m] = inB2[j + n*k] / inA2[j*n + j];

       // hoist the loop-invariant index arithmetic out of the inner loop
       const int Q1 = k*n;
       const int Q2 = n*j;
       const int Q3 = m*j + k;

       #pragma omp simd
       for(int i=0; i<j; i++)
       {
           inB2[Q1 + i] -= inA2[Q2 + i] * outX[Q3];
       }
   }
}

It always takes me some time to get the #pragma clauses right (shared, private, etc.), and I did not test this.

Upvotes: 1

Sergey

Reputation: 8238

My guess is that after you parallelized the inner loop, your compiler lost track of inA2, inB2 and outX. By default, it assumes that the regions of memory pointed to by any two pointers may overlap. In C, the C99 standard introduced the restrict keyword, which informs the compiler that a pointer points to a memory block that no other pointer points to. C++ does not have such a keyword, but, fortunately, g++ has an appropriate extension. So try adding __restrict to the declarations of the pointers touched by the loop. For example, replace

double* outX;

with

double* __restrict outX;
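As a fuller illustration, here is a minimal sketch of the loop from the question with all three pointers restrict-qualified; the function name and signature are just illustrative:

// Sketch: __restrict promises g++ that the three arrays do not overlap,
// which removes the aliasing check that blocked vectorization.
void solve_block(double* __restrict inA2,
                 double* __restrict inB2,
                 double* __restrict outX,
                 int n, int m)
{
    #pragma omp parallel for
    for(int k=0; k<m; k++)
    {
        for(int j=n-1; j>=0; j--)
        {
            outX[k + j*m] = inB2[j + n*k] / inA2[j*n + j];

            for(int i=0; i<j; i++)
            {
                inB2[k*n + i] -= inA2[i + n*j] * outX[k + m*j];
            }
        }
    }
}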

Upvotes: 1
