datapanda

Reputation: 507

Vectorisation of a for-loop with a data dependency

I have an implementation of a BiCCG (Conjugate Gradient) based matrix solver which also accounts for periodicity. The implementation is compute intensive, and the loop is not auto-vectorised because of the dependency problem. I explored a bit, and it seems a red-black Gauss-Seidel algorithm is more efficiently parallelisable than the vanilla version (which has a similar dependency problem).

Can this loop/algorithm be changed such that it can be efficiently vectorised?

    // FORWARD
    #pragma omp for schedule(static, nx/NTt)
    for (i = 1; i <= nx; i++)
        for (j = 1; j <= ny; j++)
            for (k = 1; k <= nz; k++)
            {
                dummy = res_sparse_s[i][j][k];

                dummy -= COEFF[i][j][k][7] * RLL[i-1][j][k];
                if (PeriodicBoundaryX && i == nx) dummy -= COEFF[i][j][k][8] * RLL[1][j][k];

                dummy -= COEFF[i][j][k][2] * RLL[i][j-1][k];
                if (PeriodicBoundaryY && j == ny) dummy -= COEFF[i][j][k][3] * RLL[i][1][k];

                dummy -= COEFF[i][j][k][4] * RLL[i][j][k-1];
                if (PeriodicBoundaryZ && k == nz) dummy -= COEFF[i][j][k][5] * RLL[i][j][1];

                RLL[i][j][k] = dummy / h_sparse_s[i][j][k];
            }

P.S. For any iteration (i, j, k), RLL incorporates the updated RLL at i-1, j-1 and k-1 through the variable dummy. Also, right now the loop is only being broken down in the x direction using the directive schedule(static, nx/NTt), where NTt is just a macro for the available number of threads. Can it be broken down in all directions using the collapse directive?

------- MAJOR EDIT -------------------------- Following Ajay's answer, here is a minimal working example

#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#include<omp.h>

typedef double lr;

#define nx 4
#define ny 4
#define nz 4

void
print3dmatrix(double a[nx+2][ny+2][nz+2])
{
    for(int i=1; i<= nx; i++) {
        for(int j=1; j<= ny; j++) {
            for(int k=1; k<= nz; k++) {
                printf("%f ", a[i][j][k]);
            }
            printf("\n");
        }
        printf("\n");
    }
}

int 
main()
{

    double a[nx+2][ny+2][nz+2];
    double b[nx+2][ny+2][nz+2];

    srand(3461833726);


    // matrix filling 
    // b is just a copy of a
    for(int i=0; i< nx+2; i++) for(int j=0; j< ny+2; j++) for(int k=0; k< nz+2; k++)
    {
        a[i][j][k] = rand() % 5;
        b[i][j][k] = a[i][j][k];
    }

    // loop 1
    //#pragma omp parallel for num_threads(1)
    for(int i=1; i<= nx; i++) for(int j=1; j<= ny; j++) for(int k=1; k<= nz; k++)
    {
        a[i][j][k] = -1*a[i-1][j][k] - 1*a[i][j-1][k] -1 * a[i][j][k-1] + 4 * a[i][j][k];
    }

    print3dmatrix(a);
    printf("******************************\n");

    // loop 2
    //#pragma omp parallel for num_threads(1)
    for(int i=1; i<= nx; i++) 
        for(int j=1; j<= ny; j++)
            // #pragma omp simd
            for(int m=j+1; m<= j+nz; m++)
            {
                b[i][j][m-j] = -1*b[i-1][j][m-j] - 1*b[i][j-1][m-j] -1 * b[i][j][m-j-1] + 4 * b[i][j][m-j];
            }

    print3dmatrix(b);
    printf("=========================\n");

    return 0;
}

Key observations:

  1. Matrix a is filled with random numbers between 0 and 4; loop 1 is the untransformed original loop, whereas loop 2 is the transformed loop.
  2. The transformed loop has been skewed so as to remove the dependency.
  3. After the operation, matrices a and b are the same if run without the OpenMP parallelisation.
  4. If OpenMP is deployed, the answers change (maybe because of race conditions) [the loop is not parallelisable irrespective of where the pragma is placed].
  5. If #pragma omp simd is used to enforce vectorisation of the innermost loop, it fails.

Upvotes: 3

Views: 156

Answers (1)

Ajay Brahmakshatriya

Reputation: 9203

This is a classic problem of loop-carried dependences. Every iteration of yours depends on some other iterations (to have finished), and thus the only way it can be scheduled is serially.

But that is just because of how your loop is written.

You mention that R[i][j][k] depends on the calculation of R[i-1][j][k], R[i][j-1][k], and R[i][j][k-1]. I see three dependences here -

  1. [1, 0, 0]
  2. [0, 1, 0]
  3. [0, 0, 1]

I hope this representation is intuitive.

For your present scenario, dependences 1) and 2) are not an issue, because their k component is 0 (the 1 is in the i/j component), which means the iteration does not depend on previous iterations of k through these two dependences.

The problem is 3). Since there is a 1 in k, every iteration depends on its previous iteration. If we were somehow able to bring a number > 0 into i/j, we would be done. A loop skewing transformation lets us do exactly that.

A 3D example is slightly difficult to understand, so let's look at a 2D example with i and j.

Suppose - R[i][j] depends on R[i-1][j] and R[i][j-1]. We have the same problem.

If we have to represent this in a picture it looks like this -

. <- . <- .
     |    |
     v    v
. <- . <- .
     |    |
     v    v
.    .    .

In this picture, every point represents an iteration (i, j), and the arrows originating from each point point to the iterations it depends on. It is clear why we cannot parallelise the innermost loop here.

But suppose we did the skewing as -

        .
       /|   
      / |
    .   .
   /|  /|
  / | / |
.   .   . 
   /|    
  / |  
.   .     


. 

If you draw the same arrows as in the picture above (I cannot draw diagonal arrows in the ASCII art), you will see that all the arrows point downwards, i.e. they go at least one iteration down, which means you can parallelise the horizontal loop.

Now say your new loop dimensions are y (outer loop) and x (inner loop). Your original variables i and j are then recovered as

j = x and i = y - x

(the new outer index is y = i + j and the new inner index is x = j).

Your loop body thus becomes -

for ( y = 0; y <= i_max + j_max - 2; y++)
    for ( x = max(0, y - i_max + 1); x <= min(y, j_max - 1); x++)
        R_dash[y][x] = R_dash[y-1][x-1] + R_dash[y-1][x];

where R_dash is the skewed domain and has a one-to-one mapping to R.

You will see that both R_dash[y-1][x-1] and R_dash[y-1][x] will have been computed in some previous iteration of y. Hence you can completely parallelise the x loop.

The transformation applied here is

(i, j) -> (i + j, j)

i.e. the outer index becomes y = i + j while the inner index stays x = j.

You can similarly work it out for 3 dimensions.

For further understanding of how affine transformations work and how they can be used to introduce parallelism, you can see these lecture notes.

Upvotes: 2
