datapanda

Reputation: 507

Vectorisation of a for-loop with a data dependency

I have an implementation of a BiCCG (Conjugate Gradient) based matrix solver which also accounts for periodicity. The implementation is compute intensive, and the loop is not auto-vectorised because of the dependency problem. I explored a bit, and it seems a red-black Gauss-Seidel algorithm is more efficiently parallelisable than the vanilla version (which has a similar dependency problem).

Can this loop/algorithm be changed such that it can be efficiently vectorised?

    // FORWARD
    #pragma omp for schedule(static, nx/NTt)
    for (i = 1; i <= nx; i++)
        for (j = 1; j <= ny; j++)
            for (k = 1; k <= nz; k++)
            {
                dummy = res_sparse_s[i][j][k];

                dummy -= COEFF[i][j][k][7] * RLL[i-1][j][k];
                if (PeriodicBoundaryX && i == nx) dummy -= COEFF[i][j][k][8] * RLL[1][j][k];

                dummy -= COEFF[i][j][k][2] * RLL[i][j-1][k];
                if (PeriodicBoundaryY && j == ny) dummy -= COEFF[i][j][k][3] * RLL[i][1][k];

                dummy -= COEFF[i][j][k][4] * RLL[i][j][k-1];
                if (PeriodicBoundaryZ && k == nz) dummy -= COEFF[i][j][k][5] * RLL[i][j][1];

                RLL[i][j][k] = dummy / h_sparse_s[i][j][k];
            }

P.S. For any iteration (i, j, k), RLL incorporates the updated RLL at i-1, j-1 and k-1 through the variable dummy. Also, right now the loop is only being broken down in the x direction using the directive schedule(static, nx/NTt), where NTt is just a macro for the available number of threads. Can it be broken down in all directions using the collapse directive?

------- MAJOR EDIT -------------------------- Following Ajay's answer, here is a minimal working example

#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#include<omp.h>

typedef double lr;

#define nx 4
#define ny 4
#define nz 4

void
print3dmatrix(double a[nx+2][ny+2][nz+2])
{
    for(int i=1; i<= nx; i++) {
        for(int j=1; j<= ny; j++) {
            for(int k=1; k<= nz; k++) {
                printf("%f ", a[i][j][k]);
            }
            printf("\n");
        }
        printf("\n");
    }
}

int 
main()
{

    double a[nx+2][ny+2][nz+2];
    double b[nx+2][ny+2][nz+2];

    srand(3461833726);


    // matrix filling 
    // b is just a copy of a
    for(int i=0; i< nx+2; i++) for(int j=0; j< ny+2; j++) for(int k=0; k< nz+2; k++)
    {
        a[i][j][k] = rand() % 5;
        b[i][j][k] = a[i][j][k];
    }

    // loop 1
    //#pragma omp parallel for num_threads(1)
    for(int i=1; i<= nx; i++) for(int j=1; j<= ny; j++) for(int k=1; k<= nz; k++)
    {
        a[i][j][k] = -1*a[i-1][j][k] - 1*a[i][j-1][k] -1 * a[i][j][k-1] + 4 * a[i][j][k];
    }

    print3dmatrix(a);
    printf("******************************\n");

    // loop 2
    //#pragma omp parallel for num_threads(1)
    for(int i=1; i<= nx; i++) 
        for(int j=1; j<= ny; j++)
            // #pragma omp simd
            for(int m=j+1; m<= j+nz; m++)
            {
                b[i][j][m-j] = -1*b[i-1][j][m-j] - 1*b[i][j-1][m-j] -1 * b[i][j][m-j-1] + 4 * b[i][j][m-j];
            }

    print3dmatrix(b);
    printf("=========================\n");

    return 0;
}

Key observations:

  1. Matrix a is filled with random numbers between 0 and 4; loop 1 is the untransformed original loop, whereas loop 2 is the transformed loop.
  2. The transformed loop has been skewed so as to remove the dependency.
  3. After the operation, matrices a and b are the same if run without the OpenMP parallelisation.
  4. If OpenMP is deployed, the answers change (maybe because of race conditions) [the loop is not parallelisable irrespective of where the pragma is placed].
  5. If #pragma omp simd is used to enforce vectorisation of the innermost loop, it fails.

Upvotes: 3

Views: 156

Answers (1)

Ajay Brahmakshatriya

Reputation: 9203

This is a classic problem of loop-carried dependences. Every iteration of yours depends on some other iterations (to have finished), and thus the only way it can be scheduled is serially.

But that is just because of how your loop is written.

You mention that R[i][j][k] depends on the calculation of R[i-1][j][k], R[i][j-1][k], and R[i][j][k-1]. I see three dependences here -

  1. [1, 0, 0]
  2. [0, 1, 0]
  3. [0, 0, 1]

I hope this representation is intuitive.

For your present scenario, dependences 1) and 2) are not an issue, because their k component is 0 (the 1 is in the i/j component), which means the iteration does not depend on previous iterations of k through these two dependences.

The problem is 3). Since there is a 1 in k, every iteration depends on its previous iteration. If we were somehow able to bring a number > 0 into i/j, we would be done. A loop skewing transformation lets us do exactly that.

A 3D example is slightly difficult to understand, so let's look at a 2D example with i and j.

Suppose - R[i][j] depends on R[i-1][j] and R[i][j-1]. We have the same problem.

If we have to represent this in a picture it looks like this -

. <- . <- .
     |    |
     v    v
. <- . <- .
     |    |
     v    v
.    .    .

In this picture, every point represents an iteration (i, j), and the arrows originating from each point point to the iterations it depends on. It is clear why we cannot parallelise the innermost loop here.

But suppose we did the skewing as -

        .
       /|   
      / |
    .   .
   /|  /|
  / | / |
.   .   . 
   /|    
  / |  
.   .     


. 

If you draw the same arrows as in the picture above (I cannot draw diagonal arrows in the ASCII art), you will see that all the arrows point downwards, i.e. they go at least one iteration down, which means you can parallelise the horizontal loop.

Now say your new loop dimensions are y (outer loop) and x (inner loop). Your original variables i and j are then recovered as

j = x and i = y - x

(the new outer index is y = i + j and the new inner index is x = j).

Your loop body thus becomes -

for ( y = 0; y <= i_max + j_max - 2; y++)
    for ( x = max(0, y - i_max + 1); x <= min(y, j_max - 1); x++)
        R_dash[y][x] = R_dash[y-1][x-1] + R_dash[y-1][x];

where R_dash is the skewed domain and has a one-to-one mapping to R.

You will see that both R_dash[y-1][x-1] and R_dash[y-1][x] will have been computed in some previous iteration of y. Hence you can completely parallelise the x loop.

The transformation applied here is

(i, j) -> (i + j, j)

i.e. the outer index becomes y = i + j while the inner index stays x = j.

You can similarly work it out for 3 dimensions.

For further understanding of how affine transformations work and how they can be used to introduce parallelism, you can see these lecture notes.

Upvotes: 2
