Rahul Mayuranath

Reputation: 1

openmp - Parallel Vector Matrix Product

I am computing a vector-matrix product (not exactly a product; a slight variation that computes shortest distances) using the outer product method, where the matrix is sparse and stored in CSR. I am new to parallel programming and essentially trying to understand the difference between using a parallel for with a critical section for the update vs. using tasks and doing a reduction. Which is the better approach, and why?

Note: This function call is enclosed within an omp parallel and an omp single.
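For context, a minimal sketch of that enclosing region (the exact driver code around the call is an assumption on my part; only the parallel/single structure is stated above):

#pragma omp parallel
{
    #pragma omp single
    {
        // One thread creates the tasks; the whole team executes them
        tReq = matrixVectorHadamard(A, T, tB, tReq, size, C);
    }
}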

Using the parallel for approach, I am doing this:

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq) {
    initialize_T(tReq);
    int index;
    #pragma omp parallel for schedule(static, BLOCK_SIZE)
    for(int i=0;i<N;i++) {
        int num_edges = A->row_ptr[i+1] - A->row_ptr[i];
        index = 0;
        if(num_edges) {
            if(T[i] != INFINITY && tB[i] != INFINITY) {
                for(int j=0;j<num_edges;j++) {
                    index = A->col_ind[A->row_ptr[i] + j];
                    #pragma omp critical 
                    tReq[index] = min(tReq[index], T[i]+A->val[A->row_ptr[i]+j]);      
                }
            }
        }
    }
    return tReq;
}

Using the task-based approach with a reduction, this is essentially my idea:

int size = N/BLOCK_SIZE + 1;
double C[size][N];

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq, int size, double C[][N]) {

    // Initialize the per-task blocks and the result to the identity of min
    for(int i=0;i<size;i++) {
        for(int j=0;j<N;j++) {
            C[i][j] = INFINITY;
            tReq[j] = INFINITY;
        }
    }

    for(int k=0;k<size;k++) {
        #pragma omp task firstprivate(k) depend(inout: C[k])
        {
            // The last block may be shorter than BLOCK_SIZE
            int delim = (k==size-1) ? N - k*BLOCK_SIZE : BLOCK_SIZE;
            for(int i=0;i<delim;i++) {
                int row = k*BLOCK_SIZE + i;
                int num_edges = A->row_ptr[row+1] - A->row_ptr[row];
                if(num_edges) {
                    if(T[row] != INFINITY && tB[row] != INFINITY) {
                        for(int j=0;j<num_edges;j++) {
                            int index = A->col_ind[A->row_ptr[row] + j];
                            C[k][index] = min(C[k][index], T[row]+A->val[A->row_ptr[row]+j]);
                        }
                    }
                }
            }
        }
    }

    #pragma omp taskwait

    // Reduce the per-task rows of C into tReq
    for(int i=0; i<N; i++) {
        double minimum = INFINITY;
        for(int j=0;j<size;j++) {
            if(C[j][i] < minimum) {
                minimum = C[j][i];
            }
        }
        tReq[i] = minimum;
    }

    return tReq;
}

Essentially, are there any downsides to using parallel for compared to the task-based approach?

Upvotes: 0

Views: 440

Answers (1)

Zulan

Reputation: 22670

You are right that you have basically two options: protect the data updates, or use thread-specific copies. However, you can do much better with each option:

When going with protected updates, you should protect only what is absolutely necessary, and only when necessary. You can use an initial atomic read to avoid the critical region most of the time, similar to a double-checked locking pattern.

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq) {
    initialize_T(tReq);
    #pragma omp parallel for schedule(static, BLOCK_SIZE)
    for(int i=0;i<N;i++) {
        int num_edges = A->row_ptr[i+1] - A->row_ptr[i];
        if (num_edges) {
            if(T[i] != INFINITY && tB[i] != INFINITY) {
                for(int j=0;j<num_edges;j++) {
                    // !WARNING! You MUST declare index within the parallel region
                    // or explicitly declare it private to avoid data races!
                    int index = A->col_ind[A->row_ptr[i] + j];
                    double tmp = T[i] + A->val[A->row_ptr[i]+j];
                    double old;
                    #pragma omp atomic read
                    old = tReq[index];
                    if (tmp < old) {
                        #pragma omp critical
                        {
                            tmp = min(tReq[index], tmp);
                            // Another atomic ensures that the earlier read
                            // outside of critical works correctly
                            #pragma omp atomic write
                            tReq[index] = tmp;
                        }
                    }
                }
            }
        }
    }
    return tReq;
}

Note: Unfortunately, OpenMP/C does not support a direct atomic minimum.
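As an aside, newer compilers change this picture: OpenMP 5.1 added an atomic compare construct that can express the conditional update directly. A minimal sketch, assuming a 5.1-capable compiler (which may not be available to you):

// OpenMP 5.1 or later only: atomic conditional minimum, no critical section
double tmp = T[i] + A->val[A->row_ptr[i]+j];
#pragma omp atomic compare
if (tmp < tReq[index]) { tReq[index] = tmp; }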

The alternative is a reduction, which is supported directly by the standard itself, so there is no need to reinvent the work-sharing. Note that reducing over an array in C requires an array section and OpenMP 4.5 or later. You can simply do the following:

double *matrixVectorHadamard(CSR *A, double *T, double *tB, double *tReq) {
    initialize_T(tReq);
    // tReq is a pointer, so the reduction needs an array section (OpenMP 4.5+)
    #pragma omp parallel for schedule(static, BLOCK_SIZE) reduction(min : tReq[:N])
    for(int i=0;i<N;i++) {
        int num_edges = A->row_ptr[i+1] - A->row_ptr[i];
        if (num_edges) {
            if(T[i] != INFINITY && tB[i] != INFINITY) {
                for(int j=0;j<num_edges;j++) {
                    // !WARNING! You MUST declare index within the parallel region
                    // or explicitly declare it private to avoid data races!
                    int index = A->col_ind[A->row_ptr[i] + j];
                    tReq[index] = min(tReq[index], T[i]+A->val[A->row_ptr[i]+j]);
                }
            }
        }
    }
    return tReq;
}

OpenMP will magically create thread-local copies of tReq and merge (reduce) them at the end.
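Conceptually, the reduction clause expands to something like this hand-written version (a sketch for illustration only; the name matrixVectorHadamardManual is hypothetical, and in practice you should prefer the reduction clause itself):

double *matrixVectorHadamardManual(CSR *A, double *T, double *tB, double *tReq) {
    initialize_T(tReq);
    #pragma omp parallel
    {
        double local[N];                  // private per-thread copy (VLA)
        for(int j=0;j<N;j++) {
            local[j] = INFINITY;          // identity element of min
        }
        #pragma omp for schedule(static, BLOCK_SIZE)
        for(int i=0;i<N;i++) {
            int num_edges = A->row_ptr[i+1] - A->row_ptr[i];
            if(num_edges && T[i] != INFINITY && tB[i] != INFINITY) {
                for(int j=0;j<num_edges;j++) {
                    int index = A->col_ind[A->row_ptr[i] + j];
                    local[index] = min(local[index], T[i]+A->val[A->row_ptr[i]+j]);
                }
            }
        }
        // Merge step: combine the per-thread copies into the shared result
        #pragma omp critical
        for(int j=0;j<N;j++) {
            tReq[j] = min(tReq[j], local[j]);
        }
    }
    return tReq;
}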

Which version is better for you depends on the size of the target array and the rate of writes. If you write often, the reduction will be beneficial because it is not slowed down by critical sections, atomics, or bad caching. If you have a huge target array or not so many update iterations, the first solution becomes more interesting, because of the relative overhead of creating and reducing the thread-local copies of the array.
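Since the trade-off depends on your data, the most reliable way to decide is to time both variants on a representative input. A minimal sketch of such a harness using omp_get_wtime() (the call shown matches the first variant's signature):

double t0 = omp_get_wtime();
tReq = matrixVectorHadamard(A, T, tB, tReq);
double t1 = omp_get_wtime();
printf("matrixVectorHadamard took %f seconds\n", t1 - t0);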

Upvotes: 1
