Nicolas Venkovic
Nicolas Venkovic

Reputation: 5

How can I fix this problem I have with OpenMP's built-in vector reduction?

I am developing my own implementation of sparse BLAS functions for CSC storage formats. To do so, I created the following data structure:

typedef struct SparseMatrixCSC {
  int m;            // Number of rows
  int n;            // Number of columns
  int nnz;          // Number of stored values
  double *val;      // Stored values
  int *row_idx;     // Row indices of stored values
  int *col_start;   // Column j contains values with indices in col_start[j]:(col_start[j+1]-1)
} SparseMatrixCSC;

Then, I wanted to use OpenMP in order to parallelize the matrix-vector product (SpMV). I used different approaches to circumvent race conditions.

First, I used atomic operations as follows:

void dcscmv_atomic(SparseMatrixCSC *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  #pragma omp parallel for
  for (int j=0; j<A->n; j++) {
    for (int ii=A->col_start[j]; ii<A->col_start[j+1]; ii++) {
      #pragma omp atomic
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}

This works fine, but it is terribly slow and actually rather yields slowdown than speedup.

Second, I tried to use OpenMP's built-in vector reduction feature as follows:

void dcscmv_builtin_array_reduction(SparseMatrixCSC *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  #pragma omp parallel for reduction(+:y[:A->m])
  for (int j=0; j<A->n; j++) {
    for (int ii=A->col_start[j]; ii<A->col_start[j+1]; ii++) {
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}

While this code compiles correctly, it works correctly with one single thread, but it leads to a segmentation fault when using multiple threads.

Third, since I could not get OpenMP's built-in vector reduction to work, I tried coding my own reduction as follows:

void dcscmv_array_reduction_from_scratch(SparseMatrixCSC *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  double *YP;
  #pragma omp parallel 
  {
    int P = omp_get_num_threads();
    int p = omp_get_thread_num();
    #pragma omp single
    {
      YP = (double*)mkl_malloc(A->m * P * sizeof(double), sizeof(double));
      for (int i=0; i<A->m*P; i++) YP[i] = 0.;
    }
    #pragma omp for
    for (int j=0; j<A->n; j++) {
      for (int ii=A->col_start[j]; ii<A->col_start[j+1]; ii++) {
        YP[p * A->m + A->row_idx[ii]] += A->val[ii] * x[j];
      }
    }
    #pragma omp for
    for (int i=0; i<A->m; i++) {
      for (int p=0; p<P; p++) {
        y[i] +=  YP[A->m * p + i];
      }
    }
  }
  mkl_free(YP);
}

This function worked, but still gave me a slowdown, although not as bad as with dcscmv_atomic.

I still have hope that if I get dcscmv_builtin_vector_reduction to work, I might be able to get some speedup. Hence, my question is: what is wrong with the way I implemented the vector reduction in dcscmv_builtin_vector_reduction, and how can I get rid of this segmentation fault?

I tried to apply OpenMP's buit-in vector reduction feature with #pragma omp parallel for reduction(+:y[:A->m]), but although the code compiled, it results in a segmentation fault at execution with multiple threads.

Upvotes: 0

Views: 51

Answers (1)

Victor Eijkhout
Victor Eijkhout

Reputation: 5810

Your problem is with the indirect addressing which makes the loop not trivially parallel. As you have noticed, simple solutions don't work or are slow. Two ways out.

  1. You can give each thread its own output vector and then merge them in the end.
  2. Instead of a column-based algorithm, use a row-based one. Then each component of the output is independently computable and therefore parallel. Yes, I know that you are storing your matrix by column, but even then it is possible. You can probably find an algorithm for the Transpose product with a CRS matrix. That's basically it.

Upvotes: 0

Related Questions