How can I fix this problem I have with OpenMP's built-in vector reduction?

Question

I am developing my own implementation of sparse BLAS functions for CSC storage formats. To do so, I created the following data structure:

typedef struct SparseMatrixCSC {
  int m;            // Number of rows
  int n;            // Number of columns
  int nnz;          // Number of stored values
  double *val;      // Stored values
  int *row_idx;     // Row indices of stored values
  int *col_start;   // Column j contains values with indices in col_start[j]:(col_start[j+1]-1)
} SparseMatrixCSC;
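To make the layout concrete, here is a minimal sketch that stores a small 3x3 matrix in this structure and computes a serial matrix-vector product for reference. The names `dcscmv_serial` and `example_3x3` are hypothetical helpers introduced only for illustration; they are not part of the original code.

```c
#include <stddef.h>

typedef struct SparseMatrixCSC {
  int m;            /* Number of rows */
  int n;            /* Number of columns */
  int nnz;          /* Number of stored values */
  double *val;      /* Stored values */
  int *row_idx;     /* Row indices of stored values */
  int *col_start;   /* Column j owns indices col_start[j] .. col_start[j+1]-1 */
} SparseMatrixCSC;

/* Hypothetical serial reference: y = A * x, one column at a time. */
void dcscmv_serial(const SparseMatrixCSC *A, const double *x, double *y) {
  for (int i = 0; i < A->m; i++) y[i] = 0.0;
  for (int j = 0; j < A->n; j++) {
    for (int ii = A->col_start[j]; ii < A->col_start[j + 1]; ii++) {
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}

/* Hypothetical example: A = [1 0 2; 0 3 0; 4 0 5], x = (1, 2, 3).
   Writes the 3 entries of A*x into y. */
void example_3x3(double *y) {
  static double val[] = {1.0, 4.0, 3.0, 2.0, 5.0};  /* column by column */
  static int row_idx[] = {0, 2, 1, 0, 2};
  static int col_start[] = {0, 2, 3, 5};             /* n + 1 entries */
  SparseMatrixCSC A = {3, 3, 5, val, row_idx, col_start};
  double x[] = {1.0, 2.0, 3.0};
  dcscmv_serial(&A, x, y);
}
```

Because each column scatters into arbitrary rows of y, two columns processed by different threads can update the same y[i], which is where the race conditions discussed next come from.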

Then, I wanted to use OpenMP in order to parallelize the matrix-vector product (SpMV). I used different approaches to circumvent race conditions.

First, I used atomic operations as follows:

void dcscmv_atomic(SparseMatrixCSC *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  #pragma omp parallel for
  for (int j=0; j<A->n; j++) {
    for (int ii=A->col_start[j]; ii<A->col_start[j+1]; ii++) {
      #pragma omp atomic
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}

This works, but it is very slow: it yields a slowdown rather than a speedup.

Second, I tried to use OpenMP's built-in vector reduction feature as follows:

void dcscmv_builtin_array_reduction(SparseMatrixCSC *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  #pragma omp parallel for reduction(+:y[:A->m])
  for (int j=0; j<A->n; j++) {
    for (int ii=A->col_start[j]; ii<A->col_start[j+1]; ii++) {
      y[A->row_idx[ii]] += A->val[ii] * x[j];
    }
  }
}

This code compiles and runs correctly with a single thread, but it causes a segmentation fault when run with multiple threads.
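For anyone trying to reproduce this, here is a self-contained sketch of the same array-section reduction that takes the CSC arrays as plain arguments. The function name `dcscmv_reduction_sketch` and its parameter list are hypothetical. One commonly reported explanation for this kind of crash is that many OpenMP runtimes place each thread's private copy of the array section on that thread's stack, so a large m can exceed the stack limit; this is an assumption about the cause, not something confirmed here. Under that assumption, the sketch should run on small inputs and only fail once m grows large.

```c
/* Hypothetical sketch: y = A * x with an OpenMP array-section reduction.
   Each thread accumulates into its own private copy of y[0..m-1];
   the runtime sums the copies into y at the end of the loop. */
void dcscmv_reduction_sketch(int m, int n,
                             const double *val, const int *row_idx,
                             const int *col_start,
                             const double *x, double *y) {
  for (int i = 0; i < m; i++) y[i] = 0.0;

  /* Assumption: the private copies may live on the thread stacks,
     so their size (m doubles per thread) matters. */
  #pragma omp parallel for reduction(+ : y[:m])
  for (int j = 0; j < n; j++) {
    for (int ii = col_start[j]; ii < col_start[j + 1]; ii++) {
      y[row_idx[ii]] += val[ii] * x[j];
    }
  }
}
```

If the stack-size explanation applies, raising the limits would be the first thing to try: the standard `OMP_STACKSIZE` environment variable controls the stack of threads created by the OpenMP runtime, while the initial thread's stack is governed by the shell limit (for example `ulimit -s` on Linux).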

Third, since I could not get OpenMP's built-in vector reduction to work, I tried coding my own reduction as follows:

void dcscmv_array_reduction_from_scratch(SparseMatrixCSC *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  double *YP;   // Per-thread partial results, thread p owns YP[p*m:(p+1)*m-1]
  #pragma omp parallel
  {
    int P = omp_get_num_threads();
    int p = omp_get_thread_num();
    #pragma omp single
    {
      YP = (double*)mkl_malloc(A->m * P * sizeof(double), sizeof(double));
      for (int i=0; i<A->m*P; i++) YP[i] = 0.;
    }
    #pragma omp for
    for (int j=0; j<A->n; j++) {
      for (int ii=A->col_start[j]; ii<A->col_start[j+1]; ii++) {
        YP[p * A->m + A->row_idx[ii]] += A->val[ii] * x[j];
      }
    }
    #pragma omp for
    for (int i=0; i<A->m; i++) {
      for (int p=0; p<P; p++) {
        y[i] += YP[A->m * p + i];
      }
    }
  }
  mkl_free(YP);
}

This function works, but it still gives a slowdown, although not as severe as with dcscmv_atomic.
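A variation on the hand-written reduction, sketched here under the assumption that per-thread heap buffers are acceptable, lets each thread allocate its own accumulator with `calloc` and merge it into y inside a critical section. The name `dcscmv_heap_private` is hypothetical, and `calloc` is swapped in for `mkl_malloc` only to keep the sketch free of MKL. This is not claimed to be faster; it only avoids depending on thread stack size and on the single shared buffer.

```c
#include <stdlib.h>

typedef struct SparseMatrixCSC {
  int m, n, nnz;
  double *val;
  int *row_idx;
  int *col_start;
} SparseMatrixCSC;

/* Hypothetical sketch: returns 0 on success, -1 if an allocation failed. */
int dcscmv_heap_private(const SparseMatrixCSC *A, const double *x, double *y) {
  int failed = 0;
  if (A->m <= 0) return 0;
  for (int i = 0; i < A->m; i++) y[i] = 0.0;

  #pragma omp parallel shared(failed)
  {
    /* Zero-initialized accumulator owned by this thread, on the heap. */
    double *y_local = calloc((size_t)A->m, sizeof *y_local);
    if (y_local == NULL) {
      #pragma omp atomic write
      failed = 1;
    }
    /* After the barrier every thread sees the same value of failed. */
    #pragma omp barrier
    if (!failed) {
      #pragma omp for nowait
      for (int j = 0; j < A->n; j++) {
        for (int ii = A->col_start[j]; ii < A->col_start[j + 1]; ii++) {
          y_local[A->row_idx[ii]] += A->val[ii] * x[j];
        }
      }
      /* Merge one thread at a time; the order of the merges is unspecified. */
      #pragma omp critical
      {
        for (int i = 0; i < A->m; i++) y[i] += y_local[i];
      }
    }
    free(y_local);
  }
  return failed ? -1 : 0;
}
```

The merge is serialized across threads, so its cost grows with m times the number of threads; for matrices with few nonzeros per column that merge can dominate the actual product.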

I still hope that getting dcscmv_builtin_array_reduction to work will give some speedup. Hence, my question is: what is wrong with the way I implemented the vector reduction in dcscmv_builtin_array_reduction, and how can I get rid of this segmentation fault?

I tried to apply OpenMP's built-in vector reduction feature with #pragma omp parallel for reduction(+:y[:A->m]). The code compiles, but it results in a segmentation fault when executed with multiple threads.
