I am developing my own implementation of sparse BLAS functions for the CSC storage format. To do so, I created the following data structure:
typedef struct SparseMatrixCSC {
    int m;          // Number of rows
    int n;          // Number of columns
    int nnz;        // Number of stored values
    double *val;    // Stored values
    int *row_idx;   // Row indices of stored values
    int *col_start; // Column j contains values with indices in col_start[j]:(col_start[j+1]-1)
} SparseMatrixCSC;
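For illustration, here is how a small 3x3 matrix maps onto this structure (the example values are mine, chosen for clarity; indices are 0-based):

// The 3x3 matrix
//     | 1 0 2 |
// A = | 0 3 0 |
//     | 4 0 5 |
// stored column by column:
double val[]    = {1., 4., 3., 2., 5.};  // values of column 0, then 1, then 2
int row_idx[]   = {0,  2,  1,  0,  2};   // row index of each stored value
int col_start[] = {0, 2, 3, 5};          // column j spans indices col_start[j] .. col_start[j+1]-1
SparseMatrixCSC A = {3, 3, 5, val, row_idx, col_start};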
Then, I wanted to use OpenMP to parallelize the matrix-vector product (SpMV). Since distinct columns can hold entries in the same row, concurrent updates to y[A->row_idx[ii]] race with one another, so I tried different approaches to circumvent these race conditions.
First, I used atomic operations as follows:
void dcscmv_atomic(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    #pragma omp parallel for
    for (int j = 0; j < A->n; j++) {
        for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
            #pragma omp atomic
            y[A->row_idx[ii]] += A->val[ii] * x[j];
        }
    }
}
This works correctly, but it is terribly slow and actually yields a slowdown rather than a speedup.
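For reference, the plain serial kernel that I compare these timings against is, in essence, the following sketch (the name dcscmv_seq is just for illustration):

void dcscmv_seq(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    for (int j = 0; j < A->n; j++) {
        for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
            y[A->row_idx[ii]] += A->val[ii] * x[j];  // no race: single thread
        }
    }
}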
Second, I tried to use OpenMP's built-in vector reduction feature as follows:
void dcscmv_builtin_array_reduction(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    #pragma omp parallel for reduction(+:y[:A->m])
    for (int j = 0; j < A->n; j++) {
        for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
            y[A->row_idx[ii]] += A->val[ii] * x[j];
        }
    }
}
While this code compiles, it only works correctly with a single thread; with multiple threads it leads to a segmentation fault.
Third, since I could not get OpenMP's built-in vector reduction to work, I tried coding my own reduction as follows:
void dcscmv_array_reduction_from_scratch(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    double *YP;
    #pragma omp parallel
    {
        int P = omp_get_num_threads();
        int p = omp_get_thread_num();
        #pragma omp single
        {
            // One private copy of y per thread, laid out contiguously;
            // the implicit barrier after "single" makes YP visible to all threads
            YP = (double*)mkl_malloc(A->m * P * sizeof(double), sizeof(double));
            for (int i = 0; i < A->m * P; i++) YP[i] = 0.;
        }
        // Each thread accumulates into its own copy, so there are no races
        #pragma omp for
        for (int j = 0; j < A->n; j++) {
            for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
                YP[p * A->m + A->row_idx[ii]] += A->val[ii] * x[j];
            }
        }
        // Reduce the P thread-private copies into y
        #pragma omp for
        for (int i = 0; i < A->m; i++) {
            for (int q = 0; q < P; q++) {
                y[i] += YP[A->m * q + i];
            }
        }
    }
    mkl_free(YP);
}
This function worked, but still gave me a slowdown, although not as bad as with dcscmv_atomic.
I still have hope that if I get dcscmv_builtin_array_reduction to work, I might be able to get some speedup. Hence, my question is: what is wrong with the way I implemented the array reduction in dcscmv_builtin_array_reduction, and how can I get rid of this segmentation fault?
I tried to apply OpenMP's built-in vector reduction feature with #pragma omp parallel for reduction(+:y[:A->m]), but although the code compiled, it results in a segmentation fault at execution with multiple threads.
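A minimal driver to reproduce the behavior could look like this (the 3x3 example data is illustrative; the expected result is noted in the comment):

#include <stdio.h>

int main(void) {
    // Same 3x3 example matrix as above
    double val[]    = {1., 4., 3., 2., 5.};
    int row_idx[]   = {0, 2, 1, 0, 2};
    int col_start[] = {0, 2, 3, 5};
    SparseMatrixCSC A = {3, 3, 5, val, row_idx, col_start};
    double x[] = {1., 1., 1.};
    double y[3];
    dcscmv_builtin_array_reduction(&A, x, y);  // segfaults with more than one thread; expected y = {3, 3, 9}
    for (int i = 0; i < A.m; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}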
Your problem is with the indirect addressing, which makes the loop not trivially parallel. As you have noticed, simple solutions either don't work or are slow. There are two ways out.