Dylan

Reputation: 31

Cache management for sparse matrix multiplication using OpenMP

I am having issues with what I think is some false sharing in the cache; I am only getting a small speedup with the following code compared to the unparallelized version.

matrix1 and matrix2 are sparse matrices in a struct with (row, col, val) format.
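For reference, a minimal sketch of the struct, matching the (row, col, val) description and the field accesses in the code below:

struct SparseRow {
    int row;
    int col;
    float val;
};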

void pMultiply(struct SparseRow *matrix1, struct SparseRow *matrix2, int m1Rows, int m2Rows, struct SparseRow **result) {

*result = malloc(1 * sizeof(struct SparseRow));

int resultNonZeroEntries = 0;

#pragma omp parallel for atomic
for(int i = 0; i < m1Rows; i++)
{
    int curM1Row = matrix1[i].row;
    int curM1Col = matrix1[i].col;
    float curM1Value = matrix1[i].val;

    for(int j = 0; j < m2Rows; j++)
    {

        int curM2Row = matrix2[j].row;
        int curM2Col = matrix2[j].col;
        float curM2Value = matrix2[j].val;

        if(curM1Col == curM2Row)
        {
            *result = realloc(*result, 
            (sizeof(struct SparseRow)*(resultNonZeroEntries+1)));

            (*result)[resultNonZeroEntries].row = curM1Row;
            (*result)[resultNonZeroEntries].col = curM2Col;
            (*result)[resultNonZeroEntries].val += curM1Value*curM2Value;
            resultNonZeroEntries++;
            break;
        }

    }
}
}

Upvotes: 1

Views: 335

Answers (1)

Brice

Reputation: 1580

Several issues there:

  • As mentioned by Brian Brochers, the #pragma omp atomic clause should be placed just before the line that needs to be protected against a race condition.
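    For illustration, a minimal sketch of that placement, using a hypothetical shared counter (this shows only the correct clause position, not the fix recommended below; count, n and values are made up):

        int count = 0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (values[i] != 0) {  // hypothetical condition
                #pragma omp atomic
                count++;           // only this shared update is protected
            }
        }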
  • Reallocating memory at each step is likely a performance killer. If the memory cannot be reallocated in place and needs to be copied elsewhere, this will be slow. It is also a source of errors, as the value of the pointer result is modified: other threads keep running while the reallocation takes place and may try to access memory at the "old" address, or several threads may try to reallocate result concurrently. Placing the whole realloc + addition part in a critical section would be safer, but would essentially serialize the function for anything but testing the equality of row/column indices, at the cost of a significant overhead. Instead, each thread should work on a local buffer, then merge its results at a later stage, as sketched below. Reallocation should be done in blocks of sufficient size.

    // Make sure this compiles even without OpenMP; memcpy needs <string.h>
    #include <stdlib.h>
    #include <string.h>
    #ifdef _OPENMP
       #include <omp.h>
       #define thisThread omp_get_thread_num()
       #define nThreads omp_get_num_threads()
    #else
       #define thisThread 0
       #define nThreads 1
    #endif
    
    // shared variables
    struct SparseRow *result;
    int *copyIndex, *threadNonZero;
    #pragma omp parallel
    {
    // each thread initializes a local buffer and local variables
    int localNonZero = 0;
    int allocatedSize = 1024;
    struct SparseRow *localResult = malloc(allocatedSize * sizeof(*localResult));
    
    // one thread initializes the shared bookkeeping arrays
    #pragma omp single
    {
        threadNonZero = malloc(nThreads * sizeof(int));
        copyIndex = malloc((nThreads + 1) * sizeof(int));
    }
    
    #pragma omp for
    for (int i = 0; i < m1Rows; i++) {
        /*
         * do the same as your initial code but:
         * realloc an extra 1024 lines each time localNonZero exceeds allocatedSize
         * fill the local buffer and increment the localNonZero counter
         * this is safe, no need to use critical / atomic clauses
         */
    }
    
    threadNonZero[thisThread] = localNonZero; // publish this thread's non-zero count
    #pragma omp barrier
    
    // Wrap-up: compute a prefix sum of the per-thread counts, allocate the
    // output, and determine where each thread will copy its local buffer
    #pragma omp single
    {
        copyIndex[0] = 0;
        for (int i = 0; i < nThreads; i++)
            copyIndex[i + 1] = threadNonZero[i] + copyIndex[i];
        result = malloc(copyIndex[nThreads] * sizeof(*result));
    }
    
    // Copy the results from the local buffers to the global result
    memcpy(&result[copyIndex[thisThread]], localResult, localNonZero * sizeof(*localResult));
    
    // Free memory; barrier first, so no thread is still reading copyIndex
    free(localResult);
    #pragma omp barrier
    #pragma omp single
    {
        free(copyIndex);
        free(threadNonZero);
    }
    } // end parallel
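    As a sketch only, the elided loop body above could look like the following, reusing the question's matrix1/matrix2/m2Rows names and inner-join logic (the block size of 1024 is an arbitrary choice):

        // Inside the "#pragma omp for" loop over i, assuming the question's
        // matching logic; grows the local buffer in blocks of 1024 entries
        for (int j = 0; j < m2Rows; j++) {
            if (matrix1[i].col == matrix2[j].row) {
                if (localNonZero == allocatedSize) {
                    allocatedSize += 1024;
                    localResult = realloc(localResult, allocatedSize * sizeof(*localResult));
                }
                localResult[localNonZero].row = matrix1[i].row;
                localResult[localNonZero].col = matrix2[j].col;
                localResult[localNonZero].val = matrix1[i].val * matrix2[j].val;
                localNonZero++;
                break;
            }
        }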
    
  • Please note that the algorithm will generate duplicates, e.g. if the first matrix contains values at positions (1,10) and (1,20) and the second one at (10,5) and (20,5), there will be two (1,5) lines in the result. At some point, a compaction function that merges duplicate lines will be needed (see the sketch below).
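A minimal compaction sketch, assuming it is acceptable to sort the output by (row, col) first (the function and comparator names are illustrative, not part of the code above):

    #include <stdlib.h>
    
    // Order entries by (row, col) so duplicates become adjacent
    static int cmpRowCol(const void *a, const void *b) {
        const struct SparseRow *x = a, *y = b;
        if (x->row != y->row) return x->row - y->row;
        return x->col - y->col;
    }
    
    // Merge duplicate (row, col) entries in place; returns the new count
    int compactRows(struct SparseRow *r, int n) {
        if (n == 0) return 0;
        qsort(r, n, sizeof(*r), cmpRowCol);
        int out = 0;
        for (int i = 1; i < n; i++) {
            if (r[i].row == r[out].row && r[i].col == r[out].col)
                r[out].val += r[i].val;  // accumulate duplicate entries
            else
                r[++out] = r[i];
        }
        return out + 1;
    }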

Upvotes: 0
