iltp38

Reputation: 529

OpenMP Matrix Multiplication Critical Section

I am trying to parallelize just the innermost loop of matrix multiplication. However, whenever there is more than one thread, the matrix multiplication does not store the correct values in the output array, and I am trying to figure out why.

void matrix() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++){ 
        sum = 0;
        #pragma omp parallel for shared(sum,i,j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp critical
                    sum = sum + A[i][k] * B[k][j];
            }
        C[i][j] = sum;
    }
}
}

I also tried using:

void matrix() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++){ 
        sum = 0;
        #pragma omp parallel for shared(sum,i,j) private(k)
            for (k = 0; k < N; k++) {
                #pragma omp atomic
                    sum += A[i][k] * B[k][j];
            }
        C[i][j] = sum;
    }
}
}

But that didn't work either. I also tried it without the second #pragma, and with:

void matrixC() {
int i,j,k,sum,np;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++){ 
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
            for (k = 0; k < N; k++) {
                    sum = sum + A[i][k] * B[k][j];
            }
        C[i][j] = sum;
    }
}
}

I'm new to OpenMP, but from everything I've read online, at least one of these solutions should work. I know it's probably a problem with a race condition while adding to sum, but I have no idea why it's still getting the wrong sums.

EDIT: Here is a more complete version of the code:

double A[N][N];
double B[N][N];
double C[N][N];
int CHOOSE = CH;

void matrixSequential() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        sum = 0;
        for (k = 0; k < N; k++) {
            sum += A[i][k] * B[k][j];
        }
        C[i][j] = sum;
    }
}
}

void matrixParallel() {
int i,j,k,sum;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++){ 
        sum = 0;
        #pragma omp parallel for shared (i,j) private(k) reduction(+:sum)
            for (k = 0; k < N; k++) {
                sum = sum + A[i][k] * B[k][j];
            }
        C[i][j] = sum;
    }
}
}

int main(int argc, const char * argv[]) {
//populating arrays
int i,j;
for(i=0; i < N; i++){
    for(j=0; j < N; j++){
        A[i][j] = i+j;
        B[i][j] = i+j;
    }
}

for(i=0; i < N; i++){
    for(j=0; j < N; j++){
        C[i][j] = 0;
    }
}

if (CHOOSE == 0) {
    matrixSequential();
}
else if(CHOOSE == 1) {
    matrixParallel();
}

//checking for correctness
double sum;
for(i=0; i < N; i++){
    sum += C[i][i];
}
printf("Sum of diagonal elements of array C: %f \n", sum);
return 0;
}

Upvotes: 0

Views: 1858

Answers (2)

Davislor

Reputation: 15144

Making sum a reduction variable is the right way of doing this and should work (see https://computing.llnl.gov/tutorials/openMP/#REDUCTION). Note that you still have to declare your shared and private variables, such as k.

Update

After you updated to provide an MCVE, @Zboson found the actual bug: you were declaring the arrays as double but accumulating them into an int.

Upvotes: 2

Z boson

Reputation: 33669

IEEE floating-point arithmetic is not associative, i.e. (a+b)+c is not necessarily equal to a+(b+c). Therefore the order in which you reduce an array matters. When you distribute the array elements among different threads, it changes the order from the sequential sum. The same thing can happen using SIMD. See for example this excellent question about using SIMD to do a reduction: An accumulated computing error in SSE version of algorithm of the sum of squared differences.

Your compiler won't normally use associative floating-point arithmetic unless you tell it to, e.g. with -Ofast, -ffast-math, or -fassociative-math with GCC. For example, in order to auto-vectorize (SIMD) a reduction, the compiler requires associative math.

However, when you use OpenMP it automatically assumes associative math, at least for distributing the chunks (within the chunks the compiler still won't use associative arithmetic unless you tell it to), breaking IEEE floating-point rules. Many people are not aware of this.

Since the reduction result depends on the order, you may be interested in reducing the numerical uncertainty. One solution is to use Kahan summation.

Upvotes: 0
