Convex Leopard

Reputation: 121

Matrix Multiplication OpenMP Counter-Intuitive Results

I am currently porting some code to OpenMP at my place of work. One of my tasks is to speed up matrix multiplication for one of our applications.

The matrices are stored in row-major format, so A[i*cols + j] gives the A_i_j element of matrix A.

The code looks like this (uncommenting the pragma parallelises the code):

#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>

#define NUM_THREADS 8
#define size 500
#define num_iter 10

int main (int argc, char *argv[])
{
//    omp_set_num_threads(NUM_THREADS);

    int *A = new int [size*size];
    int *B = new int [size*size];
    int *C = new int [size*size];

    for (int i=0; i<size; i++)
    {
        for (int j=0; j<size; j++)
        {
            A[i*size+j] = j*1;
            B[i*size+j] = i*j+2;
            C[i*size+j] = 0;
        }
    }

    double total_time = 0;
    double start = 0;

    for (int t=0; t<num_iter; t++)
    {
        start = omp_get_wtime();

        int i, k;

//        #pragma omp parallel for  num_threads(10) private(i, k) collapse(2) schedule(dynamic)
        for (int j=0; j<size; j++)
        {
            for (i=0; i<size; i++)
            {
                for (k=0; k<size; k++)
                {
                    C[i*size+j] += A[i*size+k] * B[k*size+j];
                }
            }
        }

        total_time += omp_get_wtime() - start;
    }

    // setprecision is a stream manipulator and must be streamed
    // into the output for it to take effect
    std::cout << std::setprecision(5) << total_time/num_iter << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;

    return 0;
}

What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and averaging shows that static scheduling is slower, which seems counterintuitive to me, since each thread does the same amount of work.

Also, am I correctly speeding up my matrix multiplication code?

Upvotes: 0

Views: 69

Answers (1)

Jim Cownie

Reputation: 2859

Parallel matrix multiplication is non-trivial (have you considered cache-blocking?). Your best bet is likely to use a BLAS library for this, rather than writing it yourself. (Remember, "the best code is the code I do not have to write".)

Wikipedia: Basic Linear Algebra Subprograms points to many implementations, many of which (including the Intel Math Kernel Library) have free licenses.

Upvotes: 1
