Reputation: 43
I have a problem: the parallel version of my LU decomposition algorithm runs in the same time as the sequential one:
void lup_od_omp(double* a, int n){
    int i,j,k;
    for(k = 0; k < n - 1; ++k)
    {
        #pragma omp parallel for shared(a,n,k) private(i,j)
        for(i = k + 1; i < n; i++)
        {
            a[i*n + k] /= a[k*n + k];
            for(j = k + 1; j < n; j++)
            {
                a[i*n + j] -= a[i*n + k]*a[k*n + j];
            }
        }
    }
}
Maybe I'm doing something wrong?
Upvotes: 3
Views: 5071
Reputation: 9781
The main problem with your code is that you decomposed the workload badly.
For a single LU decomposition, you invoke the parallel for n-1 times. Each parallel for performs a thread fork and join, which introduces a lot of overhead. Especially when k is large, the inner loop (for(i){for(j){...}}) contains only very little work, so parallelizing it is quite inefficient.
You may consider using a proper agglomeration scheme to reduce the overhead. For more info, please refer to these slides:
http://courses.engr.illinois.edu/cs554/notes/06_lu_8up.pdf
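As an illustration of that idea (a sketch only, and just one of the possible schemes), you can open the parallel region once outside the k loop, so each elimination step reuses the same team of threads and only pays for the barrier at the end of the work-sharing loop:

// Sketch: single parallel region around the whole factorisation.
// The implicit barrier at the end of "omp for" ensures that step k is
// complete before any thread starts step k+1.
void lup_od_omp_fused(double* a, int n)
{
    #pragma omp parallel shared(a, n)
    {
        for (int k = 0; k < n - 1; ++k) {
            #pragma omp for schedule(static)
            for (int i = k + 1; i < n; i++) {
                a[i*n + k] /= a[k*n + k];
                for (int j = k + 1; j < n; j++)
                    a[i*n + j] -= a[i*n + k] * a[k*n + j];
            }
            // implicit barrier here keeps the k steps in order
        }
    }
}

How much this helps depends on the OpenMP runtime; most keep a thread pool alive between regions, so the saving is mainly the per-step region setup rather than thread creation.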
On the other hand, you could use an existing performance library, such as Intel MKL, to get maximum performance for LU factorization:
http://software.intel.com/en-us/node/468682
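For example, a sketch assuming MKL's LAPACKE interface (dgetrf is the standard LAPACK routine for LU; note that, unlike the code in the question, it also does partial pivoting):

#include <stdlib.h>
#include <mkl_lapacke.h>   /* or <lapacke.h> with a reference LAPACK build */

/* Sketch: factor an n x n row-major matrix in place with partial pivoting. */
int lu_with_mkl(double* a, int n)
{
    lapack_int* ipiv = malloc(sizeof(lapack_int) * n);   /* pivot indices */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, a, n, ipiv);
    free(ipiv);
    return info;   /* 0 on success, >0 if a pivot was exactly zero */
}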
Upvotes: 1
Reputation: 22542
Since you are only working on two cores, your parallelisation may actually get in the way of the vectoriser. Vectorisation on SSE2 gives you a data bandwidth of 2 doubles per op, 4 on AVX.
Two threads have a lot of synchronisation overhead, which may slow you down, especially if you lose vectorisation. Also, for some reason, your #pragma omp does not start any threads unless omp_set_num_threads is invoked to actually make it use threads.
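If in doubt, it is easy to check what the runtime will actually give you; a minimal sketch using the standard OpenMP API:

#include <stdio.h>
#include <omp.h>

/* Sketch: report how many threads OpenMP will use, and force two if needed. */
int main(void)
{
    printf("max threads: %d\n", omp_get_max_threads());
    omp_set_num_threads(2);          /* e.g. force two threads on a dual core */
    #pragma omp parallel
    {
        #pragma omp single
        printf("threads in region: %d\n", omp_get_num_threads());
    }
    return 0;
}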
Another thing, also related to vectorisation, is that not all compilers understand that a[i*n + j] is intended to address a two-dimensional array, so it is better to declare it as such in the first place.
Here is a slight optimisation of your code that runs fairly well on my Xeon:
void lup_od_omp(int n, double (*a)[n]){
    int i,k;
    for(k = 0; k < n - 1; ++k) {
        // for the vectoriser
        for(i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];
        }
        #pragma omp parallel for shared(a,n,k) private(i) schedule(static, 64)
        for(i = k + 1; i < n; i++) {
            int j;
            const double aik = a[i][k]; // some compilers will do this automatically
            for(j = k + 1; j < n; j++) {
                a[i][j] -= aik * a[k][j];
            }
        }
    }
}
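If your matrix lives in a flat double* buffer, as in your original signature, a cast lets you call this version; a minimal sketch of a hypothetical caller, assuming C99 variably-modified types:

#include <stdlib.h>

/* Hypothetical caller: allocate an n x n matrix as one flat buffer,
   then reinterpret it as a pointer to rows of n doubles. */
void factorise(int n)
{
    double *buf = malloc(sizeof(double) * n * n);
    /* ... fill buf with the matrix ... */
    lup_od_omp(n, (double (*)[n])buf);
    free(buf);
}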
Runtimes for a 3000x3000 array, compiled with icc -O2:
Your code sequential: 0:24.61 99% CPU
Your code 8 threads : 0:05.21 753% CPU
My code sequential: 0:18.53 99% CPU
My code 8 threads : 0:05.42 766% CPU
And on a different machine I tested it on AVX (256-bit vectors, 4 doubles per op):
My code on AVX sequential : 0:09.45 99% CPU
My code on AVX 8 threads : 0:03.92 766% CPU
As you can see, I have improved the code for the vectoriser a little, but didn't do much for the parallel section.
Upvotes: 4