Reputation: 43
I have a problem: the parallel version of my LU decomposition algorithm runs in the same time as the sequential one:
void lup_od_omp(double* a, int n){
    int i,j,k;
    for(k = 0; k < n - 1; ++k)
    {
        #pragma omp parallel for shared(a,n,k) private(i,j)
        for(i = k + 1; i < n; i++)
        {
            a[i*n + k] /= a[k*n + k];
            for(j = k + 1; j < n; j++)
            {
                a[i*n + j] -= a[i*n + k]*a[k*n + j];
            }
        }
    }
}
Maybe I'm doing something wrong?
Upvotes: 3
Views: 5071
Reputation: 9781
The main problem with your code is that you decomposed the workload badly.
For a single LU decomposition, you invoke the parallel for n-1 times. Each parallel for performs a thread fork and join, which introduces a lot of overhead. Especially when k is large, the inner loop (for(i){for(j){...}}) contains only very little work, so parallelizing it is quite inefficient.
You may consider using a proper agglomeration scheme to reduce the overhead. For more info, please refer to these slides:
http://courses.engr.illinois.edu/cs554/notes/06_lu_8up.pdf
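As an illustration of that idea (a sketch only, and just one of the possible schemes), you can open the parallel region once outside the k loop, so each elimination step reuses the same team of threads and only pays for the barrier at the end of the work-sharing loop:

// Sketch: single parallel region around the whole factorisation.
// The implicit barrier at the end of "omp for" ensures that step k is
// complete before any thread starts step k+1.
void lup_od_omp_fused(double* a, int n)
{
    #pragma omp parallel shared(a, n)
    {
        for (int k = 0; k < n - 1; ++k) {
            #pragma omp for schedule(static)
            for (int i = k + 1; i < n; i++) {
                a[i*n + k] /= a[k*n + k];
                for (int j = k + 1; j < n; j++)
                    a[i*n + j] -= a[i*n + k] * a[k*n + j];
            }
            // implicit barrier here keeps the k steps in order
        }
    }
}

How much this helps depends on the OpenMP runtime; most keep a thread pool alive between regions, so the saving is mainly the per-step region setup rather than thread creation.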
On the other hand, you could use an existing performance library, such as Intel MKL, to get maximum performance for LU factorization:
http://software.intel.com/en-us/node/468682
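For example, a sketch assuming MKL's LAPACKE interface (dgetrf is the standard LAPACK routine for LU; note that, unlike the code in the question, it also does partial pivoting):

#include <stdlib.h>
#include <mkl_lapacke.h>   /* or <lapacke.h> with a reference LAPACK build */

/* Sketch: factor an n x n row-major matrix in place with partial pivoting. */
int lu_with_mkl(double* a, int n)
{
    lapack_int* ipiv = malloc(sizeof(lapack_int) * n);   /* pivot indices */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, a, n, ipiv);
    free(ipiv);
    return info;   /* 0 on success, >0 if a pivot was exactly zero */
}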
Upvotes: 1
Reputation: 22542
Since you are only working on two cores, your parallelisation may actually get in the way of the vectoriser. Vectorisation on SSE2 gives you a data bandwidth of 2 doubles per op, 4 on AVX.
Two threads have a lot of synchronisation overhead, which may slow you down, especially if you lose vectorisation. Also, for some reason, your #pragma omp does not start any threads unless omp_set_num_threads is invoked to actually make it use threads.
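If in doubt, it is easy to check what the runtime will actually give you; a minimal sketch using the standard OpenMP API:

#include <stdio.h>
#include <omp.h>

/* Sketch: report how many threads OpenMP will use, and force two if needed. */
int main(void)
{
    printf("max threads: %d\n", omp_get_max_threads());
    omp_set_num_threads(2);          /* e.g. force two threads on a dual core */
    #pragma omp parallel
    {
        #pragma omp single
        printf("threads in region: %d\n", omp_get_num_threads());
    }
    return 0;
}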
Another thing, also related to vectorisation, is that not all compilers understand that a[i*n + j] is intended to address a two-dimensional array, so it is better to declare it as such in the first place.
Here is a slight optimisation of your code that runs fairly well on my Xeon:
void lup_od_omp(int n, double (*a)[n]){
    int i,k;
    for(k = 0; k < n - 1; ++k) {
        // for the vectoriser
        for(i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];
        }
        #pragma omp parallel for shared(a,n,k) private(i) schedule(static, 64)
        for(i = k + 1; i < n; i++) {
            int j;
            const double aik = a[i][k]; // some compilers will do this automatically
            for(j = k + 1; j < n; j++) {
                a[i][j] -= aik * a[k][j];
            }
        }
    }
}
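If your matrix lives in a flat double* buffer, as in your original signature, a cast lets you call this version; a minimal sketch of a hypothetical caller, assuming C99 variably-modified types:

#include <stdlib.h>

/* Hypothetical caller: allocate an n x n matrix as one flat buffer,
   then reinterpret it as a pointer to rows of n doubles. */
void factorise(int n)
{
    double *buf = malloc(sizeof(double) * n * n);
    /* ... fill buf with the matrix ... */
    lup_od_omp(n, (double (*)[n])buf);
    free(buf);
}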
Runtimes for a 3000x3000 array, compiled with icc -O2:
Your code sequential: 0:24.61 99% CPU
Your code 8 threads : 0:05.21 753% CPU
My code sequential: 0:18.53 99% CPU
My code 8 threads : 0:05.42 766% CPU
And on a different machine I tested it on AVX (256-bit vectors, 4 doubles per op):
My code on AVX sequential : 0:09.45 99% CPU
My code on AVX 8 threads : 0:03.92 766% CPU
As you can see, I have improved the code for the vectoriser a little, but didn't do much for the parallel section.
Upvotes: 4