Sergei

Reputation: 31

Strange slowdown when using OpenMP

I am trying to improve the performance of a rather complex iterative algorithm by parallelizing the matrix multiplication that is called on each iteration. The algorithm runs 500 iterations and takes approximately 10 seconds, but after parallelizing the matrix multiplication it slows down to 13 seconds. However, when I tested a matrix multiplication of the same dimensions on its own, it did get faster. (I am talking about 100x100 matrices.)
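Roughly, the parallelized multiplication looks like this (a simplified sketch, not my exact code; the function name and the flat row-major layout are just for illustration):

#include <vector>

// C = A * B for N x N matrices stored row-major in flat vectors.
void multiply(const std::vector<double>& A, const std::vector<double>& B,
              std::vector<double>& C, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            double acc = 0.0;
            for (int k = 0; k < N; k++)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}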

Finally, I switched off all parallelization inside the algorithm and added the following piece of code on each iteration, which does absolutely nothing and presumably shouldn't take long:

int j;
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
    j = i;

And again, there is a 30% slowdown compared to the same algorithm without this piece of code.

Thus, calling any OpenMP parallelization 500 times inside the main algorithm somehow slows things down. This behavior looks very strange to me; does anybody have a clue what the problem is?

The main algorithm is called from a desktop application compiled with VS2010, Win32 Release. I work on an Intel Core i3 (parallelization creates 4 threads), 64-bit Windows 7.

Here is the structure of the program:

int internal_method(..)
{
    ... // no openmp here

    // The following code does nothing, has nothing to do with the rest of the program
    // and shouldn't take long, but somehow adding this code caused a 3 sec slowdown
    // of Huge_algorithm().
    double sum;
    #pragma omp parallel for private(sum)
    for (int i = 0; i < 10; i++)
        sum = i*i*i / (1.0 + i*i*i*i);

    ... // no openmp here
}


int Huge_algorithm(..)
{
    ... // no openmp here

    for (int i = 0; i < 500; i++)
    {
        ... // no openmp

        internal_method(..);

        ... // no openmp
    }

    ... // no openmp here
}

So, the final point is: calling the parallel piece of code 500 times on its own (with the rest of the algorithm omitted) takes less than 0.01 sec, but calling it 500 times inside the huge algorithm delays the entire algorithm by 3 seconds. What I don't understand is how such a small parallel part can affect the rest of the algorithm.
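For reference, this is roughly how I timed the parallel piece on its own (a sketch using omp_get_wtime(); the real measurement happens inside the application):

#include <cstdio>
#include <omp.h>

int main()
{
    double sum = 0.0;
    double t0 = omp_get_wtime();
    // The same empty parallel region, called 500 times with nothing else around it.
    for (int iter = 0; iter < 500; iter++)
    {
        #pragma omp parallel for private(sum)
        for (int i = 0; i < 10; i++)
            sum = i*i*i / (1.0 + i*i*i*i);
    }
    double t1 = omp_get_wtime();
    printf("500 parallel regions alone: %f s\n", t1 - t0);
    return 0;
}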

Upvotes: 3

Views: 1457

Answers (2)

Sebastian Mach

Reputation: 39089

For 10 iterations and a simple assignment, I guess there is too much OpenMP overhead compared to the computation itself. What looks lightweight here is actually managing and synchronizing multiple threads, which may not even come from a thread pool. There might be some locking involved, and I don't know how good MSVC is at estimating whether it is worth parallelizing at all.

Try with bigger loop bodies or a larger number of iterations (say 1024*1024 iterations, just for starters).
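Something along these lines, for example (just a sketch to get a feel for the overhead; the reduction and the 1024*1024 count are arbitrary choices of mine):

#include <cstdio>
#include <omp.h>

int main()
{
    const int n = 1024 * 1024;
    double sum = 0.0;
    double t0 = omp_get_wtime();
    // Enough work per thread that the computation dwarfs the fork/join cost.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
    {
        double x = i;
        sum += x*x*x / (1.0 + x*x*x*x);
    }
    double t1 = omp_get_wtime();
    printf("sum = %f, time = %f s\n", sum, t1 - t0);
    return 0;
}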


Example OpenMP Magick:

#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
    j = i;

This might be approximately expanded by a compiler to:

const unsigned __cpu_count = __get_cpu_count();
unsigned *__j       = alloca (sizeof (unsigned) * __cpu_count);
__thread *__threads = alloca (sizeof (__thread) * __cpu_count);
for (unsigned u=0; u!=__cpu_count; ++u) {
    __init_thread (__threads+u);
    __run_thread ([u]{for (int i=u; i<10; i+=__cpu_count)
                          __j[u] = i;}); // assume lambdas
}

for (unsigned u=0; u!=__cpu_count; ++u)
    __join (__threads+u);

with __init_thread(), __run_thread() and __join() being non-trivial functions that invoke certain system calls.

In case thread pools are used, you would replace the alloca() and __init_thread() of the threads by something like __pick_from_pool() or so.

(Note that the names and the emitted code are all imaginary; an actual implementation will look different.)


Regarding your updated question:

You seem to be parallelizing at the wrong granularity. Put as much workload as possible into each thread, so instead of

for (...) {
    #pragma omp parallel ...
    for (...) { }
}

try

#pragma omp parallel ...
for (...) {
    for (...) { }
}

Rule of thumb: Keep workloads big enough per thread so as to reduce relative overhead.
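In your case the outer 500 iterations presumably depend on each other, so one way to apply this is to open the parallel region once around the whole loop and only share the inner work (a rough sketch under that assumption, not your actual code):

void compute(double* data, int n, int steps)
{
    #pragma omp parallel                // threads are forked once, not once per step
    for (int step = 0; step < steps; step++)
    {
        #pragma omp for                 // work-sharing; the implicit barrier keeps steps in order
        for (int i = 0; i < n; i++)
            data[i] = 0.5 * data[i] + i;    // placeholder for the real per-element work
    }
}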

Upvotes: 2

huseyin tugrul buyukisik

Reputation: 11910

Maybe just j = i does not generate enough work per core relative to memory bandwidth; maybe you should try a calculation that does more work (for example taking i*i*i*i*i*i and dividing it by i+i+i).
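For example something like this (just a sketch; I start the loop at 1 so the divisor is never zero):

int main()
{
    long long j;
    #pragma omp parallel for private(j)
    for (int i = 1; i < 10; i++)                    // start at 1 so i+i+i is never zero
    {
        long long v = (long long)i*i*i*i*i*i;       // i^6
        j = v / (i + i + i);                        // more work than a plain j = i
    }
    return 0;
}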

Are you running this on a multi-core CPU or on a GPU?

Upvotes: 0
