Kaare

Reputation: 531

Using OpenMP better

I am trying to implement OpenMP, but like so many other posters before me, the result has simply been to slow the code down. Inspired by previous answers, I went from using #pragma omp parallel for to #pragma omp task, in the hope of avoiding some overhead. Unfortunately, the parallelized code is still twice as slow as the serial version. From other answers it seems that the proper approach depends on the specific demands of the code, which is why I thought I had to ask a question myself.

First the pseudo-code:

#pragma omp parallel
{
    #pragma omp master
    while (will be run some hundreds of millions of times)
    {
        for (between 5 and 20 iterations)
        {
            #pragma omp task
            (something)
        }
        // it is important that all the above tasks are completed before going on
        #pragma omp taskwait

        (something)

        if (something)
        {
            (something)

            for (between 50 and 200 iterations)
            {
                #pragma omp task
                (something)
            }
            #pragma omp taskwait

            (something)
        }
    }
}

Only the two for-loops can be parallelized; the rest must be done in the right order. I put the parallel and master directives outside the while-loop in an attempt to reduce the overhead of creating the thread team.
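
For concreteness, here is a compilable sketch of the same structure; do_work() and check_something() are just stand-ins for the real bodies, not names from my actual program:

#include <omp.h>

void do_work(int i);        /* stand-in for the real work */
int  check_something(void); /* stand-in for the real test */

void run(long steps, int n_small, int n_big)
{
    #pragma omp parallel
    {
        #pragma omp master
        for (long s = 0; s < steps; ++s)       /* hundreds of millions */
        {
            for (int i = 0; i < n_small; ++i)  /* 5-20 iterations */
            {
                #pragma omp task firstprivate(i)
                do_work(i);
            }
            #pragma omp taskwait  /* intent: all tasks above done before going on */

            do_work(-1);

            if (check_something())
            {
                do_work(-2);

                for (int i = 0; i < n_big; ++i)  /* 50-200 iterations */
                {
                    #pragma omp task firstprivate(i)
                    do_work(i);
                }
                #pragma omp taskwait

                do_work(-3);
            }
        }
    }
}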

I am also a bit curious whether I am using taskwait properly - the specification states that the "parent task" is put on hold until all child tasks have been executed, but it is not quite clear whether that terminology also applies here, where the task regions are not nested.

Can anyone come up with a better way of using OpenMP, so that I actually get a speed-up?

EDIT: Each step in the while-loop depends on all previous steps, so they have to be done serially, with an update at the end. It is an implementation of an "event-driven algorithm" for simulating neural networks, if anyone was wondering.

Upvotes: 2

Views: 694

Answers (3)

tune2fs

Reputation: 7705

For parallel programming you should also design your problem so that you rarely need to synchronize your threads. Every time you synchronize, all threads wait for the slowest one, so you get the worst-case performance of the whole team. If you need frequent synchronization, try to redesign your problem to avoid those syncs.

Tweaking your code from #pragma omp parallel for to #pragma omp task won't get you any significant improvement, as the difference in their execution overhead is normally negligible. Before tweaking individual routine calls or OMP statements you need to adapt your problem to parallel execution. You really need to think in "parallel" to get a good and scalable performance increase; just adapting serial code rarely works.

In your code you should try to parallelize the while-loop and not the inner for-loops. If you only parallelize the small for-loops you will not get any significant performance increase.

Upvotes: 2

pyCthon

Reputation: 12341

Did you remember to set your environment variables accordingly? OMP_NUM_THREADS=N, where N is the number of threads or cores supported by your processor.
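
For example, a quick check using standard OpenMP runtime calls (nothing here is specific to your code):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* omp_get_max_threads() reflects OMP_NUM_THREADS if it was picked up */
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    #pragma omp single
    printf("actual team size: %d\n", omp_get_num_threads());

    return 0;
}

If the first number is not what you expect, the environment variable was not set where the program could see it.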

Upvotes: 0

martiert

Reputation: 741

I'm not sure a task is the right way to go here. I'm not too familiar with tasks, but it seems like a thread is started each time you encounter a #pragma omp task. I would rather try something like:

while (will be run some hundreds of millions of times)
{
    bool flag = false;   // declared outside the parallel region so all threads see it

    #pragma omp parallel
    {
        #pragma omp for   // share the iterations among the team
        for (between 5 and 20 iterations)
        {
            (something)
        }

        #pragma omp single   // or master, plus an explicit barrier
        {
            (something)
            if (something)
            {
                (something)
                flag = true;
            }
        }   // single's implicit barrier makes flag visible to all threads

        if (flag)
        {
            #pragma omp for
            for (between 50 and 200 iterations)
            {
                (something)
            }
        }

        #pragma omp single   // or master, plus an explicit barrier
        {
            (something)
        }
    }
}

It's also important to remember that the work in the for-loops might be too small for parallel execution to give any speedup, as there is an overhead in starting and synchronizing threads. You should also look at the possibility of rewriting your program so that you don't need to synchronize your threads, which you currently do quite a lot. My guess is that your algorithm and workload are actually too small for parallel execution to give any speedup as written right now.
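
If the trip count varies at run time, one way to guard against that overhead is OpenMP's if clause, which skips forking a team when the loop is too small to pay off. A sketch; the threshold and work() are made-up placeholders you would have to tune:

/* Fork a team only when the loop is large enough to amortize the
   thread-management overhead; 10000 and work() are placeholders. */
#pragma omp parallel for if(n > 10000)
for (int i = 0; i < n; ++i)
    work(i);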

Upvotes: 0
