arbitUser1401

Reputation: 575

Would merging OpenMP regions give a performance benefit?

I have a parallel code that is purely MPI. It scales pretty well up to 8 cores. However, due to memory requirements I have to move to a hybrid (MPI + OpenMP) code. My code has the following structure:

for( /* sequential loop, 10e5 iterations */ )
{
    highly_parallelizable_function_call_1();
    some_sequential_work();
    highly_parallelizable_function_call_2();
    some_sequential_work();
    MPI_send();
    MPI_recv();
    highly_parallelizable_function_call_3();
    highly_parallelizable_function_call_4();
}

Roughly, functions 3 and 4 account for 90% of the time. I changed functions 3 and 4 to OpenMP parallel code, and profiling shows I only get a speedup of 4-5 from this. Hence this code might not scale as well as the MPI-only code. I suspect this could be due to threading overhead. To circumvent this, I would like to change the code so that the threads are created only once at the beginning, as follows:

#pragma omp parallel
for( /* sequential loop, 10e5 iterations */ )
{
    parallel_version_function_call_1();

    if( thread_id == 0 ) some_sequential_work();

    parallel_version_function_call_2();

    if( thread_id == 0 ) some_sequential_work();
    if( thread_id == 0 ) MPI_send();
    if( thread_id == 0 ) MPI_recv();

    parallel_version_function_call_3();
    parallel_version_function_call_4();
}
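
Here each parallel_version_function_call_*() would just contain an orphaned worksharing loop that reuses the threads of the enclosing parallel region, roughly like this (only a sketch; the body and the array names are placeholders for the real work):

void parallel_version_function_call_3(double *data, int n)
{
    /* Called from inside the enclosing "#pragma omp parallel" region,
     * so this "omp for" is an orphaned worksharing construct and the
     * iterations are split across the already-existing threads. */
    #pragma omp for
    for (int i = 0; i < n; ++i)
        data[i] = 2.0 * data[i];   /* stand-in for the real work */
    /* implicit barrier at the end of the "omp for" */
}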

Would doing something like this be beneficial?

Upvotes: 1

Views: 88

Answers (1)

user1829358

Reputation: 1091

I think that your current implementation does not account for Amdahl's law (google it if you like). Given that you only parallelized 90% of your code, the best possible speedup you can expect (on 8 cores) is:

Speedup = 1.0 / (p_seq + (1 - p_seq) / #cores)

Which in your case is:

Speedup = 1.0 / (0.1 + 0.9 / 8) ≈ 4.71
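
If you want to check the bound for other core counts, here is a quick sketch in plain C (the 0.9 parallel fraction is taken from your description):

#include <stdio.h>

/* Amdahl's law: upper bound on the speedup for a parallel fraction p
 * on n cores. p = 0.9 corresponds to your functions 3 and 4. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.9;
    for (int n = 2; n <= 16; n *= 2)
        printf("%2d cores -> max speedup %.2f\n", n, amdahl(p, n));
    return 0;
}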

So your current OpenMP parallelization is doing exactly what would be expected. Long story short: yes, the latter implementation should give you a better speedup, provided it means that functions 1 and 2 get parallelized as well.
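
Just to illustrate the structure, here is a minimal, compilable sketch of the "one big parallel region" approach. All the work functions from your question are replaced by one trivial stand-in, the MPI_send/MPI_recv pair is reduced to a dummy ring exchange with MPI_Sendrecv, and the iteration count is made small; the parts that matter are the omp master sections (with explicit barriers, since master has no implicit one) and the MPI_THREAD_FUNNELED initialisation:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N     1000
#define NITER 100              /* stand-in for the 10e5 outer iterations */

static double data[N];

static void parallel_work(void)            /* stand-in for functions 1-4 */
{
    #pragma omp for                        /* orphaned worksharing loop  */
    for (int i = 0; i < N; ++i)
        data[i] += 1.0;
}                                          /* implicit barrier here */

int main(int argc, char **argv)
{
    int provided, rank, size;
    double halo = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    for (int it = 0; it < NITER; ++it) {
        parallel_work();                   /* function 1 */

        #pragma omp master
        { data[0] += 1.0; }                /* some_sequential_work */
        #pragma omp barrier                /* master has no implicit barrier */

        parallel_work();                   /* function 2 */

        #pragma omp master
        {                                  /* only the master thread talks to MPI */
            data[0] += 1.0;                /* some_sequential_work */
            MPI_Sendrecv(&data[0], 1, MPI_DOUBLE, (rank + 1) % size, 0,
                         &halo,    1, MPI_DOUBLE, (rank + size - 1) % size, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        #pragma omp barrier

        parallel_work();                   /* function 3 */
        parallel_work();                   /* function 4 */
    }

    if (rank == 0)
        printf("rank 0 done, data[0] = %f, halo = %f\n", data[0], halo);
    MPI_Finalize();
    return 0;
}

Whether this actually helps then depends mostly on how much of some_sequential_work and the communication remains serial, which is again exactly the Amdahl bound above.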

Upvotes: 1
