Reputation: 575
I have a parallel code that is pure MPI. It scales well up to 8 cores, but due to memory requirements I have to switch to a hybrid MPI+OpenMP code. My code has the following structure:
for (a sequential loop of 10e5 iterations)
{
    highly_parallelizable_function_call_1();
    some_sequential_work;
    highly_parallelizable_function_call_2();
    some_sequential_work;
    MPI_send();
    MPI_recv();
    highly_parallelizable_function_call_3();
    highly_parallelizable_function_call_4();
}
Functions 3 and 4 account for roughly 90% of the runtime. I converted them to OpenMP parallel code, but profiling shows a speedup of only 4-5 on them, so this code might not scale as well as the pure MPI code. I suspect this is due to threading overhead, since a parallel region is opened and closed on every call. To circumvent this, I would like to restructure the code so that the threads are created only once, at the beginning, as follows:
#pragma omp parallel
for (a sequential loop of 10e5 iterations)
{
    parallel_version_function_call_1();
    if (thread_id == 0) some_sequential_work;
    parallel_version_function_call_2();
    if (thread_id == 0) some_sequential_work;
    if (thread_id == 0) MPI_send();
    if (thread_id == 0) MPI_recv();
    parallel_version_function_call_3();
    parallel_version_function_call_4();
}
Would doing something like this be beneficial?
Upvotes: 1
Views: 88
Reputation: 1091
I think that your current implementation does not pay attention to Amdahl's law (google it if you like). Given that you only parallelized 90% of your code, the best possible speedup you can ask for (on 8 cores) is:
Speedup = 1.0 / (p_{seq} + p_{parallel} / #cores),   where p_{seq} = 1 - p_{parallel}
Which in your case is:
Speedup = 1.0 / (0.1 + 0.9 / 8) ≈ 4.71
So your current OpenMP parallelization is doing exactly what would be expected. Long answer short: yes, the latter implementation should give you a better speedup, provided it means that functions 1 and 2 are parallelized as well.
Upvotes: 1