Reputation: 2594
I am trying to make tree operations like summing up numbers in all the leaves in a tree work in parallel using OpenMP. The problem I encounter is that the tree I work on is unbalanced (number of children vary and then how big branches are vary as well).
I currently have recursive functions working on those trees. What I am trying to achieve is this:
1)Split the threads at first possible opportunity, say it's a node with 2 children
2)Continue splitting from both resulting threads for at least 2-3 levels so all the threads are at work
It would look like this:
if (node->depth <= 3) {
#pragma omp parallel
{
#pragma omp schedule(dynamic)
for (int i = 0; i < node->children_no; i++) {
int local_sum;
local_sum = sum_numbers(node->children[i])
#pragma omp critical
{
global_sum += local_sum;
}
}
}
} else {
/*run the for loop without parallel region*/
}
The problem here is that when I allow nested parallelism it seems OpenMP creates a lot of threads in new teams. What I would like to achieve is this:
1)Every thread creating a new team can't take more threads than MAX_THREADS
2)Once a for loop is over in one subtree the others still working for loops in bigger subtrees take over the now idle threads to finish their job faster
That way I hope there is never more threads than necessary but they are all working all the time as long as there are more unfinished tasks in all for loops combined than created threads.
From the docs it looks like parallel for uses only threads already created in parallel region. Is it possible to make it work as described or do I need to change the implementation to list the tasks form various branches first and then run parallel for loop over that list?
Upvotes: 4
Views: 628
Reputation: 3180
Just for the record, I'll write an answer to this question based on High Performance Mark's comment (a comment on which I agree, too). The usage of OpenMP tasks here will add flexibility to the parallelism even if the tree is unbalanced, support recursivity and spawn enough work for all the threads (despite you should explore this using tools such as Vampir, Paraver and/or HPCToolkit).
The resulting code could look like
if (node->depth <= 3) {
#pragma omp parallel shared (global_sum)
{
for (int i = 0; i < node->children_no; i++) {
int local_sum;
#pragma omp single
#pragma omp task
{
local_sum = sum_numbers(node->children[i])
#pragma omp critical
global_sum += local_sum;
}
}
}
} else {
/*run the for loop without parallel region*/
}
Upvotes: 2