Reputation: 71
I am trying to parallelise these recursive functions with openMP tasks,
when I compile with gcc it runs only on 1 thread. When i compile it with clang it runs on multiple threads
The second function calls the first one which doesn't generate new tasks to stop wasting time.
gcc does work when there is only one function that calls itself.
Why is this?
Am I doing something wrong in the code?
Then why does it work with clang?
I am using gcc 9.3 on windows with Msys2. The code was compiled with -O3 -fopenmp
//the program compiled by gcc only runs on one thread
#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>
using namespace std;
vector<int> vec;
thread_local double steps;
void excalibur(int current_node, int current_depth) {
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
excalibur(i, current_depth);
}
if(current_depth > 0){
int new_depth = current_depth - 1;
#pragma omp simd
for(int i = current_node;i <= vec[current_node];i++){
++steps;
excalibur(i + 1,new_depth);
}
}
}
void mario( int current_node, int current_depth) {
#pragma omp task firstprivate(current_node,current_depth)
{
if(current_depth > 0){
int new_depth = current_depth - 1;
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
}
}
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
excalibur(i, current_depth);
}
}
int main() {
double total = 0;
clock_t tim = clock();
omp_set_dynamic(0);
int nodes = 10;
int timesteps = 3;
omp_set_num_threads(4);
vec.assign( nodes, nodes - 2 );
#pragma omp parallel
{
steps = 0;
#pragma omp single
{
mario(nodes - 1, timesteps - 1);
}
#pragma omp atomic
total += steps;
}
double time_taken = (double)(tim) / CLOCKS_PER_SEC;
cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
return 0;
}
while this works with gcc
#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>
using namespace std;
vector<int> vec;
thread_local double steps;
void mario( int current_node, int current_depth) {
#pragma omp task firstprivate(current_node,current_depth)
{
if(current_depth > 0){
int new_depth = current_depth - 1;
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
}
}
#pragma omp simd
for( int i = 0 ; i < current_node; i++){
++steps;
mario(i, current_depth);
}
}
int main() {
double total = 0;
clock_t tim = clock();
omp_set_dynamic(0);
int nodes = 10;
int timesteps = 3;
omp_set_num_threads(4);
vec.assign( nodes, nodes - 2 );
#pragma omp parallel
{
steps = 0;
#pragma omp single
{
mario(nodes - 1, timesteps - 1);
}
#pragma omp atomic
total += steps;
}
double time_taken = (double)(tim) / CLOCKS_PER_SEC;
cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
return 0;
}
Upvotes: 0
Views: 145
Reputation: 74455
Your program doesn't run in parallel because there is simply nothing to run in parallel. Upon first entry in mario
, current_node
is 9
and vec
is all 8
s, so this loop in the first and only task never executes:
for(int i = current_node;i <= vec[current_node];i++){
++steps;
mario(i + 1,new_depth);
}
Hence, no recursive creation of new tasks. How and what runs in parallel when you compile it with Clang is well beyond me, since when I compile it with Clang 9, the executable behaves exactly the same as the one produced by GCC.
The second code runs in parallel because of the recursive call in the loop after the task
region. But it also isn't a correct OpenMP program - the specification forbids nesting task
regions inside a simd
construct (see under Restrictions here):
- The only OpenMP constructs that can be encountered during execution of a
simd
region are theatomic
construct, theloop
construct, thesimd
construct and theordered
construct with thesimd
clause.
None of the two compilers catches that problem when the nesting is in the dynamic and not in the lexical scope of the simd
construct though.
Edit: I actually looked it a bit closer into it and I may have a suspicion about what might have caused your confusion. I guess you determine if your program works in parallel or not by looking at the CPU utilisation while it runs. This often leads to confusion. The Intel OpenMP runtime that Clang uses has a very aggressive waiting policy. When the parallel region in the main()
function spawns a team of four threads, one of them goes executing mario()
and the other three hit the implicit barrier at the end of the region. There they spin, waiting for new tasks to be eventually assigned to them. They never get one, but keep on spinning anyway, and that's what you see in the CPU utilisation. If you want to replicate the same with GCC, set OMP_WAIT_POLICY
to ACTIVE
and you'll see the CPU usage soar while the program runs. Still, if you profile the program's execution, you'll see that CPU time is spent inside your code in one thread only.
Upvotes: 1