OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

Question

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didn't work, so I want to set up a conditional to decide whether to parallelise the outer or the inner loop, depending on how many outer iterations there are.

Here is the code:

int N_OUTER; // typically 1-8
int N_INNER;  // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting
std::vector seeds; // vector with initial solutions
std::vector sols (N_OUTER*N_INNER); // vector for output solutions

#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}

This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what the best value for PAR_THRESH is.

My intuition says that if N_OUTER is 1, then the outer loop shouldn't be parallelised, and if N_OUTER is greater than the number of threads available, then the outer loop should be the one to be parallelised, because that uses the maximum available threads and the threads are as long as possible. My question is about when N_OUTER is either 2 or 3 (4 being the number of threads available).

Is it better to run, say, 2 or 3 long threads in parallel without using all of the available threads? Or is it better to run the 2 or 3 outer loop iterations in serial, while utilising the maximum number of threads for the inner loop?

Or is there a kind of trade off in play, and maybe 2 outer loop iterations might be wasting threads, but if there are 3 outer loop iterations, then having longer threads is beneficial, despite the fact that one thread is remaining unused?

EDIT:

edited code to replace N_ITER with N_INNER in two places

cocsackie · Accepted Answer

I don't have much experience with OpenMP, but I found the collapse clause:

https://software.intel.com/en-us/articles/openmp-loop-collapse-directive

Understanding the collapse clause in openmp

It seems to be even more appropriate when the number of inner loop iterations differs.
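As a rough illustration of what collapse would look like on the loops from the question, here is a minimal sketch. The element type double, the body of solve and the name solve_all_collapsed are assumptions made up for the example, since the question does not show them. collapse(2) needs the two loops to be perfectly nested, with nothing between the two for statements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's solve(); the real one is not shown.
double solve(double seed) { return seed * 2.0 + 1.0; }

// Runs solve() for every (outer, inner) pair. collapse(2) merges both loops
// into one iteration space of n_outer * n_inner iterations, which OpenMP
// then divides among the threads.
std::vector<double> solve_all_collapsed(const std::vector<double>& seeds, int n_inner) {
    assert(n_inner >= 0);
    const int n_outer = static_cast<int>(seeds.size());
    std::vector<double> sols(static_cast<std::size_t>(n_outer) * n_inner);

    #pragma omp parallel for collapse(2)
    for (int outer = 0; outer < n_outer; ++outer) {
        for (int inner = 0; inner < n_inner; ++inner) {
            // Each iteration writes its own element, so there is no data race.
            sols[static_cast<std::size_t>(outer) * n_inner + inner] = solve(seeds[outer]);
        }
    }
    return sols;
}
```

With 4 threads, N_OUTER = 3 and, say, N_INNER = 100, parallelising only the outer loop gives three threads 100 calls each and leaves the fourth idle. The collapsed loop has 300 iterations, so an even split gives each thread 75 calls, assuming every call to solve costs about the same. That would also make PAR_THRESH unnecessary.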

--

On the other hand:

It seems to me that solve(...) is side-effect free. It seems also that N_ITER is N_INNER.

Currently solve is called N_INNER*N_OUTER times. Reducing that won't change the big-O complexity, but assuming solve has a very large constant factor, it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:

std::vector sols_tmp (N_OUTER); // one cached result per seed
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) { 
    sols_tmp[i] = solve(seeds[i]);
}

This calls solve only N_OUTER times.

Because solve returns the same value for every element of a row, the results can then be copied into place:

#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
    sols[i] = sols_tmp[i/N_INNER];
}

Of course, you have to measure whether parallelising these loops is actually worthwhile.
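One way to take that measurement is sketched below, under some assumptions: the element type is double, and the names seconds_taken and expand_rows are invented for this example. The copy loop is wrapped in a function whose OpenMP if clause switches threading on or off at run time, so both variants can be timed with the same code.

```cpp
#include <cassert>
#include <chrono>
#include <utility>
#include <vector>

// Returns the wall-clock time, in seconds, that one call to work() takes.
template <typename Work>
double seconds_taken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Copies each cached per-seed result into its n_inner slots of the output.
// The `if` clause enables the parallel region only when use_threads is true.
std::vector<double> expand_rows(const std::vector<double>& per_seed, int n_inner, bool use_threads) {
    assert(n_inner > 0);
    const int total = static_cast<int>(per_seed.size()) * n_inner;
    std::vector<double> sols(total);

    #pragma omp parallel for if (use_threads)
    for (int i = 0; i < total; ++i) {
        sols[i] = per_seed[i / n_inner];
    }
    return sols;
}
```

Timing seconds_taken([&] { expand_rows(per_seed, N_INNER, false); }) against the same call with true, repeated a few times on realistic sizes, should show whether the threads pay for themselves. For a plain copy like this the thread start-up cost may well dominate, so the serial version can win.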
