Suslik

Reputation: 1071

Is it better to use the collapse clause?

I am never sure which possibility I should choose to parallelize nested for loops.

For example I have the following code snippet:

#pragma omp parallel for schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];


#pragma omp parallel for collapse(2) schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];

In the first snippet I use a plain parallel for (with schedule(static) because of the first-touch policy). In some codes I see people use mostly the collapse clause to parallelize nested for loops; in other codes it is never used, and the nested loops are instead parallelized with a simple parallel for. Is this more a matter of habit, or is there a real difference between the two versions? Is there a reason some people never use collapse(n)?

Upvotes: 1

Views: 630

Answers (1)

Jim Cownie

Reputation: 2869

As with everything in HPC, the answer is "It depends..."

Here it will depend on

  1. How big your machine is and how big "bSize" and "N" are
  2. What the content of the inner loop is

For static scheduling of iterations which all take the same amount of time, unless you can guarantee that the number of iterations being work-shared is divisible by the number of threads, you need the number of available iterations to be ~10x the number of threads to guarantee 90% efficiency, because of the potential imbalance. Therefore, on a 16-core machine you want >160 iterations. If "bSize" is small, then using collapse to generate more available parallelism will help performance. (In the worst case, imagine that "bSize" is smaller than the number of threads!)

On the other hand, as @tim18 points out, if you can vectorize the inner loop while still maintaining enough parallelism, that may be a better thing to do.

On the third hand, there is nothing to stop you from doing both:

#pragma omp parallel for simd collapse(2)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];

If your inner loop really is this small (and vectorizable) then you certainly want to vectorize it, since, unlike parallelism, vectorization can reduce the total CPU time you use, rather than just moving it between cores.

Upvotes: 3
