Suslik

Reputation: 1071

Is it better to use the collapse clause?

I am never sure which possibility I should choose to parallelize nested for loops.

For example I have the following code snippet:

#pragma omp parallel for schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];


#pragma omp parallel for collapse(2) schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];

In the first snippet I use a plain parallel for (with schedule(static) because of the first-touch policy). In some codes I see people use mostly the collapse clause to parallelize nested for loops; in other codes it is never used, and the nested loops are instead parallelized with a simple parallel for. Is this more a matter of habit, or is there a real difference between the two versions? Is there a reason some people never use collapse(n)?

Upvotes: 1

Views: 630

Answers (1)

Jim Cownie

Reputation: 2869

As with everything in HPC, the answer is "It depends..."

Here it will depend on

  1. How big your machine is and how big "bSize" and "N" are
  2. What the content of the inner loop is

For static scheduling of iterations which all take the same amount of time, unless you can guarantee that the number of iterations being work-shared is divisible by the number of threads, you need the number of available iterations to be ~10x the number of threads to guarantee 90% efficiency, because of the potential imbalance. Therefore, on a 16-core machine you want >160 iterations. If "bSize" is small, then using collapse to generate more available parallelism will help performance. (In the worst case, imagine that "bSize" is smaller than the number of threads!)

On the other hand, as @tim18 points out, if you can vectorize the inner loop while still maintaining enough parallelism, that may be a better thing to do.

On the third hand, there is nothing to stop you from doing both:

#pragma omp parallel for simd collapse(2)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];

If your inner loop really is this small (and vectorizable) then you certainly want to vectorize it, since, unlike parallelism, vectorization can reduce the total CPU time you use, rather than just moving it between cores.

Upvotes: 3
