LennBo

Reputation: 25

Performance OpenMP collapse vs no collapse for large nested loops

I was wondering if anyone knows a bit more about the performance of the collapse clause for large nested loops? Meaning I would like to compare the pragmas

#pragma omp parallel for private(i,j,k) collapse(3) schedule(static)

and

#pragma omp parallel for private(i,j,k) schedule(static)

for a nested loop construct like

for(int i=0; i<i_max; i++){
  for(int j=0; j<j_max; j++){
    for(int k=0; k<k_max; k++){
      A[i][j][k] = B[i][j][k] + C[i][j][k];
    }
  }
}

where i_max, j_max and k_max are all something like 5 - 10 times larger than the number of threads available.

If I understood the collapse clause correctly, OpenMP would just collapse the 3 loops into one of size i_max*j_max*k_max, and I would assume the best performance when (i_max*j_max*k_max) mod #threads = 0.
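If that is right, collapse(3) would conceptually turn the nest into a single flattened loop; a rough sketch of that equivalent form (the index decomposition here is only my own illustration, not what the compiler necessarily generates):

// Conceptual equivalent of collapse(3): one flattened loop over
// i_max*j_max*k_max iterations, with (i, j, k) recovered from the flat index.
#pragma omp parallel for schedule(static)
for (long idx = 0; idx < (long)i_max * j_max * k_max; idx++) {
    int i = idx / (j_max * k_max);
    int j = (idx / k_max) % j_max;
    int k = idx % k_max;
    A[i][j][k] = B[i][j][k] + C[i][j][k];
}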

Is it right that without the collapse clause OpenMP would only parallelize the i loop? If so, my next assumption would be that the best performance is reached when i_max mod #threads = 0, and I would expect comparable performance for both pragmas.

As you can see, I am pretty much guessing here. Has anyone actually tested the performance of both pragmas for cases like this?

Upvotes: 2

Views: 5120

Answers (1)

Hristo Iliev

Reputation: 74445

When you collapse the loops, OpenMP turns them into one big loop. The iteration space of that loop then gets divided into chunks and those get split among the threads according to the effective loop schedule. Depending on the divisibility of the individual loop iterations by the number of threads, you might end up in a situation where a chunk contains incomplete inner loops. An example case would be when none of i_max, j_max, and k_max is divisible by the number of threads, but i_max * j_max * k_max is. Moreover, different chunks might contain different parts of the incomplete loops. And then this all depends on the run-time configurable number of threads. So the vectoriser might be inhibited as the compiler is unable to reliably model the vectorisation of the loop and evaluate whether it will be beneficial or not. It will also have to create serial loops that handle the case when the number of loop iterations is not divisible by the vector length or the data is not aligned.
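As a toy illustration (the sizes and thread count below are made up, and the code assumes the default static split into one contiguous chunk per thread, which holds here because 1000 is divisible by 8), you can compute where the chunks of a collapsed loop start and end and see that most of them cut through the middle of an inner k loop:

#include <stdio.h>

int main(void) {
    // Assumed example sizes: 10 x 10 x 10 iteration space, 8 threads.
    const int i_max = 10, j_max = 10, k_max = 10, nthreads = 8;
    const int total = i_max * j_max * k_max;   // 1000 collapsed iterations
    const int chunk = total / nthreads;        // 125 per thread under schedule(static)

    for (int t = 0; t < nthreads; t++) {
        int first = t * chunk;
        int last  = first + chunk - 1;
        // Decompose the flat indices back into (i, j, k) to show the chunk boundaries.
        printf("thread %d: (%d,%d,%d) .. (%d,%d,%d)\n", t,
               first / (j_max * k_max), (first / k_max) % j_max, first % k_max,
               last  / (j_max * k_max), (last  / k_max) % j_max, last  % k_max);
    }
    return 0;
}

Since 125 is not a multiple of k_max = 10, each chunk begins and ends partway through a k loop, which is exactly the situation that makes the compiler's vectorisation analysis harder.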

When only the outer loop is parallel, the compiler is free to transform the inner loops as it sees fit, e.g., it could safely vectorise those loops. Whether that will be faster than the previous case is not clear. Vectorisation improves computational performance, but it also puts more pressure on the memory subsystem. The ratio of those determines if there will be a benefit.
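For reference, the outer-loop-only variant keeps the inner loops intact for the compiler; as a sketch, one can also make the vectorisation intent explicit with omp simd on the innermost loop (not required, the compiler may well do it on its own):

#pragma omp parallel for schedule(static)
for (int i = 0; i < i_max; i++) {
    for (int j = 0; j < j_max; j++) {
        // The k loop runs over contiguous memory, so it is a natural
        // vectorisation candidate; omp simd only makes that intent explicit.
        #pragma omp simd
        for (int k = 0; k < k_max; k++)
            A[i][j][k] = B[i][j][k] + C[i][j][k];
    }
}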

On the other hand, assuming that A, B, and C are all i_max x j_max x k_max, a really smart compiler will notice that you are iterating over all possible indices and essentially adding two 1-D vectors element-wise, and turn the collapsed loop into something like

#pragma omp parallel for schedule(static)
for (int z = 0; z < i_max * j_max * k_max; z++)
   A_lin[z] = B_lin[z] + C_lin[z];

where X_lin[] is the linearised 1-D view of the data behind X[][][]. There is great potential for vectorisation here, so it really depends on how much analysis the compiler is able to perform.
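A minimal sketch of that linearised form, assuming A, B, and C are contiguous static arrays (e.g. declared as double A[I][J][K]), so that &A[0][0][0] really does view all i_max*j_max*k_max elements as one flat array:

// Assumption: the arrays are contiguous in memory, so a single pointer
// can walk over all elements of the 3-D array in row-major order.
double *A_lin = &A[0][0][0];
const double *B_lin = &B[0][0][0];
const double *C_lin = &C[0][0][0];

#pragma omp parallel for schedule(static)
for (long z = 0; z < (long)i_max * j_max * k_max; z++)
    A_lin[z] = B_lin[z] + C_lin[z];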

There is no silver bullet solution and no single algorithm that performs equally well across many types of hardware. That's why OpenMP provides a lot of tunables that can be set via environment variables. Also, when comparing performance on a server CPU and a desktop CPU, keep in mind that server CPUs usually have much larger last-level caches and more memory channels with higher main-memory bandwidth, so vectorised code performs better on large amounts of data.

Upvotes: 8
