Atharva Dubey

Reputation: 915

Differences between `#pragma omp parallel for collapse` and `#pragma omp parallel for`

Firstly, the question might be slightly misleading: I understand the main differences between a parallel region with the collapse clause and one without. Say I want to transpose a matrix, and there are the following two methods: first, a parallel for with a SIMD directive for the inner loop, and second, a method using the collapse(2) clause.

#pragma omp parallel for
for(int i=0; i<rows; i++){
#pragma omp simd
    for(int j=0; j<columns; j++){
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}
#pragma omp parallel for collapse(2)
for(int i=0; i<rows; i++){
    for(int j=0; j<columns; j++){
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}

Of the two implementations above, which would be more efficient and faster, especially in terms of caching? Is there any way to ascertain that just by looking at the implementations?

And given that all the loop counters are independent of each other, can one set a basic guideline as to when to use which?
TIA

Upvotes: 2

Views: 887

Answers (1)

Jérôme Richard

Reputation: 50488

TL;DR: both implementations are quite inefficient. The second one will likely be slower than the first in practice, although it could theoretically scale better.

The first implementation is unlikely to be vectorized because the accesses are not contiguous in memory: both GCC 10 and Clang 11 generate inefficient code for it. The point is that OpenMP provides no high-level SIMD construct for data transposition! Thus, if you want to do it efficiently, you probably need to get your hands dirty and do it yourself (or use an external library that does it for you).
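As an illustration of what "getting your hands dirty" means here, a 4x4 float tile can be transposed entirely in registers with SSE shuffles (a sketch, not the asker's code; the helper name and stride parameters are mine):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Transpose one 4x4 float tile using SSE shuffles.
// src/dst are row-major; strides are counted in floats.
static void transpose4x4(const float* src, int srcStride,
                         float* dst, int dstStride) {
    __m128 r0 = _mm_loadu_ps(src + 0 * srcStride);
    __m128 r1 = _mm_loadu_ps(src + 1 * srcStride);
    __m128 r2 = _mm_loadu_ps(src + 2 * srcStride);
    __m128 r3 = _mm_loadu_ps(src + 3 * srcStride);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // in-register transpose, no scalar stores
    _mm_storeu_ps(dst + 0 * dstStride, r0);
    _mm_storeu_ps(dst + 1 * dstStride, r1);
    _mm_storeu_ps(dst + 2 * dstStride, r2);
    _mm_storeu_ps(dst + 3 * dstStride, r3);
}
```

A full transposition would call such a kernel per tile; the shuffles replace the strided scalar loads/stores the compilers emit for the naive loop.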

The second implementation could be significantly slower than the first because the loop iterators are linearized, often resulting in more instructions being executed in the hot path. Some implementations (e.g. Clang 11 and ICC 19, but not GCC 10) even use a very slow modulus operation (i.e. a div instruction) to recover the indices, resulting in a much slower loop.
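Roughly, collapse(2) amounts to the sketch below: one linear index split back into (i, j) with an integer divide and modulus. The actual compiler-generated code differs, but this shows where the div in the hot path comes from (function name is mine):

```cpp
// What collapse(2) conceptually turns the loop nest into:
// a single linearized loop over rows*columns iterations.
void transposeLinearized(const float* input, float* output,
                         int rows, int columns) {
    const long total = (long)rows * columns;
    #pragma omp parallel for
    for (long k = 0; k < total; ++k) {
        const int i = (int)(k / columns);  // integer div in the hot path
        const int j = (int)(k % columns);  // modulus (the slow part)
        output[(long)j * rows + i] = input[(long)i * columns + j];
    }
}
```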

The second implementation should, however, theoretically scale better than the first because the collapse clause provides more parallelism. Indeed, in the first implementation there are only rows lines of work to share between n threads. So if you work on massively parallel machines or on wide rectangular matrices, with n not so small compared to rows, this could cause some work imbalance, or even thread starvation.


Why both implementations are inefficient

The two implementations are inefficient because of the memory access pattern. Indeed, on big matrices, the writes to output are not contiguous and will cause many cache misses: a full cache line (64 bytes on most common architectures) is brought in while only a few bytes are written into it. If columns is a power of two, cache thrashing will occur and further decrease performance.

One solution to mitigate these issues is to use tiling. Here is an example:

// Assume rows and columns are nice for sake of clarity ;)
constexpr int tileSize = 8;
assert(rows % tileSize == 0);
assert(columns % tileSize == 0);

// Note the collapse clause is needed here for scalability and 
// the collapse overhead is mitigated by the inner loop.
#pragma omp parallel for collapse(2)
for(int i=0; i<rows; i+=tileSize)
{
    for(int j=0; j<columns; j+=tileSize)
    {
        for(int ti=i; ti<i+tileSize; ++ti)
        {
            for(int tj=j; tj<j+tileSize; ++tj)
            {
                output[tj * rows + ti] = input[ti * columns + tj];
            }
        }
    }
}

The above code should be faster, but it is still not optimal. Writing a truly fast transposition is challenging. Here are some suggestions to improve the code further:

  • use a temporary tile buffer to improve the memory access pattern (so the compiler can use fast SIMD instructions)
  • use square tiles to improve the use of the cache
  • use multi-level tiling to improve the use of the L2/L3 cache or use a Z-tiling approach
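The first suggestion can be sketched as follows (function name is mine; it assumes, like the code above, that rows and columns are multiples of the tile size). Each tile is first gathered into a small contiguous buffer so both the reads from input and the writes to output are contiguous, which compilers can turn into fast SIMD loads/stores:

```cpp
// Tiled transposition with a temporary per-tile buffer.
void transposeTiledBuffered(const float* input, float* output,
                            int rows, int columns) {
    constexpr int T = 8;  // assumes rows % T == 0 and columns % T == 0
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; i += T) {
        for (int j = 0; j < columns; j += T) {
            float tile[T][T];  // small enough to stay in the L1 cache
            // Contiguous reads: walk input rows, scatter into the buffer.
            for (int ti = 0; ti < T; ++ti)
                for (int tj = 0; tj < T; ++tj)
                    tile[tj][ti] = input[(long)(i + ti) * columns + (j + tj)];
            // Contiguous writes: read the buffer linearly into output rows.
            for (int tj = 0; tj < T; ++tj)
                for (int ti = 0; ti < T; ++ti)
                    output[(long)(j + tj) * rows + (i + ti)] = tile[tj][ti];
        }
    }
}
```

The strided accesses now only touch the tiny in-cache buffer, while both large arrays are accessed line by line.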

Alternatively, you can simply use a fast BLAS implementation: some provide quite well-optimized matrix transposition functions (not all do, but AFAIK OpenBLAS and the MKL do).

PS: I assumed matrices are stored in a row-major order.

Upvotes: 4
