Reputation: 4104
I am trying out OpenMP parallelism to multiply two matrices using 2 threads. I understand how parallelizing only the outer loop works (i.e. without the collapse(2) clause).
Now, using collapse:
#pragma omp parallel for collapse(2) num_threads(2) private(s, k)
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
    {
        s = 0;
        for (k = 0; k < p; k++)
            s += A[i][k] * B[k][j];
        C[i][j] = s;
    }
From what I gather, collapse "collapses" the loops into a single big loop and then distributes the iterations of that big loop among the threads. So, for the previous code, I think it would be equivalent to something like this:
#pragma omp parallel for num_threads(2) private(i, j, s, k)
for (ij = 0; ij < n*m; ij++)
{
    i = ij / n;
    j = ij % n;
    s = 0;
    for (k = 0; k < p; k++)
        s += A[i][k] * B[k][j];
    C[i][j] = s;
}
My questions are:
PS: Now that I am thinking about it a little more: if n is an odd number, say 3, then without the collapse one thread would get two iterations and the other only one. That results in uneven work for the threads and is a bit less efficient.
If we were to use my collapsed equivalent (if that is indeed how collapse works), each thread would get "1.5" iterations. If n were very large, that would not really matter, would it? Not to mention that computing i = ij/n; j = ij % n; on every iteration hurts performance, doesn't it?
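For what it's worth, a small test like this one (assuming the default schedule) should show which thread handles which iterations in both variants; the output order will of course vary:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int m = 3, n = 3;

    printf("outer loop only:\n");
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < m; i++)
        printf("  thread %d does row i=%d\n", omp_get_thread_num(), i);

    printf("collapse(2):\n");
    #pragma omp parallel for collapse(2) num_threads(2)
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            printf("  thread %d does (i=%d, j=%d)\n", omp_get_thread_num(), i, j);

    return 0;
}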
Upvotes: 4
Views: 4960
Reputation: 22670
The exact behavior is not specified by the standard itself. However, the standard does require that the inner loop has exactly the same number of iterations for each iteration of the outer loop (i.e. a rectangular iteration space). This allows the following transformation:
#pragma omp parallel
{
    int iter_total = m * n;
    int iter_per_thread = 1 + (iter_total - 1) / omp_get_num_threads(); // ceil
    int iter_start = iter_per_thread * omp_get_thread_num();
    int iter_end = iter_start + iter_per_thread;
    if (iter_end > iter_total)
        iter_end = iter_total; // min(iter_start + iter_per_thread, iter_total)

    int ij = iter_start;
    // the division/modulo is needed only once, to find the starting (i, j)
    for (int i = iter_start / n; ; i++) {
        for (int j = (ij == iter_start ? iter_start % n : 0); j < n; j++) {
            // normal loop body
            ij++; // position in the collapsed iteration space
            if (ij == iter_end) {
                goto end;
            }
        }
    }
end:;
}
From skimming the disassembly, I believe this is similar to what GCC does. It avoids the per-iteration division/modulo, but costs one extra register and one addition per inner iteration. Of course it will vary for different scheduling strategies.
Collapsing loops does increase the number of loop iterations that can be assigned to threads, thus helping with load balance or even exposing enough parallel work in the first place.
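For instance (a minimal sketch of my own, not taken from the question): with only two outer iterations and eight threads, the outer loop alone can keep at most two threads busy, while collapsing hands the runtime 2 * 1000 iterations to distribute:
// Hypothetical example: without collapse only 2 of the 8 threads would get an
// outer iteration; collapse(2) exposes 2 * 1000 iterations instead.
void scale_rows(double A[2][1000], double factor)
{
    #pragma omp parallel for collapse(2) num_threads(8)
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 1000; j++)
            A[i][j] *= factor;
}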
Upvotes: 2
Reputation: 74445
The OpenMP specification says just (page 58 of Version 4.5):
If a collapse clause is specified with a parameter value greater than 1, then the iterations of the associated loops to which the clause applies are collapsed into one larger iteration space that is then divided according to the schedule clause. The sequential execution of the iterations in these associated loops determines the order of the iterations in the collapsed iteration space.
So, basically your logic is correct, except that your code is equivalent to the schedule(static,1) collapse(2) case, i.e. an iteration chunk size of 1. In the general case, most OpenMP runtimes have a default schedule of schedule(static), which means that the chunk size will be (approximately) equal to the number of iterations divided by the number of threads. The compiler may then use some optimisation to implement it, e.g. by running a partial inner loop for a fixed value of the outer loop variable, then an integer number of outer iterations with complete inner loops, then a partial inner loop again.
For example, the following code:
#pragma omp parallel for collapse(2)
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 100; j++)
        a[100*i+j] = i+j;
gets transformed by the OpenMP engine of GCC into:
<bb 3>:
i = 0;
j = 0;
D.1626 = __builtin_GOMP_loop_static_start (0, 10000, 1, 0, &.istart0.3, &.iend0.4);
if (D.1626 != 0)
goto <bb 8>;
else
goto <bb 5>;
<bb 8>:
.iter.1 = .istart0.3;
.iend0.5 = .iend0.4;
.tem.6 = .iter.1;
D.1630 = .tem.6 % 100;
j = (int) D.1630;
.tem.6 = .tem.6 / 100;
D.1631 = .tem.6 % 100;
i = (int) D.1631;
<bb 4>:
D.1632 = i * 100;
D.1633 = D.1632 + j;
D.1634 = (long unsigned int) D.1633;
D.1635 = D.1634 * 4;
D.1636 = .omp_data_i->a;
D.1637 = D.1636 + D.1635;
D.1638 = i + j;
*D.1637 = D.1638;
.iter.1 = .iter.1 + 1;
if (.iter.1 < .iend0.5)
goto <bb 10>;
else
goto <bb 9>;
<bb 9>:
D.1639 = __builtin_GOMP_loop_static_next (&.istart0.3, &.iend0.4);
if (D.1639 != 0)
goto <bb 8>;
else
goto <bb 5>;
<bb 10>:
j = j + 1;
if (j <= 99)
goto <bb 4>;
else
goto <bb 11>;
<bb 11>:
j = 0;
i = i + 1;
goto <bb 4>;
<bb 5>:
__builtin_GOMP_loop_end_nowait ();
<bb 6>:
This is a C-like representation of the program's abstract syntax tree, which is probably a bit hard to read, but what it does is use modulo arithmetic only once to compute the initial values of i and j based on the start of the iteration block (.istart0.3) determined by the call to GOMP_loop_static_start(). Then it simply increments i and j as one would expect a loop nest to be implemented, i.e. it increases j until it hits 100, then resets j to 0 and increases i. At the same time, it also keeps the current iteration number from the collapsed iteration space in .iter.1, essentially iterating over the single collapsed loop and the two nested loops at the same time.
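In plain C, the per-thread logic corresponds roughly to the following hand-written sketch (my own rendition of the idea, not actual GCC output; [istart, iend) stands for the chunk of the collapsed iteration space that the runtime hands to the thread):
/* Sketch of the per-chunk code: the division/modulo recovers the starting
   (i, j) once, after which both the nested indices and the collapsed
   counter are advanced by simple additions. */
void do_chunk(int *a, long istart, long iend)
{
    long iter = istart;
    int j = (int)(iter % 100);
    int i = (int)(iter / 100 % 100);

    while (iter < iend) {
        a[100*i + j] = i + j; // the original loop body

        iter++;               // collapsed-space counter (.iter.1)
        j++;                  // nested indices advanced without division
        if (j == 100) {
            j = 0;
            i++;
        }
    }
}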
As to the case when the number of threads does not divide the number of iterations, the OpenMP standard says:
When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. The size of the chunks is unspecified in this case.
The GCC implementation leaves the threads with the highest IDs doing one iteration less. Other possible distribution strategies are outlined in the note on page 61; the list is by no means exhaustive.
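For illustration, that static distribution boils down to something like the following sketch (my own code, not libgomp's): the remainder r goes to the lowest-numbered threads, so the highest-numbered ones end up with one iteration less. For 10000 iterations and 3 threads it yields chunks of 3334, 3333 and 3333 iterations.
void chunk_bounds(long total, int nthreads, int tid, long *start, long *end)
{
    long q = total / nthreads; // base chunk size
    long r = total % nthreads; // first r threads get one extra iteration

    *start = tid * q + (tid < r ? tid : r);
    *end   = *start + q + (tid < r ? 1 : 0);
}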
Upvotes: 3