Jiseong

Reputation: 89

Difference between the several ways to parallelize nested for loops in C, C++ using OpenMP

I've just started studying parallel programming with OpenMP, and there is a subtle point about nested loops. I wrote a simple matrix multiplication code and verified that the result is correct. But there are several ways to parallelize this loop nest, which may differ in low-level detail, and I want to ask about them.

At first, I wrote the code below, which multiplies two matrices A and B and assigns the result to C.

for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
#pragma omp parallel for reduction(+:sum)
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}

It works, but it takes a really long time. I found out that, because of the location of the parallel directive, it constructs the parallel region N² times. I noticed this from the huge increase in user time when I used the Linux time command.
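For reference, this is roughly how the elapsed time could also be measured from inside the program with omp_get_wtime() instead of the time command (just a sketch, the loop body is omitted):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double start = omp_get_wtime();   /* wall-clock time before the loops */

    /* ... matrix multiplication loops go here ... */

    double end = omp_get_wtime();     /* wall-clock time after the loops */
    printf("elapsed: %f s\n", end - start);
    return 0;
}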

Next, I tried the code below, which also worked.

#pragma omp parallel for private(i, j, k, sum)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}

The elapsed time decreased from 72.720s for sequential execution to 5.782s for parallel execution with the code above, which is a reasonable result since I ran it on 16 cores.

But the flow of the second code is not easy to picture in my mind. I know that if we privatize all the loop variables, the program treats the nested loop as one large loop of size N³. This can easily be checked by running the code below.

#pragma omp parallel for private(i, j, k)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        for(k = 0; k < N; k++)
        {
            printf("%d, %d, %d\n", i, j, k);
        }
    }
}

The printf was executed N³ times.

But in my second matrix multiplication code, sum is set right before the innermost loop and used right after it, and that makes it hard for me to unfold the loop in my mind. The third code I wrote is easy to unfold in my mind.

To summarize, I want to know what really happens behind the scenes in my second matrix multiplication code, especially with regard to the value of sum. I would also be thankful for recommendations of tools to observe the flow of a multithreaded program written with OpenMP.
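For instance, something like the sketch below is what I imagine could make the distribution of iterations visible, by printing omp_get_thread_num() (N is just a small placeholder value here):

#include <omp.h>
#include <stdio.h>

#define N 4   /* small placeholder size, just for observation */

int main(void)
{
#pragma omp parallel for
    for(int i = 0; i < N; i++)
    {
        for(int j = 0; j < N; j++)
        {
            /* shows which thread handles which (i, j) pair */
            printf("thread %d: i = %d, j = %d\n", omp_get_thread_num(), i, j);
        }
    }
    return 0;
}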

Upvotes: 2

Views: 622

Answers (1)

Zulan

Reputation: 22670

omp for by default applies only to the directly following loop. The inner loops are not affected at all. This means you can think about your second version like this:

// Example for two threads
with one thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = 0; i < N / 2; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
with the other thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = N / 2; i < N; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

You can simplify all reasoning about variables with OpenMP by declaring them as locally as possible. I.e. instead of the explicit private declarations, use:

#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        int sum = 0;
        for(int k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}

This way you can see the private scope of the variables more easily.

In some cases it can be beneficial to apply parallelism to multiple loops. This is done by using collapse, i.e.

#pragma omp parallel for collapse(2)
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)

You can imagine this works with a transformation like:

#pragma omp parallel for
for (int ij = 0; ij < N * N; ij++)
{
    int i = ij / N;
    int j = ij % N;

A collapse(3) would not work for this loop nest because of the sum = 0 between the j and k loops, which makes the loops not perfectly nested.
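Just to illustrate the constraint, a perfectly nested variant that collapse(3) would accept has to get rid of sum, for instance by accumulating directly into C. This is only a sketch of the idea, not a recommendation, since the atomic update makes it slow and it requires C to be zeroed beforehand:

#pragma omp parallel for collapse(3)
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        for(int k = 0; k < N; k++)
        {
            // no statements between the loop headers, so collapse(3) is allowed;
            // the atomic protects C[i][j] because k iterations are now shared
            #pragma omp atomic
            C[i][j] += A[i][k]*B[k][j];
        }
    }
}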

Now there is one more detail:

#pragma omp parallel for

is a shorthand for

#pragma omp parallel
#pragma omp for

The first creates the threads; the second shares the work of a loop among all threads that reach this point. This may not be important for understanding right now, but there are use cases where it matters. For instance, you could write:

#pragma omp parallel
for(int i = 0; i < N; i++)
{
    #pragma omp for
    for(int j = 0; j < N; j++)
    {

Here every thread executes the whole outer i loop redundantly, while for each i the iterations of the inner j loop are shared among the threads. I hope this sheds some light on what happens there from a logical point of view.

Upvotes: 3
