MA19

Reputation: 580

Is it normal that each thread in OpenMP does the same amount of work?

For the following code, I have measured the execution time for each thread, and it is odd to me that in all runs, whether I use the static or dynamic schedule, each thread takes nearly exactly the same time. Is this something expected in OpenMP? Do we ever have a situation where one or more threads perform more work? Another thing I do not understand is that the execution time is the same for both the static and dynamic schedules. I'm afraid that maybe I'm measuring time in the wrong way.

#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <omp.h>
#include <fstream>
#include <cfloat>
#include <chrono>
using namespace std;
using namespace chrono; 
int main()
{
    const int N = 100000;
    ofstream result{"Result.txt"};
    vector<vector<double>> c;
    default_random_engine g(0);
    uniform_real_distribution<double> d(0.0, nextafter(1.0, DBL_MAX));
    c.reserve(N);

    for (int i = 0; i < N; i++) {
        const unsigned size = pow(10, i % 4);
        vector<double> a;
        a.reserve(size);

        for (int j = 0; j < size; j++) {
            const double number = d(g);
            a.push_back(number);
        }

        c.push_back(std::move(a));
    }

    double sum = 0.0;
    vector<double> b(N);
    const int total_threads = 4;
    vector<double> time_taken_by_threads(total_threads);
    auto t1 = high_resolution_clock::now();
    
    #pragma omp parallel num_threads(total_threads) firstprivate(N) shared(b, c, sum)
    {
        int threadID = omp_get_thread_num();
        double start = omp_get_wtime();

        #pragma omp for reduction(+:sum) schedule(dynamic)
        for (int i = 0; i < N ; i++) {
            double sumLocal = 0.0;

            for (int j = 0; j < c[i].size();j++) {
                sumLocal += pow(c[i][j], 2);
            }

            const double n = sqrt(sumLocal);
            b[i] = n;

            sum += sumLocal;
        }
        
      
        double end = omp_get_wtime();
        time_taken_by_threads[threadID] = end - start;
    }

    auto t2 = high_resolution_clock::now();
    auto diff = duration_cast<milliseconds>(t2 - t1);

    cout << "Total time taken: " << diff.count() << " ms" << endl;
    


    for (int i = 0; i < total_threads; i++) {
        cout << "Thread " << i << " took " << time_taken_by_threads[i] << " s" << endl;
    }
}

Upvotes: 1

Views: 361

Answers (1)

dreamcrash

Reputation: 51393

TL;DR: You have an implicit barrier at the end of the #pragma omp for reduction(+:sum) loop.


I'm afraid that maybe I'm measuring time in the wrong way.

Indeed, the measurement will always give similar results, because the #pragma omp for:

    double start = omp_get_wtime();
    #pragma omp for reduction(+:sum) schedule(dynamic)
    for (int i = 0; i < N ; i++) {
        // ....
    }
    // <--- threads will wait here for one another.
    double end = omp_get_wtime();
    time_taken_by_threads[threadID] = end - start;

introduces an implicit barrier at the end of the loop, so the threads that finish first still wait for the ones that have not finished yet. To remove that implicit barrier, you can use the nowait clause:

#pragma omp for reduction(+:sum) schedule(dynamic) nowait

Even though it is not a problem in this code, one needs to be careful when removing the implicit barrier, because doing so might lead to race conditions. So, for future use, you can apply the following pattern to measure the time taken per thread while still avoiding potential race conditions:

    double start = omp_get_wtime();
    // The parallel loop with nowait
    double end = omp_get_wtime();
    #pragma omp barrier
    time_taken_by_threads[threadID] = end - start;
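
Applied to your loop, that pattern could look like this (just a sketch, reusing the variable names from your code):

#pragma omp parallel num_threads(total_threads) shared(b, c, sum, time_taken_by_threads)
{
    const int threadID = omp_get_thread_num();
    const double start = omp_get_wtime();

    // nowait: each thread leaves the loop as soon as its own iterations are done
    #pragma omp for reduction(+:sum) schedule(dynamic) nowait
    for (int i = 0; i < N; i++) {
        double sumLocal = 0.0;
        for (int j = 0; j < c[i].size(); j++)
            sumLocal += pow(c[i][j], 2);
        b[i] = sqrt(sumLocal);
        sum += sumLocal;
    }

    // stop the per-thread timer before re-synchronizing with the other threads
    const double end = omp_get_wtime();
    #pragma omp barrier
    time_taken_by_threads[threadID] = end - start;
}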

Nevertheless, even with that change, the time taken per thread should be approximately the same. I will explain below why that is the case.

For the following code, I have measured the execution time for each thread, and it is odd to me that in all runs, whether I use the static or dynamic schedule, each thread takes nearly exactly the same time. Is this something expected in OpenMP?

When using the static schedule, one can expect OpenMP to divide the loop iterations among the threads as evenly as possible.

From the OpenMP 5.1 standard, one can read the following about the schedule clause of the loop construct:

When kind is static, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. The size of the chunks is unspecified in this case.

In your case, when using the static schedule with the default chunk size, each of the 4 threads will compute 25000 iterations (i.e., 100000/4).
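
If you want to see that division for yourself, here is a small self-contained sketch (the counter array iterations_per_thread is just for illustration); with 100000 iterations and 4 threads it should print 25000 iterations for every thread:

#include <iostream>
#include <omp.h>

int main() {
    const int N = 100000;
    const int total_threads = 4;
    int iterations_per_thread[total_threads] = {0};

    #pragma omp parallel num_threads(total_threads)
    {
        const int threadID = omp_get_thread_num();

        // schedule(static) with no chunk size: each thread gets one contiguous
        // block of roughly N / total_threads iterations
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            iterations_per_thread[threadID]++;
    }

    for (int t = 0; t < total_threads; t++)
        std::cout << "Thread " << t << " executed "
                  << iterations_per_thread[t] << " iterations" << std::endl;
}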

If we analyze the parallel loop:

#pragma omp for reduction(+:sum) schedule(static)
for (int i = 0; i < N ; i++) {
    double sumLocal = 0.0;

    for (int j = 0; j < c[i].size();j++) {
        sumLocal += pow(c[i][j], 2);
    }

    const double n = sqrt(sumLocal);
    b[i] = n;

    sum += sumLocal;
}

we can see that the computation is mostly CPU-bound and that the amount of work per iteration follows a short repeating pattern (the inner vectors cycle through sizes of 1, 10, 100, and 1000 elements), so any large contiguous block of iterations contains the same mix of cheap and expensive iterations. It is therefore to be expected that each thread takes approximately the same amount of time.

Regarding the dynamic schedule, the OpenMP 5.1 standard says:

When kind is dynamic, the iterations are distributed to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.

So, because the chunk size defaults to 1 and every thread requests a new iteration as soon as it finishes its current one, the work balances itself out across the threads, and one can again expect them to take approximately the same amount of time.
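
As a side note, if you ever want larger chunks with the dynamic schedule, you can pass an explicit chunk size (100 below is just an example value):

// each scheduling request now hands out 100 consecutive iterations instead of 1,
// which reduces the scheduling overhead when the iterations are cheap
#pragma omp for reduction(+:sum) schedule(dynamic, 100)
for (int i = 0; i < N ; i++) {
    // ....
}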

Do we ever have a situation where one or more threads perform more work?

Sure, you just need to create a situation that leads to load imbalance, for example:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    for (int k = 0; k < i; k++) {
        // some computation
    }
}

If you look carefully, you can see that the amount of inner-loop work grows with i, giving a triangular shape (an x marks an inner iteration k that is executed for a given outer iteration i):

 k\i  0 1 2 3 4 5 ... N-1
   0  - x x x x x ... x
   1  - - x x x x ... x
   2  - - - x x x ... x
   3  - - - - x x ... x
   4  - - - - - x ... x
   5  - - - - - - ... x
   .  - - - - - - ... x
   .  - - - - - - ... x
 N-2  - - - - - - ... x
 N-1  - - - - - - ... -

So, for 4 threads and an N such that N % 4 == 0, the first thread is assigned the first N/4 iterations of the loop, the second thread the next N/4, and so on. Consequently, the first thread computes the outer-loop iterations that have the fewest inner-loop iterations, which leads to load imbalance and, eventually, to bigger differences between the times the threads take to finish their parallel work.

You can emulate that scenario in your code as follows:

#pragma omp for reduction(+:sum) schedule(static) nowait
for (int i = 0; i < N ; i++) {
    double sumLocal = 0.0;

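    // with j starting at i, the amount of inner work depends on i,
    // which is what creates the load imbalance between the threads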
    for (int j = i; j < c[i].size();j++) {
        sumLocal += pow(c[i][j], 2);
    }
    const double n = sqrt(sumLocal);
    b[i] = n;

    sum += sumLocal;
}
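
For completeness, here is a minimal self-contained sketch of the triangular case with per-thread timing (the loop bound and the dummy accumulator sink are arbitrary choices for illustration); with schedule(static) the last thread should report a noticeably larger time than the first one:

#include <iostream>
#include <omp.h>

int main() {
    const int N = 20000;
    const int total_threads = 4;
    double time_taken_by_threads[total_threads];
    double sink = 0.0;  // keeps the compiler from optimizing the inner loop away

    #pragma omp parallel num_threads(total_threads)
    {
        const int threadID = omp_get_thread_num();
        const double start = omp_get_wtime();

        // static schedule: the first thread gets the outer iterations with the
        // least inner work, the last thread gets the ones with the most
        #pragma omp for reduction(+:sink) schedule(static) nowait
        for (int i = 0; i < N; i++)
            for (int k = 0; k < i; k++)
                sink += k * 1e-9;

        const double end = omp_get_wtime();
        #pragma omp barrier
        time_taken_by_threads[threadID] = end - start;
    }

    for (int t = 0; t < total_threads; t++)
        std::cout << "Thread " << t << " took " << time_taken_by_threads[t] << " s" << std::endl;
    std::cout << "checksum: " << sink << std::endl;
}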

Another thing I do not understand is that the execution time is the same for both the static and dynamic schedules.

As explained above, that is to be expected, given the nature of the parallel work assigned to each thread.

Upvotes: 2
