Dongze Yang
Dongze Yang

Reputation: 11

Default Loop Iteration Scheduling in OpenMP

I used the following statement from OpenMP:

omp_set_num_threads(6);
#pragma omp parallel for
for(int i = 0; i < NUMS; ++i){
    printf("id is %3d   thread is %d\n",i, omp_get_thread_num());
}

I found out (in a blog post): each thread will be evenly allocated iterations, and, when the number of threads is not divisible by the number of iterations, it will be rounded up.

Well, I first set NUMS=17, the result is as follows:

id is  12   thread is 4
id is  13   thread is 4
id is  14   thread is 4
id is   9   thread is 3
id is  10   thread is 3
id is  11   thread is 3
id is   0   thread is 0
id is   1   thread is 0
id is   2   thread is 0
id is   6   thread is 2
id is   7   thread is 2
id is   8   thread is 2
id is  15   thread is 5
id is  16   thread is 5
id is   3   thread is 1
id is   4   thread is 1
id is   5   thread is 1

As can be seen, \lceil 17/6 \rceil = 3 (round up, forgive me, I don't know how to insert Latex formulas), the result is as expected.

However, if I set NUMS=19, according to rounding up 19/6 = 4, each thread should be allocated 4 iterations, however:

id is   0   thread is 0
id is   1   thread is 0
id is   2   thread is 0
id is   3   thread is 0
id is  10   thread is 3
id is  11   thread is 3
id is  12   thread is 3
id is   7   thread is 2
id is   8   thread is 2
id is   9   thread is 2
id is  13   thread is 4
id is  14   thread is 4
id is  15   thread is 4
id is   4   thread is 1
id is   5   thread is 1
id is   6   thread is 1
id is  16   thread is 5
id is  17   thread is 5
id is  18   thread is 5

As you can see, only the first one is assigned 4 iterations.

So I can't figure it out now, what exactly is the reason for this? What exactly is OpenMP's default thread scheduling?

Upvotes: 0

Views: 327

Answers (1)

Jim Cownie
Jim Cownie

Reputation: 2859

Summarising all the comments to create an answer (so thanks to all who commented).

First, what the standard says/requires :-

  • It says nothing about which schedule should be used when it is unspecified.
  • schedule(static) with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.

Second, what happens in reality at present :-

  • Compilers default to using schedule(static) when no schedule is specified. (Though schedule(nonmonotonic:dynamic) might, now, be a better choice.)

  • At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows (the critical part is myCount, the rest is just to test it and show your test cases).

    #
    # Show number of iterations allocated to each thread
    # by schedule(static) in the LLVM runtime
    #
    
     def myCount(me, numThreads, total):
         base = total // numThreads
         remainder = total % numThreads
         return base+1 if me < remainder else base
    
     def test(numThreads, total):
         print("Threads: ",numThreads, " Iterations: ",total)
         allocated = 0
         for thread in range(0,numThreads):
             mine = myCount(thread,numThreads,total)
             allocated += mine
             print (thread, ": ", mine)
         if allocated != total:
             print ("***ERROR*** ", allocated," allocated, ", total," requested")
     test(6,17)
     test(6,19)
    

If you run that you can see the result for your two test cases:-

% python3 static.py 
Threads:  6  Iterations:  17
0 :  3
1 :  3
2 :  3
3 :  3
4 :  3
5 :  2
Threads:  6  Iterations:  19
0 :  4
1 :  3
2 :  3
3 :  3
4 :  3
5 :  3

If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation"

p.s. It's worth noting that the schedule shown above cannot be explicitly requested by setting a block_size on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. (E.g. If we try to allocate 10 iterations to 4 threads, if we set block_size(2) we'd get (4,2,2,2) whereas if we set it to 3 we'd get (3,3,3,1), whereas the scheme above gives (3,3,2,2), which has imbalance of 1 whereas the explicit schemes each have imbalance of 2).

Upvotes: 2

Related Questions