Reputation: 11
I used the following OpenMP code:
omp_set_num_threads(6);
#pragma omp parallel for
for(int i = 0; i < NUMS; ++i){
    printf("id is %3d thread is %d\n", i, omp_get_thread_num());
}
I read (in a blog post) that each thread is allocated an approximately equal number of iterations, and that when the number of iterations is not divisible by the number of threads, the per-thread count is rounded up.
Well, I first set NUMS=17, and the result is as follows:
id is 12 thread is 4
id is 13 thread is 4
id is 14 thread is 4
id is 9 thread is 3
id is 10 thread is 3
id is 11 thread is 3
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 6 thread is 2
id is 7 thread is 2
id is 8 thread is 2
id is 15 thread is 5
id is 16 thread is 5
id is 3 thread is 1
id is 4 thread is 1
id is 5 thread is 1
As can be seen, ⌈17/6⌉ = 3 (rounding up; forgive me, I don't know how to insert LaTeX formulas here), so the result is as expected.
However, if I set NUMS=19, then by the same rounding up, ⌈19/6⌉ = 4, so each thread should be allocated 4 iterations. However:
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 3 thread is 0
id is 10 thread is 3
id is 11 thread is 3
id is 12 thread is 3
id is 7 thread is 2
id is 8 thread is 2
id is 9 thread is 2
id is 13 thread is 4
id is 14 thread is 4
id is 15 thread is 4
id is 4 thread is 1
id is 5 thread is 1
id is 6 thread is 1
id is 16 thread is 5
id is 17 thread is 5
id is 18 thread is 5
As you can see, only the first thread is assigned 4 iterations; the rest get 3.
So I can't figure it out: what exactly is the reason for this? What is OpenMP's default loop scheduling?
Upvotes: 0
Views: 327
Reputation: 2859
Summarising all the comments to create an answer (so thanks to all who commented).
First, what the standard says/requires :-
schedule(static)
with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.
Second, what happens in reality at present :-
Compilers default to using schedule(static)
when no schedule is specified. (Though schedule(nonmonotonic:dynamic)
might, now, be a better choice.)
At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows (the critical part is myCount
; the rest is just there to test it and reproduce your two cases).
#
# Show number of iterations allocated to each thread
# by schedule(static) in the LLVM runtime
#
def myCount(me, numThreads, total):
    base = total // numThreads
    remainder = total % numThreads
    return base + 1 if me < remainder else base

def test(numThreads, total):
    print("Threads: ", numThreads, " Iterations: ", total)
    allocated = 0
    for thread in range(0, numThreads):
        mine = myCount(thread, numThreads, total)
        allocated += mine
        print(thread, ": ", mine)
    if allocated != total:
        print("***ERROR*** ", allocated, " allocated, ", total, " requested")

test(6, 17)
test(6, 19)
If you run that you can see the result for your two test cases:-
% python3 static.py
Threads: 6 Iterations: 17
0 : 3
1 : 3
2 : 3
3 : 3
4 : 3
5 : 2
Threads: 6 Iterations: 19
0 : 4
1 : 3
2 : 3
3 : 3
4 : 3
5 : 3
If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation".
p.s. It's worth noting that the schedule shown above cannot be explicitly requested by setting a chunk_size
on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. For example, to allocate 10 iterations to 4 threads: with schedule(static,2)
we'd get (4,2,2,2), and with schedule(static,3) we'd get (3,3,3,1), whereas the scheme above gives (3,3,2,2), which has an imbalance of 1 where each of the explicit chunked schemes has an imbalance of 2.
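To see where those chunked numbers come from, the round-robin dealing that schedule(static, chunk_size) performs can be sketched in a few lines of Python (a sketch of the behaviour the standard describes, not any runtime's actual code; the function name chunked_static is mine):

```python
def chunked_static(numThreads, total, chunk):
    # schedule(static, chunk): iterations are cut into chunks of `chunk`
    # (the last chunk may be smaller), and chunk k goes to thread k % numThreads.
    counts = [0] * numThreads
    for start in range(0, total, chunk):
        owner = (start // chunk) % numThreads
        counts[owner] += min(chunk, total - start)
    return counts

print(chunked_static(4, 10, 2))  # -> [4, 2, 2, 2]
print(chunked_static(4, 10, 3))  # -> [3, 3, 3, 1]
```

Because whole chunks are dealt out, thread 0 picks up the wrap-around chunk in the first case, which is exactly why no explicit chunk_size can reproduce the (3,3,2,2) split above.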
Upvotes: 2