Reputation: 161
I am writing a parallel program that generates combinations in boolean vector form. Since the combinations are generated by separate threads, I combine them into a single set after every thread has finished its job.
set<vector<bool>> Ci[T.size()];
// generate combinations in parallel; thread i adds its results to Ci[i]
start_time = omp_get_wtime();
set<vector<bool>> C;
for (int i = 0; i < T.size(); i++) {
    cout << "i=" << i << "\t" << Ci[i].size() << endl;
    C.insert(Ci[i].begin(), Ci[i].end()); // combine the per-thread sets into a single set
}
end_time = omp_get_wtime();
total_time = end_time - start_time;
cout << " Total time:" << total_time << endl;
cout << C.size();
The total number of threads, T.size(), will vary, but the total number of combinations (C's size) is always the same. For example:
Enter threads:10
i=0 1
i=1 144
i=2 4320
i=3 47040
i=4 229320
i=5 550368
i=6 672672
i=7 411840
i=8 115830
i=9 11440
Total time:36.641s
C's size :2042975
Enter threads:128
i=0 9
i=1 45
i=2 165
...
i=120 10
i=121 11
i=122 12
i=123 13
i=124 14
i=125 15
i=126 16
i=127 18
Total time:6.432s
C's size :2042975
What I don't understand is that in both cases I am inserting the same number of combinations into C. Why is the time taken not the same?
Upvotes: 3
Views: 119
Reputation: 23497
We don't see all the necessary information, but I suspect the problem is caused by the uneven distribution of set sizes. The complexity of a range insertion is O(n log(m + n)), where n is the size of the set being inserted (Ci[i]) and m is the size of the set it is inserted into (C).
Now, consider that both sets together have 2^24 elements.
Case 1: n=2^16, m=2^24-2^16
Here, the term inside the big O becomes 2^16 * log(2^24 - 2^16 + 2^16), which is 2^16 * 24.
Case 2: n=2^23, m=2^23
Here, the same term is 2^23 * 24, which is 2^7 times larger than in Case 1.
My point is that if you "merge" two sets with the same total number of elements, the runtime may very much depend on how these elements are distributed among them.
In your first case (10 threads), the distribution of set sizes looks more like Case 2 (most of the sets hold a relatively large number of elements). We don't see the complete data for your second case (128 threads), but the set sizes appear to be distributed much more unevenly, since the ones shown contain only a handful of elements. (Note that this may, or may not, indicate that your parallel computation is poorly load balanced.)
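As a rough illustration (my own sketch, not your program; total, dst, src and the split sizes are arbitrary values I picked), the snippet below builds two std::sets that together always hold the same number of elements, splits them differently between source and destination, and times only the range insert. On a typical implementation, the near-even split should take noticeably longer than inserting a small set into a large one:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <set>

int main() {
    const std::uint64_t total = 1ull << 21;            // combined size is the same in both runs

    for (std::uint64_t n : {1ull << 10, 1ull << 20}) { // n = size of the set being inserted
        std::set<std::uint64_t> dst, src;
        for (std::uint64_t v = 0; v < total; ++v) {
            // multiplicative hash so source and destination interleave in key order
            bool to_src = (v * 2654435761ull) % total < n;
            (to_src ? src : dst).insert(v);
        }

        const auto m = dst.size();                     // size of the set being inserted into
        const auto t0 = std::chrono::steady_clock::now();
        dst.insert(src.begin(), src.end());            // cost ~ n * log(m + n)
        const auto t1 = std::chrono::steady_clock::now();

        std::cout << "m=" << m << " n=" << src.size() << "  insert took "
                  << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    }
}

Adjust total if it runs too slowly on your machine; the point is only that the same combined number of elements can cost very different amounts of time to merge depending on the split.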
Upvotes: 3