Reputation: 75
I have an algorithm, one goal of which is to fill vectors. For the sake of performance, iterations of the algorithm are spread across OpenMP threads. I was wondering which of the following would be the better/safer way of filling the vectors.
Note that the ordering of the vectors must be consistent (i.e. value n of vec1 must come from the same iteration as value n of vec2).
Hypothesis 1:
std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma omp parallel for
for(...)
{
// Do some intensive stuff to compute val1 and val2
// ...
#pragma omp critical
{
vec1.push_back(val1);
vec2.push_back(val2);
}
}
// Then go on to work with vec1 and vec2...
Hypothesis 2:
std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma omp parallel
{
std::vector<BasicType> vec1local;
std::vector<BasicType> vec2local;
#pragma omp for
for(...)
{
// Do some intensive stuff to compute val1 and val2
// ...
vec1local.push_back(val1);
vec2local.push_back(val2);
}
#pragma omp critical
{
// See note 1 below
vec1.insert(vec1.end(), vec1local.begin(), vec1local.end());
vec2.insert(vec2.end(), vec2local.begin(), vec2local.end());
}
}
// Then go on to work with vec1 and vec2...
Note 1: This was taken from Best way to append vector to vector
Note 2: It seems val1 and val2 could be combined into a single object in order to maintain consistency and work with only one vector, but for now that seems impractical for the remainder of the algorithm.
Note 3: To give an order of magnitude, the for loop contains around 100 iterations split among 4 threads. Apart from very few exceptions, each iteration is expected to have the same workload (which raises the issue of the critical sections all being reached around the same time).
Note 4: Just for the record, the whole thing deals with image stabilisation and runs on a Tegra K1 architecture. The compiler used is gcc 4.8.4.
Upvotes: 7
Views: 5421
Reputation: 33659
I am going to assume that you need to use a vector and cannot use an array (otherwise your question is not very interesting). With t = omp_get_num_threads(), you fill the vectors in parallel and then merge them in log2(t) operations instead of t operations (as you are doing now), like this:
void reduce(std::vector<BasicType> *v1, std::vector<BasicType> *v2, int begin, int end) {
if(end - begin == 1) return;
int pivot = (begin+end)/2;
#pragma omp task
reduce(v1, v2, begin, pivot);
#pragma omp task
reduce(v1, v2, pivot, end);
#pragma omp taskwait
v1[begin].insert(v1[begin].end(), v1[pivot].begin(), v1[pivot].end());
v2[begin].insert(v2[begin].end(), v2[pivot].begin(), v2[pivot].end());
}
and
std::vector<BasicType> v1, v2;
std::vector<BasicType> *v1p, *v2p;
#pragma omp parallel
{
#pragma omp single
{
v1p = new std::vector<BasicType>[omp_get_num_threads()];
v2p = new std::vector<BasicType>[omp_get_num_threads()];
}
#pragma omp for
for(...) {
// Do some intensive stuff to compute val1 and val2
// ...
v1p[omp_get_thread_num()].push_back(val1);
v2p[omp_get_thread_num()].push_back(val2);
}
#pragma omp single
reduce(v1p, v2p, 0, omp_get_num_threads());
}
v1 = v1p[0], v2 = v2p[0];
delete[] v1p;
delete[] v2p;
For example, with 8 threads this will join the vectors for threads:
(0,1) (2,3) (4,5) (6,7)
(0,2) (4,6)
(0,4)
For more information about filling vectors in parallel, see this. For more information about merging threads in log2(t) operations, see the answer to this question.
Upvotes: 2
Reputation: 22670
From your two suggestions, I would prefer the first. It is simpler and does not require additional memory. However, I would suggest an alternative without the critical section:
std::vector<BasicType> vec1(size);
std::vector<BasicType> vec2(size);
#pragma omp parallel for
for(...)
{
// Do some intensive stuff to compute val1 and val2
// ...
vec1[i] = val1;
vec2[i] = val2;
}
Note that this may have some performance issues due to cache line stealing (false sharing). However, I wouldn't worry about that unless it turns out to be an actual problem, validated by performance analysis. A solution could then be:
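One possible sketch (only a sketch on my part, assuming a 64-byte cache line and a loop index i over [0, size); adjust the figures for your hardware): give the static schedule a chunk size that spans whole cache lines, so neighbouring threads rarely write into the same line of vec1 or vec2.
// Assumed cache-line size on the target; adjust for the actual hardware.
const std::size_t cacheLineBytes = 64;
// Chunk size covering whole cache lines (at least one element per chunk).
const std::size_t chunk =
    cacheLineBytes / sizeof(BasicType) ? cacheLineBytes / sizeof(BasicType) : 1;

std::vector<BasicType> vec1(size);
std::vector<BasicType> vec2(size);
// Each thread writes chunks of whole cache lines, reducing false sharing.
#pragma omp parallel for schedule(static, chunk)
for(std::size_t i = 0; i < size; ++i)
{
    // Do some intensive stuff to compute val1 and val2
    // ...
    vec1[i] = val1;
    vec2[i] = val2;
}
Note that with the default schedule(static) each thread already writes one contiguous block, so sharing is confined to the few block boundaries; whether an explicit chunk size helps at all is something to confirm by measurement.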
Upvotes: 3