Vincent COISY

Reputation: 75

Vector filling across OpenMP threads

I have an algorithm whose goal, in part, is to fill vectors. For the sake of performance, iterations of the algorithm are spread across OpenMP threads. I was wondering which of the following would be the better/safer way of filling the vectors.

Note that the ordering of the vectors must be consistent (i.e. value n of vec1 must come from the same iteration as value n of vec2.)

Hypothesis 1:

std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma omp parallel for
for(...)
{
    // Do some intensive stuff to compute val1 and val2
    // ...

    #pragma omp critical
    {
        vec1.push_back(val1);
        vec2.push_back(val2);
    }
}
// Then go on to work with vec1 and vec2...

Hypothesis 2:

std::vector<BasicType> vec1;
std::vector<BasicType> vec2;
#pragma omp parallel
{
    std::vector<BasicType> vec1local;
    std::vector<BasicType> vec2local;
    #pragma omp for
    for(...)
    {
        // Do some intensive stuff to compute val1 and val2
        // ...

        vec1local.push_back(val1);
        vec2local.push_back(val2);
    }

    #pragma omp critical
    {
        // See note 1 below
        vec1.insert(vec1.end(), vec1local.begin(), vec1local.end());
        vec2.insert(vec2.end(), vec2local.begin(), vec2local.end());
    }
}
// Then go on to work with vec1 and vec2...

Note 1: This was taken from Best way to append vector to vector

Note 2: It seems val1 and val2 could be combined into one object in order to maintain consistency and work with only one vector, but for now that seems impractical for the remainder of the algorithm.

Note 3: To give an order of magnitude, the for loop contains around 100 iterations split among 4 threads. Apart from very few exceptions, each iteration is expected to have the same workload (which raises the issue of the critical sections happening around the same time.)

Note 4: Just for the record, the whole thing deals with image stabilisation, and runs on a Tegra K1 architecture. Compiler used is gcc 4.8.4.

Upvotes: 7

Views: 5421

Answers (2)

Z boson

Reputation: 33659

I am going to assume that you need to use a vector and cannot use an array (otherwise your question is not very interesting). With t = omp_get_num_threads(), you can fill the vectors in parallel and then merge them in log2(t) operations instead of t operations (as you are doing now), like this

void reduce(std::vector<BasicType> *v1, std::vector<BasicType> *v2, int begin, int end) {
    if(end - begin < 2) return;
    int pivot = (begin+end)/2;
    #pragma omp task
    reduce(v1, v2, begin, pivot);
    #pragma omp task
    reduce(v1, v2, pivot, end);
    #pragma omp taskwait
    v1[begin].insert(v1[begin].end(), v1[pivot].begin(), v1[pivot].end());
    v2[begin].insert(v2[begin].end(), v2[pivot].begin(), v2[pivot].end());
}

and

std::vector<BasicType> v1, v2;
std::vector<BasicType> *v1p, *v2p;
#pragma omp parallel
{
    #pragma omp single
    {
        v1p = new std::vector<BasicType>[omp_get_num_threads()];
        v2p = new std::vector<BasicType>[omp_get_num_threads()];
    }
    #pragma omp for
    for(...) {
        // Do some intensive stuff to compute val1 and val2
        // ...
        v1p[omp_get_thread_num()].push_back(val1);
        v2p[omp_get_thread_num()].push_back(val2);
    }
    #pragma omp single
    reduce(v1p, v2p, 0, omp_get_num_threads());
}
v1 = v1p[0], v2 = v2p[0];
delete[] v1p;
delete[] v2p;

For example with 8 threads this will join the vectors for threads

(0,1) (2,3) (4,5) (6,7)
(0,2) (4,6)
(0,4)

For more information about filling vectors in parallel see this. For more information about merging threads in log2(t) operations see the answer to this question.

Upvotes: 2

Zulan

Reputation: 22670

From your two suggestions, I would prefer the first. It is simpler and does not require additional memory. However, I would suggest an alternative without the critical section:

std::vector<BasicType> vec1(size);
std::vector<BasicType> vec2(size);
#pragma omp parallel for
for(...)
{
    // Do some intensive stuff to compute val1 and val2
    // ...

    vec1[i] = val1;
    vec2[i] = val2;
}

Note that this may have some performance issues due to cache line stealing. However, I wouldn't worry about that unless it turns out to be an actual problem validated by actual performance analysis. A solution could then be:

  • Go with your second solution. (which costs memory and additional post-processing)
  • Align your vector and use appropriate chunks for the loop. (such that each thread has local cache lines)
  • Use a data structure that internally contains the local vectors, but to the outside provides the necessary global vector-operations. (This is probably overall the best solution.)

Upvotes: 3
