PidgeyBAWK

Reputation: 319

C++ OpenMP Tasks - passing by reference issue

I am currently working on a system that reads in a file of more than 200 million records (lines), so I am buffering the records and using OpenMP tasks to process each batch while continuing to read input. Each record in the buffer takes roughly 60 μs to process in work_on_data and generates a string result. To avoid critical regions, I create a vector for the results and pass a placeholder record (which I insert into this vector) by reference to the work_on_data function:

int i = 0;
string buffer[MAX_SIZE];
vector<string> task_results;

#pragma omp parallel shared(map_a, task_results), num_threads(X) 
#pragma omp single
{
    while (getline(fin, line) && !fin.eof())
    {
        buffer[i] = line;
        if (++i == MAX_SIZE)
        {
            string result = "";
            task_results.push_back(result);
#pragma omp task firstprivate(buffer)
            work_on_data(buffer, map_a, result);
            i = 0;
        }
    }
}

// eventually merge records in task_results

By the end of work_on_data, each result passed in is no longer the empty string it was initialized with. However, when I merge the results, every entry in task_results is still an empty string. I may be doing something stupid here regarding scoping/addressing, but I don't see what the problem is. Any thoughts?

Thanks in advance.

Upvotes: 1

Views: 983

Answers (1)

jepio

Reputation: 2281

Pushing something into a vector causes a copy of it to be constructed inside the vector. So your work_on_data function doesn't get a reference to the string inside the vector, but to the string inside the if block. To fix this you could rewrite your code to give it access to the last element after the push_back, like so:

if (++i == MAX_SIZE)
{
    task_results.push_back("");
#pragma omp task firstprivate(buffer)
    work_on_data(buffer, map_a, task_results.back());
    i = 0;
}
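The copy described above can be checked in isolation, without OpenMP. This is only a minimal sketch: push_back_stores_a_copy is a hypothetical helper name, and the string written to local stands in for whatever work_on_data would produce.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns true if writing to the local string after
// push_back leaves the element stored in the vector untouched.
bool push_back_stores_a_copy() {
    std::vector<std::string> results;
    std::string local = "";
    results.push_back(local);      // the vector now holds its own copy
    local = "written by worker";   // only the local object changes
    return results.back().empty() && &results.back() != &local;
}
```

Calling push_back_stores_a_copy() should return true, which is the same effect seen in the question: the worker fills in one string while the vector keeps a separate, still empty one.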

Edit:

I had forgotten about iterator invalidation on vector reallocation, and additionally the call to back() leads to race conditions. With (smart) pointers (as the comments are suggesting) and a dedicated counter this works for me with no segfault:

vector<shared_ptr<string>> task_results;

int ctr = 0;
...
if (++i == MAX_SIZE) {
    task_results.push_back(make_shared<string>());
#pragma omp task firstprivate(buffer, ctr)
    work_on_data(buffer, map_a, *task_results[ctr]);
    i = 0;
    ++ctr;
}

I think the back() version segfaults because back() is evaluated when the task runs, not when it is created. Many threads call it at the same time, and if the main thread manages to push_back somewhere in between, several tasks end up working on the same element.
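One way to keep worker threads away from the vector entirely is to take the address of each result on the producer thread, before the task is created, and hand only that pointer to the task. The following is a sketch under assumptions, not the code from the question: work_on_batch and process_in_batches are hypothetical names, the input is an in-memory vector rather than a file, and the per-batch work is a simple join.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for work_on_data: joins a batch into one string.
void work_on_batch(const std::vector<std::string>& batch, std::string& result) {
    for (const std::string& record : batch) {
        result += record;
        result += ';';
    }
}

// Processes records in batches of batch_size, one OpenMP task per batch.
std::vector<std::string> process_in_batches(const std::vector<std::string>& records,
                                            std::size_t batch_size) {
    assert(batch_size > 0);
    std::vector<std::shared_ptr<std::string>> task_results;
    std::vector<std::string> batch;

    #pragma omp parallel
    #pragma omp single
    {
        for (std::size_t i = 0; i < records.size(); ++i) {
            batch.push_back(records[i]);
            const bool last = (i + 1 == records.size());
            if (batch.size() == batch_size || last) {
                task_results.push_back(std::make_shared<std::string>());
                // Resolve the slot here, on the producer thread. The string
                // lives on the heap, so later push_back calls do not move it.
                std::string* slot = task_results.back().get();
                // The task gets its own copy of the batch and of the pointer;
                // it never reads task_results itself.
                #pragma omp task firstprivate(batch, slot)
                work_on_batch(batch, *slot);
                batch.clear();
            }
        }
    }
    // The barrier at the end of the parallel region guarantees that all
    // tasks have finished before the results are read.
    std::vector<std::string> merged;
    for (const std::shared_ptr<std::string>& r : task_results) {
        merged.push_back(*r);
    }
    return merged;
}
```

For example, process_in_batches({"a", "b", "c", "d", "e"}, 2) should yield "a;b;", "c;d;" and "e;" in that order. Built without OpenMP support the pragmas are ignored and the same code runs sequentially.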

Upvotes: 2
