Reputation: 1197
I am writing a program in which a bunch of different objects, all stored in a vector, perform parallel operations on private members using public data structures. I'd like to parallelize it for multiple processors using OpenMP, but I have questions about two of the operations in the code, both indicated by comments in the reduced example below that shows the program's logic.
#include <omp.h>
#include <iostream>
#include <sys/timeb.h>
#include <vector>

class A {
private:
    long _i;
public:
    void add_i(long &i) { _i += i; }
    long get_i() const { return _i; }
};

int main()
{
    timeb then;
    ftime(&then);

    unsigned int BIG = 1000000;
    int N = 4;
    std::vector<long> foo(BIG, 1);
    std::vector<A *> bar;
    for (int i = 0; i < N; i++)
    {
        bar.push_back(new A());
    }

    #pragma omp parallel num_threads(4)
    {
        for (long i = 0; i < BIG; i++)
        {
            int thread_n = omp_get_thread_num();
            // read a global variable
            long *to_add = &foo[i];
            // write to a private variable
            bar[thread_n]->add_i(*to_add);
        }
    }

    timeb now;
    ftime(&now);

    for (int i = 0; i < N; i++)
    {
        std::cout << bar[i]->get_i() << std::endl;
    }
    // millitm alone wraps every second; include the seconds as well
    std::cout << (now.time - then.time) * 1000 + (now.millitm - then.millitm)
              << std::endl;

    for (A *a : bar)
    {
        delete a;
    }
}
The first comment addresses the read from the global foo. Is this "false sharing" (or data sloshing)? Most of the resources I've read discuss false sharing in terms of write operations, and I don't know whether the same applies to reads.
The second comment addresses the write operations to the objects in bar. Same question: is this false sharing? The threads write to elements of the same global data structure (which, from what I've read, causes sloshing), but each thread only ever acts on private data within its own element.
When I replace the OpenMP pragma with a plain for loop, the program is faster by about 25%, so I'm guessing I'm doing something wrong...
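For reference, the serial comparison isn't shown above; it replaces the parallel block with an ordinary loop, roughly like this (a sketch, since the exact replacement loop isn't in the question; an outer loop over the N objects stands in for the four threads, so the total work is the same):
// Serial stand-in for the parallel block above (sketch only): each of
// the N objects accumulates all of foo, matching the work each of the
// four threads did in the parallel version.
for (int t = 0; t < N; t++)
{
    for (long i = 0; i < BIG; i++)
    {
        bar[t]->add_i(foo[i]);
    }
}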
Upvotes: 2
Views: 802
Reputation: 74365
Modern memory allocators are thread-aware. To prevent false sharing when it comes to modifying each instance of class A pointed to by the elements of bar, you should move the memory allocation inside the parallel region, e.g.:
const int N = 4;
std::vector<long> foo(BIG, 1);
std::vector<A *> bar(N);

#pragma omp parallel num_threads(N)
{
    int thread_n = omp_get_thread_num();
    bar[thread_n] = new A();

    for (long i = 0; i < BIG; i++)
    {
        // read a global variable
        long *to_add = &foo[i];
        // write to a private variable
        bar[thread_n]->add_i(*to_add);
    }
}
Note also that in this case omp_get_thread_num() is called only once, as opposed to BIG times as in your code. The overhead of a single function call is relatively low, but it adds up when you make that many calls.
Upvotes: 1
Reputation: 129314
Your biggest sharing problem is the bar[thread_n]. The reading of foo[i] is less of an issue.
Edit: Since bar[thread_n] holds pointers, and the pointee is what gets updated, there is little or no sharing there. You may still benefit from loading a "lump" at a time into each CPU core, rather than reading one or two items per core from each cache line, so the code below may still help. As always when performance is the question, benchmark a lot (with optimisation enabled), as different systems behave differently (depending on the compiler, CPU architecture, memory subsystem, etc.).
It would be better to "lump" a few items at a time in each thread. Something like this, perhaps:
const int STEP = 16;
for (long i = 0; i < BIG; i += STEP)
{
    int thread_n = omp_get_thread_num();
    long stop = std::min<long>(BIG - i, STEP); // Don't go over the edge; needs <algorithm>.
    for (long j = 0; j < stop; j++)
    {
        // read a global variable
        long *to_add = &foo[i + j];
        // write to a private variable
        bar[thread_n]->add_i(*to_add);
    }
}
You may need to tune STEP a bit to find the right granularity.
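For comparison, here is a sketch (not from the answer above) of how the same chunking can be expressed with OpenMP's schedule clause, reusing class A from the question. Note that, unlike the question's code, the worksharing construct also divides the iterations among the threads, so each A accumulates only its own thread's share of foo rather than the whole of it:
#include <omp.h>
#include <vector>

// Sketch: schedule(static, 16) hands each thread contiguous blocks of
// 16 iterations, so every core streams through a lump of foo at a time.
// Because the iterations are split across threads, the per-object
// totals differ from the original program's output.
void accumulate(std::vector<long> &foo, std::vector<A *> &bar)
{
    #pragma omp parallel num_threads(4)
    {
        int thread_n = omp_get_thread_num();

        #pragma omp for schedule(static, 16)
        for (long i = 0; i < (long)foo.size(); i++)
        {
            bar[thread_n]->add_i(foo[i]);
        }
    }
}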
Upvotes: 0