Reputation: 20616
It is safe to read an STL container from multiple parallel threads. However, the performance is terrible. Why?
I create a small object that stores some data in a multiset. This makes the constructors fairly expensive (about 5 µs on my machine). I store hundreds of thousands of these small objects in a large multiset. Processing these objects is independent work, so I split it between threads running on a multi-core machine. Each thread reads the objects it needs from the large multiset and processes them.
The problem is that the reading from the big multiset does not proceed in parallel. It looks like the reads in one thread block the reads in the other.
The code below is the simplest I can make it and still show the problem. First it creates a large multiset containing 100,000 small objects each containing its own empty multiset. Then it calls the multiset copy constructor twice in series, then twice again in parallel.
A profiling tool shows that the serial copy constructors take about 0.23 secs, whereas the parallel ones take twice as long. Somehow the parallel copies are interfering with each other.
// a trivial class with a significant ctor and the ability to populate an associative container
class cTest
{
    multiset<int> mine;
    int id;
public:
    cTest( int i ) : id( i ) {}
    bool operator<( const cTest& o ) const { return id < o.id; }
};

// add 100,000 objects to multiset
void Populate( multiset<cTest>& m )
{
    for( int k = 0; k < 100000; k++ )
    {
        m.insert( cTest( k ) );
    }
}

// copy construct multiset, called from mainline
void Copy( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_main");
    multiset<cTest> copy( m );
}

// copy construct multiset, called from thread
void Copy2( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_thread");
    multiset<cTest> copy( m );
}

int _tmain( int argc, _TCHAR* argv[] )
{
    cRavenProfile profile("test");
    profile.Start();
    multiset<cTest> master;
    Populate( master );

    // two calls to copy ctor from mainline
    Copy( master );
    Copy( master );

    // call copy ctor in parallel
    boost::thread* pt1 = new boost::thread( boost::bind( Copy2, master ));
    boost::thread* pt2 = new boost::thread( boost::bind( Copy2, master ));
    pt1->join();
    pt2->join();

    // display profiler results
    cRavenProfile print_profile;
    return 0;
}
Here is the output
Scope         Calls   Mean (secs)     Total
copy_thread       2      0.472498  0.944997
copy_main         2      0.233529  0.467058
Upvotes: 3
Views: 1064
Reputation: 20616
OK, after spending the majority of the week on this issue, I have the fix.
There were two problems with the code I posted in the question:
boost::bind makes a copy of its parameters, even if the underlying function takes them by reference. Copying the container is expensive, so the multi-threaded version was doing far more work than the serial one. (No-one noticed this!) To pass the container by reference, I needed to use this code:
boost::thread* pt1 = new boost::thread( boost::bind( Copy2, boost::cref(master) ));
As Zan Lynx pointed out, the default container allocates memory for its contents on the global heap through a thread-safe singleton allocator, so the threads contended heavily on that single allocator instance as they each created hundreds of thousands of objects. (Since this was the crux of the mystery, I accepted Zan Lynx's answer.)
The fix for #1 is straightforward, as presented above.
The fix for #2 is, as several people pointed out, to replace the default STL allocator with a thread specific one. This is quite the challenge, and no-one offered a specific source for such an allocator.
I spent some time looking for a thread-specific allocator "off the shelf". The best I found was Hoard (hoard.org). It provided a significant performance improvement; however, Hoard has some serious drawbacks.
So I decided to roll my own thread-specific memory allocator, based on boost::pool and boost::thread_specific_ptr. This required a small amount of, IMHO, seriously advanced C++ code, but now seems to be working well.
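This is not the author's actual allocator, but a rough sketch of the idea in standard C++: give each thread its own cache of freed nodes so single-object allocations never touch a shared lock. It substitutes C++11 thread_local for boost::thread_specific_ptr and a plain free list for boost::pool; the name PerThreadAlloc is invented for illustration:

```cpp
#include <cstddef>
#include <set>
#include <vector>

template <typename T>
struct PerThreadAlloc
{
    using value_type = T;

    PerThreadAlloc() = default;
    template <typename U> PerThreadAlloc(const PerThreadAlloc<U>&) {}

    T* allocate(std::size_t n)
    {
        // Single-object requests (the common case for node-based
        // containers) are served from this thread's private free list.
        if (n == 1 && !free_list().empty())
        {
            T* p = free_list().back();
            free_list().pop_back();
            return p;
        }
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t n)
    {
        if (n == 1)
            free_list().push_back(p);   // recycle locally, no lock taken
        else
            ::operator delete(p);
    }

private:
    static std::vector<T*>& free_list()
    {
        thread_local std::vector<T*> cache;  // one cache per thread
        return cache;
    }
};

template <typename T, typename U>
bool operator==(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return false; }

// Usage: plug it into the container's allocator template parameter.
using FastSet = std::multiset<int, std::less<int>, PerThreadAlloc<int>>;
```

A real implementation needs more care than this sketch: blocks cached by a thread are never returned to the system (they leak when the thread exits), and memory freed on a different thread from where it was allocated migrates between caches.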
Upvotes: 2
Reputation: 20616
To answer Pavel Shved in more detail, here is how the majority of my code runs:
step:       0 1 2 3 4 5 6 7 8 9
core1:      1 1 1 1 1
core2:      2 2 2 2 2
sequential: 1 1 1 1 1 2 2 2 2 2
Only the parallel reads interfere with each other.
As an experiment, I replaced the big multiset with an array of pointers to cTest. The code now leaks memory badly, but never mind. The interesting thing is that the relative performance is worse: running the copy constructors in parallel slows them down by a factor of four!
Scope               Calls   Mean (secs)     Total
copy_array_thread       2      0.454432  0.908864
copy_array_main         2      0.116905  0.233811
Upvotes: 0
Reputation: 54393
You mentioned copy constructors. I assume that these also allocate memory from the heap?
Allocating heap memory in multiple threads is a big mistake.
The standard allocator is probably a single locked-pool implementation. You need to either avoid heap memory (allocate on the stack) or use a thread-optimized heap allocator.
Upvotes: 10
Reputation: 264739
Since I am not sure how your profiler works, it is hard to tell.
What I would prefer to see is some explicit timing around the code, repeated enough times to average out anything (such as a context switch) that distorts a single run:
    // runnable version of the sketch, using std::chrono for the timing
    auto tt = std::chrono::steady_clock::duration::zero();
    for( int loop = 0; loop < 100; ++loop )
    {
        auto ts = std::chrono::steady_clock::now();
        Copy( master );
        Copy( master );
        auto te = std::chrono::steady_clock::now();
        tt += te - ts;
    }
    tt /= 100;   // mean time for two serial copies
Then compare this with your profiler results.
Upvotes: 0
Reputation: 99424
What is the scheduling of your threads? If you run two threads doing comparable amounts of work, they most likely start together and finish together. The profiler therefore attributes twice the time to each thread, because while each thread runs, the work of both is being done; each sequential call, by contrast, is measured on its own.
step:       0 1 2 3 4 5 6 7 8 9
threaded:   1 2 1 2 1 2 1 2 1 2
sequential: 1 1 1 1 1 2 2 2 2 2
Thread one started at step 0 and ended at step 8, so its measured execution time is 8; thread two started at 1 and ended at 9, again a time of 8. The two sequential runs take 5 steps each. So in the resulting table you will see 16 for the concurrent version and 10 for the sequential one.
Assuming all of the above holds and there is a considerable number of steps, the ratio of the execution times shown by your profiler should be about two. Your experiment does not contradict this hypothesis.
Upvotes: 0