Reputation: 5856
I have the following code, that starts multiple Threads (a threadpool) at the very beginning (startWorkers()
). Subsequently, at some point i have a container full of myWorkObject
instances, which I want to process using multiple worker threads simulatenously. The myWorkObject
are completely isolated from another in terms of memory usage. For now lets assume myWorkObject has a method doWorkIntenseStuffHere()
which takes some cpu time to calculate.
When benchmarking the following code, i have noticed that this code does not scale well with the number of threads, and the overhead for initializing/synchronizing the worker threads exceeds the benefit of multithreading unless there are 3-4 threads active. I've looked into this issue and read about the false-sharing problem and i assume my code suffers from this problem. However, I'd like to debug/profile my code to see whether there is some kind of starvation/false sharing going on. How can I do this? Please feel free to critize anything about my code as I'm still learning a lot about memory/cpu and multithreading in particular.
#include <boost/thread.hpp>
class MultiThreadedFitnessProcessingStrategy
{
public:
MultiThreadedFitnessProcessingStrategy(unsigned int numWorkerThreads):
_startBarrier(numWorkerThreads + 1),
_endBarrier(numWorkerThreads + 1),
_started(false),
_shutdown(false),
_numWorkerThreads(numWorkerThreads)
{
assert(_numWorkerThreads > 0);
}
virtual ~MultiThreadedFitnessProcessingStrategy()
{
stopWorkers();
}
void startWorkers()
{
_shutdown = false;
_started = true;
for(unsigned int i = 0; i < _numWorkerThreads;i++)
{
boost::thread* workerThread = new boost::thread(
boost::bind(&MultiThreadedFitnessProcessingStrategy::workerTask, this,i)
);
_threadQueue.push_back(new std::queue<myWorkObject::ptr>());
_workerThreads.push_back(workerThread);
}
}
void stopWorkers()
{
_startBarrier.wait();
_shutdown = true;
_endBarrier.wait();
for(unsigned int i = 0; i < _numWorkerThreads;i++)
{
_workerThreads[i]->join();
}
}
void workerTask(unsigned int id)
{
//Wait until all worker threads have started.
while(true)
{
//Wait for any input to become available.
_startBarrier.wait();
bool queueEmpty = false;
std::queue<SomeClass::ptr >* myThreadq(_threadQueue[id]);
while(!queueEmpty)
{
SomeClass::ptr myWorkObject;
//Make sure queue is not empty,
//Caution: this is necessary if start barrier was triggered without queue input (e.g., shutdown) , which can happen.
//Do not try to be smart and refactor this without knowing what you are doing!
queueEmpty = myThreadq->empty();
if(!queueEmpty)
{
chromosome = myThreadq->front();
assert(myWorkObject);
myThreadq->pop();
}
if(myWorkObject)
{
myWorkObject->doWorkIntenseStuffHere();
}
}
//Wait until all worker threads have synchronized.
_endBarrier.wait();
if(_shutdown)
{
return;
}
}
}
void doWork(const myWorkObject::chromosome_container &refcontainer)
{
if(!_started)
{
startWorkers();
}
unsigned int j = 0;
for(myWorkObject::chromosome_container::const_iterator it = refcontainer.begin();
it != refcontainer.end();++it)
{
if(!(*it)->hasFitness())
{
assert(*it);
_threadQueue[j%_numWorkerThreads]->push(*it);
j++;
}
}
//Start Signal!
_startBarrier.wait();
//Wait for workers to be complete
_endBarrier.wait();
}
unsigned int getNumWorkerThreads() const
{
return _numWorkerThreads;
}
bool isStarted() const
{
return _started;
}
private:
boost::barrier _startBarrier;
boost::barrier _endBarrier;
bool _started;
bool _shutdown;
unsigned int _numWorkerThreads;
std::vector<boost::thread*> _workerThreads;
std::vector< std::queue<myWorkObject::ptr >* > _threadQueue;
};
Upvotes: 1
Views: 638
Reputation: 20080
If you're on Linux, there is a tool called valgrind, with one of the modules doing cache effects simulation (cachegrind). Please take a look at
http://valgrind.org/docs/manual/cg-manual.html
Upvotes: 1
Reputation: 1755
Sampling-based profiling can give you a pretty good idea whether you're experiencing false sharing. Here's a previous thread that describes a few ways to approach the issue. I don't think that thread mentioned Linux's perf utility. It's a quick, easy and free way to count cache misses that might tell you what you need to know (am I experiencing a significant number of cache misses that correlates with how many times I'm accessing a particular variable?).
If you do find that your threading scheme might be causing a lot of conflict misses, you could try declaring your myWorkObject instances or the data contained within them that you're actually concerned about with __attribute__((aligned(64)))
(alignment to 64 byte cache lines).
Upvotes: 1