Reputation: 21
I have a small program that implements a Monte Carlo simulation of blackjack using various card-counting strategies. My main function basically does this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
for(int i = 0; i < simulations; ++i)
    runSimulation(bankroll, hands, tests, strategy);
The entire program, run in a single thread on my machine, takes about 10 seconds.
I wanted to take advantage of the 3 cores my processor has, so I decided to rewrite the program to simply execute the various strategies in separate threads, like this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
boost::thread threads[simulations];
for(int i = 0; i < simulations; ++i)
    threads[i] = boost::thread(boost::bind(runSimulation, bankroll, hands, tests, strategy));

for(int i = 0; i < simulations; ++i)
    threads[i].join();
However, when I ran this program, even though I got the same results, it took around 24 seconds to complete. Did I miss something here?
Upvotes: 2
Views: 293
Reputation: 190
I'm late to this party, but wanted to note two things for others who come across this post:
1) Definitely see the second Herb Sutter link that David points out (http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206). It solved the problem that brought me to this question: it outlines a padded struct wrapper that keeps each thread's data on its own cache line, so parallel threads never contend for the same line (when data from different threads shares a cache line, the cache-coherence hardware effectively serializes their writes, even though the threads never touch the same variable). A sketch of the idea appears at the end of this answer.
2) Re the original question, dlev points out a large part of the problem, but since it's a simulation I suspect there's a deeper issue slowing things down. While none of your program's high-level variables are shared, you probably do share one critical piece of hidden state: the "last random number" that the C library stores under the hood and uses to produce the next one. Even if you initialize dedicated generator objects for each simulation, if they ultimately call something like rand() then they, and by extension their threads, are all hitting the same shared global state and contending with one another.
Solutions to issue #2 depend on the structure of the simulation program itself. For instance, if the calls to the random generator are scattered throughout a run, I'd batch them into one upfront call that retrieves and stores everything the simulation will need. And this has me wondering about more sophisticated approaches that would deal with the underlying shared-resource issue in random generation...
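For issue #2, the most direct fix is to give every thread its own generator so that no hidden random-number state is shared at all. Here is a minimal sketch, assuming a C++11 compiler for <random>; the Deck class and its names are illustrative, not part of the original program:

#include <random>

// Each simulation thread owns one of these; there is no shared hidden state,
// so the threads never contend on the random number generator.
class Deck {
public:
    explicit Deck(unsigned seed) : engine_(seed), card_(0, 51) {}

    // Draw a card index in [0, 51] using only this thread's engine.
    int draw() { return card_(engine_); }

private:
    std::mt19937 engine_;                      // per-thread Mersenne Twister
    std::uniform_int_distribution<int> card_;  // uniform over the 52 card indices
};

// Inside thread i:
//   Deck deck(baseSeed + i);   // a distinct seed per thread
//   int card = deck.draw();

Each thread seeds its own engine differently, so the runs stay statistically independent and no locking is needed.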
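And for point 1), here is a minimal sketch of the padding idea from Sutter's article, assuming each thread writes its results into its own slot of a shared array; the struct, its names, and the 64-byte cache-line size are illustrative assumptions, not taken from the original program:

#include <cstddef>
#include <vector>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

// Assumed cache-line size; 64 bytes is typical on current x86 hardware.
const std::size_t CACHE_LINE_SIZE = 64;

// One padded slot per thread, so no two slots ever share a cache line.
struct PaddedResult {
    double outcome;                                // this thread's result
    char   pad[CACHE_LINE_SIZE - sizeof(double)];  // filler to occupy the full line
};

void worker(PaddedResult& slot)
{
    // ... run one simulation, writing only into this thread's slot ...
    slot.outcome = 0.0;
}

int main()
{
    const int simulations = 3;                     // illustrative value
    std::vector<PaddedResult> results(simulations);

    boost::thread_group threads;
    for (int i = 0; i < simulations; ++i)
        threads.create_thread(boost::bind(&worker, boost::ref(results[i])));
    threads.join_all();
}

Because each slot occupies a whole cache line, the coherence traffic that false sharing causes simply goes away.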
Upvotes: 0
Reputation: 3442
Read these articles from Herb Sutter. You are probably a victim of "false sharing".
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=214100002
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=217500206
Upvotes: 1
Reputation: 6901
I agree with dlev. If your runSimulation function doesn't change anything that a later call to runSimulation depends on, then you can do something like this:
1) Divide simulations by 3.
2) You now have three index ranges: 0 to simulations/3, simulations/3 + 1 to 2*simulations/3, and 2*simulations/3 + 1 to simulations.
3) Each of these three ranges can be handled by a different thread simultaneously (see the sketch below).
**Note:** This approach might not suit your requirements at all if the simulations have to lock shared data.
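A minimal sketch of that split, assuming runSimulation has no side effects shared between calls; the stub signature here is simplified for illustration (the real function would also take the strategy):

#include <boost/thread.hpp>

// Stub standing in for the question's routine.
void runSimulation(int bankroll, int hands, int tests) { /* ... */ }

// Run the simulations with indices [first, last) on one thread.
void runSimulationRange(int first, int last, int bankroll, int hands, int tests)
{
    for (int i = first; i < last; ++i)
        runSimulation(bankroll, hands, tests);
}

int main()
{
    int bankroll = 50000, hands = 100, tests = 10000;
    int simulations = 9;              // illustrative value
    int third = simulations / 3;

    // Three long-lived threads, each handling one third of the index range.
    boost::thread t1(runSimulationRange, 0,         third,       bankroll, hands, tests);
    boost::thread t2(runSimulationRange, third,     2 * third,   bankroll, hands, tests);
    boost::thread t3(runSimulationRange, 2 * third, simulations, bankroll, hands, tests);

    t1.join();
    t2.join();
    t3.join();
    return 0;
}

This way the thread-creation cost is paid three times in total rather than once per simulation.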
Upvotes: 0
Reputation: 48596
If the value of simulations is high, then you end up creating a lot of threads, and the overhead of creating them (plus the context switching once you have far more threads than cores) can end up destroying any possible performance gains.
EDIT: One approach to this might be to just start three threads and let them each run 1/3 of the desired simulations. Alternatively, using a thread pool of some kind could also help.
Upvotes: 5
Reputation: 986
This is a good candidate for a work queue backed by a thread pool. I have used Intel Threading Building Blocks (TBB) for such requirements; a handcrafted thread pool also works for quick hacks. On Windows, the OS provides a nice thread-pool-backed work queue via QueueUserWorkItem().
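For example, here is a minimal sketch of the question's loop rewritten with TBB's parallel_for; the runSimulation stub is simplified for illustration, and the lambda form assumes a C++11-capable compiler:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Stub standing in for the question's simulation routine.
void runSimulation(int bankroll, int hands, int tests) { /* ... */ }

int main()
{
    int bankroll    = 50000;
    int hands       = 100;
    int tests       = 10000;
    int simulations = 12;   // illustrative value

    // TBB splits [0, simulations) into chunks and runs them on a worker pool
    // sized to the machine, so thread-creation overhead is paid only once.
    tbb::parallel_for(tbb::blocked_range<int>(0, simulations),
        [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                runSimulation(bankroll, hands, tests);
        });
    return 0;
}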
Upvotes: 2