zzzbbx

Reputation: 10131

efficiency in multithreading

suppose I have a code like this

for (int i = 0; i < i_max; i++)
  for (int j = 0; j < j_max; j++)
     // do something

and I want to do this using different threads (assuming the // do something tasks are independent of each other — think of Monte Carlo simulations, for instance). My question is: is it necessarily better to create a thread for each value of i than to create a thread for each value of j? Something like this

for (int i = 0; i < i_max; i++)
  create_thread(j_max);

Additionally: what would be a suitable number of threads? Should I just create i_max threads or, perhaps, use a semaphore so that only k < i_max threads are running concurrently at any given time?

Thank you,

Upvotes: 1

Views: 528

Answers (5)

Steve Townsend

Reputation: 54128

The best way to apportion the workload is workload-dependent.

Broadly - for parallelizable workload, use OpenMP; for heterogeneous workload, use a thread pool. Avoid managing your own threads if you can.

Monte Carlo simulation should be a good candidate for truly parallel code rather than thread pool.

By the way - in case you are on Visual C++, there is in Visual C++ v10 an interesting new Concurrency Runtime for precisely this type of problem. This is somewhat analogous to the Task Parallel Library that was added to .Net Framework 4 to ease the implementation of multicore/multi-CPU code.

Upvotes: 4

miked

Reputation: 3598

Everyone here is basically right, but here's a quick-and-dirty way to split up the work and keep all of the processors busy. This works best when 1) creating threads is expensive compared to the work done in an iteration, and 2) most iterations take about the same amount of time to complete.

First, create 1 thread per processor/core. These are your worker threads. They sit idle until they're told to do something.

Now, split up your work so that data needed at the same time is close together. What I mean by that is that if you were processing a ten-element array on a two-processor machine, you'd split it up so that one group is elements 1,2,3,4,5 and the other is 6,7,8,9,10. You may be tempted to split it up as 1,3,5,7,9 and 2,4,6,8,10, but then you're going to cause more false sharing (http://en.wikipedia.org/wiki/False_sharing) in your cache.

So now that you have a thread per processor and a group of data for each thread, you just tell each thread to work on an independent group of that data.

So in your case I'd do something like this.

for (int t = 0; t < n_processors; ++t)
{
  thread[t] = create_thread();
  datamin[t] = t * (i_max / n_processors);       // first outer index for this worker
  datamax[t] = (t + 1) * (i_max / n_processors); // one past its last outer index
}

for (int t = 0; t < n_processors; ++t)
  do_work(thread[t], datamin[t], datamax[t], j_max);

//wait for all threads to be done

//continue with rest of the program.

Of course I left out things like dealing with your data not being an integer multiple of the number of processors, but those are easily fixed.

Also, if you're not averse to 3rd-party libraries, Intel's TBB (Threading Building Blocks) does a great job of abstracting this away and letting you get on with the real work you want to do.

Upvotes: 2

Rich Turner

Reputation: 10984

Avoid creating threads unless you can keep them busy!

If your scenario is compute-bound, then you should limit the number of threads you spawn to the number of cores you expect your code to run on. If you create more threads than you have cores, then the OS has to waste time and resources scheduling the threads to execute on the available cores.

If your scenario is IO-bound, then you should consider using async IO operations that are queued and which you check the response codes from after the async result is returned. Again, in this case, spawning a thread per IO operation is hugely wasteful as you'll cause the OS to have to waste time scheduling threads that are stalled.

Upvotes: 2

user257111

Reputation:

Depends on the tasks and on what platform you're about to simulate on. For example, on CUDA's architecture you can split the tasks up so that each (i, j) task is computed individually.

You still have the time to load data onto the card to consider.

Using for loops and something like OpenMP/MPI/your own threading mechanism, you can basically choose. In one scenario, the outer loop is split across parallel threads and each thread runs the j loop sequentially. In the other, the outer loop runs sequentially and the inner loop is split across threads on each iteration.

Parallelisation (breaking out threads) is costly. Remember that you have the cost of setting up n threads and then synchronising n threads. This represents a cost c over and above the runtime of the routines, which can in itself make the total time greater for parallel processing than for single-threaded mode. It depends on the problem in question; often there's a critical size beyond which parallel is faster.

I would suggest that breaking out into the parallel zone at the first (outer) for loop will be faster. If you do so on the inner loop, you must fork/join every time the outer loop runs, adding a large overhead to the speed of the code. Ideally, you want to have to create threads only once.

Upvotes: 0

Konrad Rudolph

Reputation: 545488

Everything around creating and calling threads is relatively expensive so you want to do that as little as possible.

If you parallelize your inner loop instead of the outer loop, then j_max threads are created on every iteration of the outer loop — i_max * j_max thread creations in total, a factor of i_max more creation rounds than the single one you'd pay by parallelizing the outer loop instead.

That said, the best parallelization depends on your actual problem. Depending on that, it can actually make sense to parallelize the inner loop instead.

Upvotes: 1
