A-A

Reputation: 663

Large number of threads in C++ and Efficiency

I've written a program in C++ that sometimes uses over 300 threads. In my program, I have an array of structs, and the length of the array equals the number of threads. Let's assume that I have 400 structs and therefore 400 threads.

In a single iteration of a for loop, I apply a function to each of the 400 structs, and this function is executed in a thread. Therefore, I have 400 threads running concurrently. (I am using the boost thread library).

I've tried to give a breakdown of what my code looks like (it is not the actual code):

#include <vector>
#include <boost/thread.hpp>

struct my_struct{
  // Structure's members
};

std::vector<my_struct> my_vec(400);

void my_fun(my_struct* my_str){
  // Operations on my_str
}

int main(){
  std::vector<boost::thread> thr(400);
  for (int k = 0; k < 300; k++){
    // Launch one thread per struct
    for (int i = 0; i < 400; i++){
      thr.at(i) = boost::thread(my_fun, &my_vec.at(i));
    }
    // Wait for all 400 threads before starting the next iteration
    for (int m = 0; m < 400; m++){
      thr.at(m).join();
    }
  }
}

The function I am using is computationally intensive, and as the code above shows, I use 400 threads to do the calculations, 300 times over. Is there a more efficient way of performing this task? I'm not sure whether having so many threads active at once hurts performance. I've heard of the threadpool library, but I'm not sure whether it would provide any benefit to me. Any help is appreciated.

Thank You Very Much.

Upvotes: 6

Views: 10214

Answers (6)

kqnr

Reputation: 3596

There is absolutely no benefit to spawning 400 CPU-bound threads unless you have 400+ processor cores in your target machine.

It would be impossible to tell you with any certainty how to better distribute your workload without knowing what sort of computations you're performing, and on what kind of data.

As a shot in the dark, judging from what you have posted, a first stab would be to use N threads (see below), and divide your 400 objects among them so that each thread is responsible for processing approximately 400/N objects. Each thread can loop 300 times, and on each iteration it can process each of its assigned objects.

N is an arbitrary number; in fact, I recommend trying different values and comparing the performance results. However, unless your threads are performing I/O or other operations that block on non-computational work, N should be no larger than the number of processor cores in your machine (go beyond that and watch your performance drop quickly).
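As a concrete illustration of that split, here is a minimal sketch using boost::thread_group, reusing the question's names (the stand-in struct member, the stand-in computation in my_fun, and the choice of hardware_concurrency() for N are assumptions for illustration, not part of the question):

#include <cstddef>
#include <vector>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

struct my_struct { double value; };            // stand-in for the real members
std::vector<my_struct> my_vec(400);

void my_fun(my_struct* s) { s->value += 1.0; } // stand-in computation

// Each worker owns a fixed slice [begin, end) of my_vec and runs
// all 300 outer iterations over it, as suggested above.
void worker(std::size_t begin, std::size_t end){
  for (int k = 0; k < 300; k++)
    for (std::size_t i = begin; i < end; i++)
      my_fun(&my_vec[i]);
}

int main(){
  unsigned N = boost::thread::hardware_concurrency(); // number of cores
  if (N == 0) N = 2;                                  // fall back if unknown

  boost::thread_group pool;
  std::size_t chunk = my_vec.size() / N;
  for (unsigned t = 0; t < N; t++){
    std::size_t begin = t * chunk;
    std::size_t end = (t + 1 == N) ? my_vec.size() : begin + chunk;
    pool.create_thread(boost::bind(worker, begin, end));
  }
  pool.join_all(); // wait for every worker to finish
}

Note that this assumes the objects are independent across iterations; if iteration k+1 must see every object's results from iteration k, you would need a barrier between rounds rather than fully independent loops.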

Edit: As per the ongoing discussion, it would be advisable to employ a queue of your objects from which each of your N threads can simply pop as they are ready for more work. The queue will of course need to be thread-safe. For optimal performance, a lock-free queue should be implemented. There's a good paper here. The implementation should be simplified by the fact that you are fully populating the queue once and therefore only need thread-safe reads.
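A mutex-guarded shared index is a much simpler starting point than a lock-free queue and illustrates the same pop-as-you-go idea (a sketch under the same stand-in names as above; a lock-free version would replace claim() with an atomic fetch-and-increment):

#include <cstddef>
#include <vector>
#include <boost/thread.hpp>

struct my_struct { double value; };
std::vector<my_struct> my_vec(400);
void my_fun(my_struct* s) { s->value += 1.0; }

boost::mutex queue_mutex;
std::size_t next_item = 0;

// Claim the next unprocessed index; returns false when no work remains.
bool claim(std::size_t& idx){
  boost::mutex::scoped_lock lock(queue_mutex);
  if (next_item >= my_vec.size()) return false;
  idx = next_item++;
  return true;
}

void worker(){
  std::size_t idx;
  while (claim(idx))       // pop work until the "queue" is empty
    my_fun(&my_vec[idx]);
}

int main(){
  unsigned N = boost::thread::hardware_concurrency();
  if (N == 0) N = 2;
  boost::thread_group pool;
  for (unsigned t = 0; t < N; t++)
    pool.create_thread(worker);
  pool.join_all();
}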

Upvotes: 15

johnnycrash

Reputation: 5344

First, running more threads than your hardware can execute simultaneously is a waste. One core with hyperthreading or SMT or whatever the chip manufacturer wants to call it can run 2 or more simultaneous threads, so figure out how many simultaneous threads each of your cores can handle and multiply that by the number of cores. There is no need for more threads than that. You had 400 threads; at any one time, probably 396 of them were asleep.

Rather than worrying about cache-line alignment, you need to worry about "locality". When you spin through data larger than the L2 cache, every memory access is a slow access all the way out to RAM. If you spin through data smaller than the L2 cache, every memory access hits the L2 cache, which is ~100x faster. Also, if all the data accesses are slow ones, then all the execution threads on the CPU will be stalled. SMT only works because, more often than not, one thread is stalled waiting for RAM, so the CPU can execute the other thread. If you do things wrong enough and stall all the threads, you have basically disabled SMT: you now have no simultaneous threads.

So...if your dataset is larger than L2 cache, you need to "strip mine". Break the calculation into parts small enough to fit in L2 cache. For example, if you have a matrix, then divide the matrix into n x m squares that can fit into L2 cache and let the correct # of threads work on that. When that strip is done, move to the next and so on. If you do this right, your code could become 100x faster.
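A hedged sketch of that blocking idea (single-threaded for clarity; the matrix size N, the tile edge BLOCK, and the stand-in computation are illustrative assumptions, with BLOCK chosen so one tile fits in L2):

#include <cstddef>
#include <vector>

const std::size_t N = 4096;   // matrix dimension (assumption)
const std::size_t BLOCK = 64; // tile edge sized so a tile fits in L2 cache

void process(std::vector<double>& m){
  // Visit the matrix one BLOCK x BLOCK tile at a time so each tile
  // stays cache-resident while it is being worked on.
  for (std::size_t bi = 0; bi < N; bi += BLOCK)
    for (std::size_t bj = 0; bj < N; bj += BLOCK)
      for (std::size_t i = bi; i < bi + BLOCK && i < N; i++)
        for (std::size_t j = bj; j < bj + BLOCK && j < N; j++)
          m[i * N + j] *= 2.0;  // stand-in computation on one element
}

Each tile (or strip of tiles) can then be handed to a thread, so the right number of threads work within cache-sized pieces instead of streaming the whole matrix from RAM.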

Another way to increase locality is to shrink your data. Make the data as small as possible. The smaller the data, the more it stays in the L2 cache.

Upvotes: 2

paxdiablo

Reputation: 881463

The only way in which it is beneficial to have more threads than there are actual execution engines (CPUs or cores or whatever is being used - I'll just call them CPUs here) is if some of the threads are actually waiting on resources other than those CPUs.

If the threads are CPU-bound, the ideal number is equivalent to the number of CPUs you have available to you. If many of the threads are waiting on file I/O or database accesses or network traffic or OS events (and so on), then a couple of hundred may be okay. But, in your case, that appears not to be so.

A thread pool is really a way to avoid continuously creating and destroying threads in situations where that could be relatively inefficient. For example, if it takes ten seconds to start a thread and each only does one second of work, then a thread pool would be ideal.
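For reference, one common way to get pool behavior with Boost is to run an asio io_service on N threads and post tasks to it. This is a hedged sketch of that recipe (one round of the question's 400 tasks, with the same stand-in names as in the earlier sketches), not a claim that the asker needs it:

#include <cstddef>
#include <vector>
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/scoped_ptr.hpp>
#include <boost/thread.hpp>

struct my_struct { double value; };
std::vector<my_struct> my_vec(400);
void my_fun(my_struct* s) { s->value += 1.0; }

int main(){
  unsigned N = boost::thread::hardware_concurrency();
  if (N == 0) N = 2;

  boost::asio::io_service io;
  // The work object keeps io.run() from returning while tasks are queued.
  boost::scoped_ptr<boost::asio::io_service::work> keep_alive(
      new boost::asio::io_service::work(io));

  boost::thread_group pool;
  for (unsigned t = 0; t < N; t++)
    pool.create_thread(boost::bind(&boost::asio::io_service::run, &io));

  // One queued task per struct; the N pool threads pull them as they go.
  for (std::size_t i = 0; i < my_vec.size(); i++)
    io.post(boost::bind(my_fun, &my_vec[i]));

  keep_alive.reset(); // let run() return once the queue drains
  pool.join_all();
}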

Given that you're likely to be reducing the number of threads to something substantially less than four hundred (say about two or four), which will in turn ramp up the work done by each, a thread pool may not be needed. But again, it depends on the amount of work that will be done by the threads compared with their startup cost.

To keep it simple, I'd start with the non-pool version and only contemplate changing if there's a serious performance issue. Otherwise you may be giving yourself extra work for no real benefit.

You can still divide up your work into four hundred units but the best approach will be to simply queue them up and have each of your threads pull an item from the queue when it's ready to process one. That way, the work is automatically balanced among CPUs. If, for some bizarre reason, CPU number 1 runs twice as fast as the others, it will automatically get twice as much of the workload.

This is more important than you think, simply because it's a near certainty that the CPUs will be doing other stuff as well - they're unlikely to be totally dedicated to just doing this work.

Upvotes: 4

murrekatt

Reputation: 6129

For computation-intensive work you will be bound by the number of cores you have, so it is recommended to use as many threads as you have cores.

Divide the work into as many parts as you have cores, create the same number of threads, and run them.

If all the work items are independent, you can just divide them into equal-sized groups. If there is a dependency between work items (item1's result is needed by item2), then you need to divide them in some way that makes sense given the dependency.

Upvotes: 2

INS

Reputation: 10820

On a single-processor computer, a multithreaded approach will most likely be slower than a single-threaded one if all you do is computation, because of the context switches.

Usually, if some threads are waiting on peripheral devices, a multithreaded approach can offer some flexibility to your application.

In your case - CPU-intensive tasks - I doubt that a multithreaded approach will bring any performance benefit to your application.

Upvotes: 1

Jeff Foster

Reputation: 44706

Hundreds of threads sounds like a problem for computationally expensive tasks. Chances are that the program is spending more time context switching than processing. Try using N threads (where N is the number of cores in your machine) and chunking the work up into bigger units.

Upvotes: 3
