Shredderroy

Reputation: 2920

What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)

I have a vector<int> with 10,000,000 (10 million) elements, and my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime of ThrFunc is roughly the same for each integer in the vector<int>.

How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?

Editing to provide extra information

Upvotes: 25

Views: 11257

Answers (7)

aderchox

Reputation: 4074

I've found a real-world example that I'll put here for those who want a less technical, more intuitive answer:

Having multiple threads per core is like having two queues in an airport for each scanner (which people in both queues eventually have to pass through).

Two people at a time can put their baggage on the conveyor belt, but only one at a time can pass through the scanner. There is obviously a contention point at the entrance of the scanner, but in practice both queues function very well most of the time.

In this example, the queues represent threads and the scanner is the execution unit of a core. As a general rule of thumb, a second thread on a core adds roughly a quarter of a core's worth of throughput (about 1.25x in total), i.e., it's not like having an entire new core. So if the task is CPU-bound, slightly more threads than the number of available processors is probably best.

But notice that if the task is I/O-bound, where threads will be spending most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.

Source1, Source2

Upvotes: -1

Olof Forshell

Reputation: 3274

The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).

A single core running generic software whose code and data are not optimized for multi-core will come close to saturating the memory system all by itself. Adding more cores will, in such a scenario, result in a slower application.

So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).

Upvotes: 0

jupp0r

Reputation: 4670

I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.

A good way to numerically determine the number of threads to start would be to use

std::thread::hardware_concurrency()

This is part of C++11 and should yield the number of logical cores in the current system. Logical cores means either the number of physical cores (in case the processor does not support hardware threads, i.e. Hyper-Threading) or the number of hardware threads.

There's also a Boost function that does the same; see Programmatically find the number of cores on a machine.
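
For example, a minimal sketch (note that the standard allows hardware_concurrency() to return 0 when the value is not computable, so a fallback is needed):

#include <iostream>
#include <thread>

int main()
{
    // hardware_concurrency() is only a hint and may return 0 if the
    // value is not computable; fall back to something sensible.
    unsigned int num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0)
        num_threads = 4;  // arbitrary fallback, pick what suits your application

    std::cout << "Launching " << num_threads << " worker threads\n";
}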

Upvotes: 4

sarnold

Reputation: 104070

Borealid's answer includes "test and find out", which is impossible to beat as far as advice goes.

But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)

If the data is modified in any way, then you will see significant performance improvements if you keep the threads far away from each other. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.

So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), starting thread M at offset SIZE/N*M, for M = 0 through N-1 (offsets 0, 1000, 2000, 3000 for four threads and 4000 data objects). This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache line bouncing; a code sketch follows the diagram:

+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
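
A rough sketch of that block layout with std::thread (ThrFunc and the int elements are taken from the question, with the signature assumed; the lambda-based chunking is just one way to write it):

#include <thread>
#include <vector>

void ThrFunc(int& x);  // the per-element function from the question (signature assumed)

// Give thread M the contiguous block starting at SIZE/N*M, so different
// threads touch cache lines that are far apart from each other.
void process_in_blocks(std::vector<int>& data, unsigned int num_threads)
{
    std::vector<std::thread> workers;
    const std::size_t block = data.size() / num_threads;

    for (unsigned int m = 0; m < num_threads; ++m) {
        const std::size_t begin = m * block;
        const std::size_t end =
            (m + 1 == num_threads) ? data.size() : begin + block;  // last thread takes the remainder
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                ThrFunc(data[i]);
        });
    }
    for (auto& w : workers)
        w.join();
}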

If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them at 0, 1, 2, 3, etc. and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data; a sketch follows the diagram:

+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
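
The strided version might look like this (again with ThrFunc assumed from the question):

#include <thread>
#include <vector>

void ThrFunc(int& x);  // the per-element function from the question (signature assumed)

// Thread m handles elements m, m + N, m + 2N, ... so neighbouring threads
// walk through the same pages (and cache lines) at roughly the same time.
void process_interleaved(std::vector<int>& data, unsigned int num_threads)
{
    std::vector<std::thread> workers;
    for (unsigned int m = 0; m < num_threads; ++m) {
        workers.emplace_back([&data, m, num_threads] {
            for (std::size_t i = m; i < data.size(); i += num_threads)
                ThrFunc(data[i]);
        });
    }
    for (auto& w : workers)
        w.join();
}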

I also recommend using sched_setaffinity(2) directly in your code to force the different threads to their own processors. In my experience, Linux tries so hard to keep each thread on its original processor that it will not migrate tasks to other cores that are otherwise idle.
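
sched_setaffinity(2) takes a thread ID; with std::thread it is usually easier to go through its pthreads wrapper, pthread_setaffinity_np, on the native handle. A Linux-only sketch (the one-thread-per-CPU mapping is just an assumption):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // pthread_setaffinity_np is a GNU extension
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a worker thread to a single CPU so the scheduler cannot move it.
void pin_to_cpu(std::thread& t, unsigned int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Returns 0 on success; error handling omitted for brevity.
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// Typical use: pin_to_cpu(workers[m], m) right after creating thread m.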

Upvotes: 12

ciphor

Reputation: 8288

If the computation on each element is independent, the optimal number of threads should equal the number of cores; in that situation the computation capacity of each core will be fully utilized.

Upvotes: 2

Andrew Cooper

Reputation: 32576

Assuming ThrFunc is CPU-bound, you probably want one thread per core, dividing the elements between them.

If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.

Upvotes: 5

Borealid

Reputation: 98479

The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.

In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). Contention arises when threads have to share a core, and a core can run either one or two logical threads (two with hyperthreading), so the thread count that minimizes contention is the number of cores or twice that.

If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.

The best way to find the correct answer is, as with all hardware questions, to test and find out.

Upvotes: 25
