Peter Kraus
Peter Kraus

Reputation: 79

Thread synchronisation for very short tasks

I have a C++ application running on winapi. Portability is not an issue. All I want is maximum performance. I have a basic understanding of multithreading and synchronization issues, but limited experience with the multitude of options ranging from winapi over C++ threads to third party libraries.

In the performance critical core of my application I identified a loop, which could be parallelized. I managed to split the loop into 4 parts which do not depend on each other. I would like to delegate the job to 4 threads running in parallel. The main thread should wait until all 4 threads have done their job, before it continues.

Sounds very simple. However, currently the loop takes only about 10 microseconds when running on one thread. I'm afraid that synchronization methods which cause a switch to the kernel (events, mutexes, etc.) would produce more overhead than the parallelization could save. SRWLocks + condition variables claim to be very lightweight, but I didn't find a way to solve my synchronization with these tools.

Of course I could test all kinds of synchronization APIs, but I'm sure this has been done before.

So my question is: Is there a reasonable way to synchronize very short tasks and if so, what are the appropriate tools?

Upvotes: 0

Views: 141

Answers (2)

David Schwartz
David Schwartz

Reputation: 182753

The update algorithm consists of two steps. Each of these steps can be applied to the knots in arbitrary order, but step 1 must be completed before step 2 can start. I can portion the whole net into four (or more) parts and delegate each part to a separate thread. My problem is: Each thread has to pause after step 1 and wait until all threads have finished their job. Then each thread makes step 2, wait for completion of the other threads and so on.

You want to break the work into a large number of small chunks and have a fixed pool of threads take chunks of work. Do not make 8 threads on an 8 core machine and split the work into 8 chunks. That algorithm will work poorly if, for one reason or another, only 7 of those cores winds up doing work for you. Your algorithm will need twice as long as the second half of the time only one core is working.

The easy way is to have an extra dispatch thread. Just keep a "work unit" count somewhere protected by a mutex. When a thread finishes a work unit, have it decrement the "work unit" count. When it hits zero, broadcast a condition variable. That will wake the dispatch thread which will then do whatever it takes to get the worker threads going again. It can start them by setting the "work unit" count to the right level and broadcasting another condition variable that the worker threads wait for.

You can also just keep a count of which node needs to be done next and the number of nodes currently doing work. That will require synchronization after each thread though (to figure out which node to do next) and it may make more sense to have each thread grab some number of nodes, iterate over them, and then synchronize to grab another few nodes.

Avoid breaking the work into large chunks early. That can lead to the problem where you have 8 cores but 2 large work units left at some point. Remember, many modern CPUs run their cores at different speeds based on temperature and power measurements.

Upvotes: 0

SoronelHaetir
SoronelHaetir

Reputation: 15164

If you simply need to wait for threads to complete you would use WaitForMultipleObjects on the thread handles. The other direct option would be to use a synchronization barrier, a primitive that allows a group of threads to halt until all members of the group have reached the barrier, but that is generally for the case where there is more work for the spawned threads to perform after being released.

Your question of whether this would actually be of benefit in your particular case is one that can only be answered through implementation and timing. And note that if you are going to perform this testing it should be done on a release build with optimizations enabled. It may well be the case that if the amount of work to perform is short enough that the time involved in thread management dwarfs any benefit.

Upvotes: 1

Related Questions