VB_
VB_

Reputation: 45732

Fork/Join: optimal threads count

Task definition: I need to map a very big array. For instance, let's it be afindMax() function. So the task is to do that as quick as possible (that mean in parallel).

HW: I've got 8 cores each of them has 2 hyper threads

public static void main(String... args) {
   int maxThreadAmount = Runtime.getRuntime().availableProcessors(); // GET 8
}

Solution#1: Just to run the task into N threads. Where N should be some optimal number.

Question#1: Is the next right: int optimalThreadAmount = maxThreadAmount - 1?

Solution#2: I want do solve that problem via Fork/Join framework. Each task splits into two parallel subtask if input is too large. So I'll get something like

                   Find Max 
                   [array]    <---- +1 pending thread
                   /        \
                  /          \
           Find Max        Find Max
          [1/2 array]      [1/2 array]  <-------- +2 pending threads
            /      \          /   \
           /        \        ..   ..
      Find Max      Find Max
     [1/4 array]    [1/4 array]      <-------- +4 four pending threads
       /       \        / \
      /         \      ..  ..
    Find Max    Find Max
    [1/8 array] [1/8 array]  <----------- +8 active threads

Question#2: Taking into account that with Fork/Join algorithm we'll get a bunch of pending threads what will be the optimal number of threads?

Upvotes: 3

Views: 3031

Answers (2)

Christophe De Troyer
Christophe De Troyer

Reputation: 2922

The optimal number of threads should be around the number of cores your machine has. However, keep in mind that when you split your workload over two subtasks, one task should be computed and the other one should be forked. This implies that you only create one extra thread when splitting.

Most fork/join algorithms are accompanied with a sequential cutoff. When you reach a certain condition (e.g., array to determine maximum value of is of size 1000) you switch to a sequential algorithm (i.e., checking the elements one by one). So if I were to guess the optimal situation to calculate your problem I would say that at the moment you have split 14 times, resulting in 16 threads, and then switch over to a sequential algorithm. This would mean that each core has a thread running and thus will stay busy. (This guess assumes your cores have something like hyperthreading, if not I'd say 8 threads).

Also, it's not recommended to hardcore the equation you gave (int optimalThreadAmount = maxThreadAmount - 1) because it implies you assume the machine to be doing nothing else and have all threads at your disposal.

My guess is that, when using a sequential cutoff, your optimal performance will be around 16 threads (when no other processes are using your machine). You can test for yourself, which always the best way to go. The issue you want to investigate is that when you start spawning a lot of threads, the overhead for each thread will become apparent.

Ps: the advantage of using fork/join is that your code will be able to scale well with respect to number of cores the machine has. More cores means that more threads will run in parallel. Which implies that your thread scheduler can put more threads to work.

Edit So what's is the best cutoff I should use for a given number of cores?

Well, my guess would be that you implement a fork/join algorithm. You have a sequential cutoff (i.e., stop forking and joining once my input array has size x).

When you know on which system you have to run the algorithm you could then run a benchmark.

For a sequential cutoff of x to y you run your code. Each iteration you measure how long it takes for you to apply your algorithm. Doing so you will see which configuration works best for you.

Then again, if you want a quick and dirty approach you could do the following:

Machine with p cores, input array has a size of s:

Sequential cutoff = s/8

However, I strongly advise against this, as said before.

Upvotes: 3

Ralf H
Ralf H

Reputation: 1474

OK, I suppose you want to map x to, say sqrt(x). You could concurrently obtain the square roots, and put them in a corresponding array. However, depending on the time needed for sqrt and reading/writing menory, you might saturate your mem access with two threads. Make sure each concurrent job runs for at least a µs or the fork/join overhead becomes too big.

Upvotes: 1

Related Questions