Reputation: 10535
I wrote a parallel Java program. It works like this: a String input is taken as input; input is cut evenly into String inputs[numThreads]; inputs[i] is assigned to thread_i to process and generates results[i]; the main thread then merges the results[i] into result.
The performance data on a machine with 10 physical cores is below.
Threads#    1 thread    2 threads    4 threads    8 threads    10 threads
Time (ms)   78          41           28           21           21
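The overall structure is roughly the sketch below; process and merge are placeholders for the real per-chunk work and the final merge, and the ExecutorService is a simplification of how I actually manage the threads.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelProcessor {

        // Cut the input into numThreads even chunks, process each chunk on its
        // own thread, then merge the per-thread results on the main thread.
        static String run(String input, int numThreads) throws Exception {
            String[] inputs = split(input, numThreads);
            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            try {
                List<Future<String>> futures = new ArrayList<>();
                for (String chunk : inputs) {
                    futures.add(pool.submit(() -> process(chunk)));
                }
                String[] results = new String[numThreads];
                for (int i = 0; i < numThreads; i++) {
                    results[i] = futures.get(i).get();   // wait for thread i
                }
                return merge(results);                   // main thread merges
            } finally {
                pool.shutdown();
            }
        }

        // Even split by character count; placeholder for the real partitioning.
        static String[] split(String input, int n) {
            String[] parts = new String[n];
            int len = input.length();
            int chunk = (len + n - 1) / n;
            for (int i = 0; i < n; i++) {
                int from = Math.min(i * chunk, len);
                int to = Math.min(from + chunk, len);
                parts[i] = input.substring(from, to);
            }
            return parts;
        }

        // Placeholders for the real per-chunk work and the final merge.
        static String process(String chunk) { return chunk.toUpperCase(); }
        static String merge(String[] results) { return String.join("", results); }
    }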
Note:
It seems the memory bandwidth becomes the bottleneck when there are more than 8 threads.
In this case, how can I further improve the performance? Are there any design issues in my parallel Java program?
To examine the cause of this scalability issue, I inserted a loop of meaningless computation into the process(inputs[i]) method. Here is the new data:
Threads#    1 thread    10 threads
Time (ms)   41000       4330
The new data shows good scalability with 10 threads, which in turn suggests that the original version (without the meaningless loop) has a memory issue that limits its scalability to about 8 threads.
Is there any way to circumvent this issue, such as pre-loading the data into each core's local cache, or loading it in batches?
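For reference, the extra loop was roughly of the following shape; the iteration count and the arithmetic are arbitrary, the only point being to add pure CPU work per chunk, and returning the accumulator keeps the JIT from removing the loop as dead code.

    // Busy-work added inside process(inputs[i]) to make each task CPU-bound.
    static long meaninglessWork(int iterations) {
        long acc = 0;
        for (int k = 0; k < iterations; k++) {
            acc += k * 31L;               // arbitrary arithmetic, no memory traffic
        }
        return acc;                       // returned so the JIT keeps the loop
    }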
Upvotes: 1
Views: 1054
Reputation: 116818
I find it unlikely that you have a memory bandwidth issue here. It is more likely that your run times are so short that, as you approach 0, you are mostly timing the thread startup/shutdown or the HotSpot JIT compiler's optimization cycles. Timing information from a Java task that runs this briefly is close to worthless. The JIT compiler and the other optimizations that run initially often dominate CPU usage early in a class's life. Our production applications stabilize only after minutes of live service operation.
If you can significantly increase your run times, by adding more input data or by calculating the same result over and over, you may get a better idea of what the optimal thread count is.
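For example, a crude way to stretch the measurement past the warm-up phase is to repeat the same work and time only the later repetitions; a harness such as JMH does this far more rigorously, and the run counts below are just placeholders.

    // Repeat the work: the untimed iterations let the JIT warm up, then the
    // measured iterations are timed as one block.
    static long timeMillis(Runnable work, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            work.run();                          // not timed: JIT warm-up
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            work.run();                          // timed repetitions
        }
        return (System.nanoTime() - start) / 1_000_000;
    }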
Edit:
Now that you have added timings for 1 and 10 threads over a longer run, it looks to me that you are not bound by anything, since the scaling is fairly linear with some thread overhead: 41000/10 = 4100 versus 4330 for 10 threads.
Pretty good demonstration of what threading can do for a CPU-bound application. :-)
Upvotes: 6
Reputation:
How many logical cores do you have?
Consider: imagine you had one core and a hundred threads. The work to be done is the same and cannot be distributed over multiple cores, but now you have a great deal of thread-switching overhead.
Now imagine you have, say, four cores and four threads. Assuming no other bottlenecks, compute time is quartered.
Now imagine you have four cores and eight threads. Your compute time will still be approximately quartered, but you'll have added some thread-swapping overhead.
Be aware of hyperthreading and that it may help or hinder you, depending on the nature of the compute task.
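In Java you can at least check how many logical processors the JVM sees (with hyperthreading this is typically twice the physical core count) and size a fixed pool from that; a minimal sketch:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            // Logical processors visible to the JVM, hyperthreads included.
            int logicalCores = Runtime.getRuntime().availableProcessors();
            System.out.println("Logical cores: " + logicalCores);

            // For a CPU-bound task, a pool no larger than the core count
            // avoids paying for thread switching without gaining parallelism.
            ExecutorService pool = Executors.newFixedThreadPool(logicalCores);
            pool.shutdown();
        }
    }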
Upvotes: 2
Reputation: 20320
I'd say your losses are down to switching threads. You have more threads than cores, and none need to block for slower processes, so they are getting switched in, doing a bit of work, and then getting switched out so another one can be switched in. Switching threads is an expensive process. Given the nature of what you appear to be doing, I would have instinctively restricted the number of threads to 8 (leave two cores for the OS), and your performance numbers appear to bear me out.
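As a minimal sketch of that sizing rule (the physical core count is hard-coded because availableProcessors() reports logical, not physical, processors; the 10 comes from the question):

    public class ThreadCount {
        // Physical core count from the question; the JVM cannot report it directly.
        static final int PHYSICAL_CORES = 10;

        // Leave two cores for the OS and other processes.
        static final int WORKER_THREADS = Math.max(1, PHYSICAL_CORES - 2);

        public static void main(String[] args) {
            System.out.println("Worker threads: " + WORKER_THREADS);
        }
    }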
Upvotes: 0