Reputation: 13
So, I am practicing multi threading in java and trying to add the elements of a randomly generated 2D integer array both sequentially and using 4 threads. I measured the performance of my code and for some reason sequential part is a lot faster than multi-threaded. Here is the code for sequential addition:
public class ArraySum2DNonMT {
private int[][] arrayToSum;
private int totalSum;
public ArraySum2DNonMT(int[][] arr){
this.arrayToSum = arr;
this.setTotalSum(0);
}
public void runSequential(){
for(int i = 0; i < arrayToSum[0].length; i++){
for(int j = 0; j < arrayToSum.length; j++){
setTotalSum(getTotalSum() + arrayToSum[j][i]);
}
}
}
public int getTotalSum() {
return totalSum;
}
public void setTotalSum(int totalSum) {
this.totalSum = totalSum;
}
}
Here is the code for Multi-threaded version:
package multiThreaded;
/**
*
* @author Sahil Gupta
*
* This class takes in a 2D integer array and adds it's contents. This
* addition will be concurrent between several threads which will divide
* the work of the array based on the threadID assigned to thread by the
* programmer. Assume that the passed in 2D array to the constructor is a
* matrix with each array in the main array having same length.
*/
public class ArraySum2D implements Runnable{
private int[][] arrayToSum;
private int threadID;
private int totalSum;
public ArraySum2D(int[][] arr, int threadID){
this.arrayToSum = arr;
this.threadID = threadID;
this.setTotalSum(0);
}
@Override
public void run() {
int arrayCol = arrayToSum[0].length;
int arrayRow = arrayToSum.length;
int colStart = (int)((threadID%2) * (arrayCol/2));
int rowStart = (int)((int)(threadID/2) * (arrayRow/2));
int colEnd = colStart + (int)(arrayCol/2);
int rowEnd = rowStart + (int)(arrayRow/2);
for(int i = colStart; i < colEnd; i++){
for(int j = rowStart; j < rowEnd; j++){
setTotalSum(getTotalSum() + arrayToSum[j][i]);
}
}
}
public int getTotalSum() {
return totalSum;
}
public void setTotalSum(int totalSum) {
this.totalSum = totalSum;
}
}
Here is the main:
package controller;
import java.util.Random;
import multiThreaded.ArraySum2D;
import sequentialNonMT.ArraySum2DNonMT;
public class ControllerMain {
private final static int cols = 20;
private final static int rows = 10;
private static volatile int[][] arrayToAdd = new int[rows][cols];
private static Random rand = new Random();
private static ArraySum2D a0, a1, a2, a3;
public static void main(String[] args) throws InterruptedException{
for(int j = 0; j < rows; j++){
for(int i = 0; i < cols; i++){
arrayToAdd[j][i] = rand.nextInt(100);
}
}
ArraySum2DNonMT a = new ArraySum2DNonMT(arrayToAdd);
long startTimeSequential = System.nanoTime();
a.runSequential();
long estimatedTimeSequential = System.nanoTime() - startTimeSequential;
System.out.println("The total sum calculated by sequential program is: " + a.getTotalSum());
System.out.println("The total time taken by sequential program is: " + estimatedTimeSequential);
a0 = new ArraySum2D(arrayToAdd, 0);
a1 = new ArraySum2D(arrayToAdd, 1);
a2 = new ArraySum2D(arrayToAdd, 2);
a3 = new ArraySum2D(arrayToAdd, 3);
Thread t0 = new Thread(a0);
Thread t1 = new Thread(a1);
Thread t2 = new Thread(a2);
Thread t3 = new Thread(a3);
long startTimeMultiThreaded = System.nanoTime();
t0.start();
t1.start();
t2.start();
t3.start();
t0.join();
t1.join();
t2.join();
t3.join();
int Sum = addThreadSum();
long estimatedTimeMultiThreaded = System.nanoTime() - startTimeMultiThreaded;
System.out.println("The total sum calculated by multi threaded program is: " + Sum);
System.out.println("The total time taken by multi threaded program is: " + estimatedTimeMultiThreaded);
}
private static int addThreadSum(){
return a0.getTotalSum() + a1.getTotalSum() + a2.getTotalSum() + a3.getTotalSum();
}
}
The output I am currently getting shows significant difference in runtime (measured here in nanoseconds). This is what I get:
The total sum calculated by sequential program is: 10109
The total time taken by sequential program is: 46000
The total sum calculated by multi threaded program is: 10109
The total time taken by multi threaded program is: 641000
The sequential code is around 13 times faster. Can you please help me point out what I might not be doing correctly? I have a dual core i7 haswell, macbook air. I am not sure why it is taking longer, but here are a few ideas I have in mind which might be causing this: False sharing, Too much parallelism/threads (4 for dual core), Cache coherence protocol might not be favoring me, some other basic multi threaded stuff I am missing/don't know about.
Please help me identify the specific cause and ways I can make multi threaded run faster than sequential. Thank you so much for helping me out!
Edit: More info about the processor and it's caches: Processor Name: Intel Core i7 Processor Speed: 1.7 GHz Number of Processors: 1 Total Number of Cores: 2 L2 Cache (per Core): 256 KB L3 Cache: 4 MB
I think this can have up to 4 threads according to Intel's data sheet.
P.S. This is my first post to ask question but I have been using this site to clear doubts for a while now. Forgive me for any mistakes I make.
Upvotes: 1
Views: 7166
Reputation: 2122
I totally agree with Makoto, as far as why you see the multithreaded program being slower: thread creation overhead dwarfs the tiny calculation time for such a small array, as you can see by bumping up the array sizes.
Although this doesn't directly answer your question, I thought you might find this interesting, as there is a trivial change that can make both versions much faster.
Consider your original code:
for(int i = 0; i < arrayToSum[0].length; i++){
for(int j = 0; j < arrayToSum.length; j++){
setTotalSum(getTotalSum() + arrayToSum[j][i]);
}
}
In java, there's really no such thing as 2D arrays; instead a 2D array is really an array of array references under the covers. (Good reference: http://www.willamette.edu/~gorr/classes/cs231/lectures/chapter9/arrays2d.htm) In your original for loop, the inner (j
) loop is walking across all the arrays, fetching one element at a time. (i.e. add the first element of all the arrays, then the second, etc.) This behavior on a large array pretty much guarantees that you get no help from your memory cache, as your code has very poor locality of reference.
If you swap the order of iteration, like so:
for(int j = 0; j < arrayToSum.length; j++){ // <-- this used to be the inner loop
for(int i = 0; i < arrayToSum[0].length; i++){
setTotalSum(getTotalSum() + arrayToSum[j][i]);
}
}
Now your inner loop is walking sequentially through one array at a time, and you have excellent locality of references, which is very cache-friendly.
On my machine, this second version runs nearly 20 times as fast as the first version. (And, in case you are interested, on my machine, the multi-threaded version runs about 2.8x as fast as the single-threaded version, so when you combine the two changes, the cache-friendly, multi-threaded array sum operation runs nearly 60x as fast as the original, single-threaded, cache-hostile :) version.
Upvotes: 1
Reputation: 106400
There is a sizable amount of overhead when establishing threads. That said, if your sample data set is too small, the amount of time spent spinning up and tearing down the threads will be greater than the actual running time performance of your code.
Let's look at it subjectively. You have an array that contains only 200 elements. Your method's runtime is O(nm), where n is the row size, and m is the column size.
To be frank, the only kind of machine I expect not to churn through 200 elements quickly in this fashion would be my old Pentium III machine. And even then it wouldn't be that far off.
I've got a relatively beefy i7-4770K, which can do 4 cores with two threads per core. If I run your program with those lower numbers, I get about the same results.
But...what if I set my bounds a little larger? Let n = 2**m*, and let n = 9000.
Don't pay any attention to the sums. Integer overflow has totally wrecked any value we're getting out of it.
The total sum calculated by sequential program is: -570429863
The total time taken by sequential program is: 3369190200
The total sum calculated by multi threaded program is: -570429863
The total time taken by multi threaded program is: 934624554
The threaded version ran in 27% of the time, or about 3.6 times faster. Or in layman's terms, 3.36 seconds vs. 934ms. That's huge.
Threading didn't change the performance of the algorithm - it's still horribly inefficient at O(nm) - but it did change the runtime constant, such that it's not quite, but close to 1/4 the time. The only reason I was able to glean an advantage from this was due to the size of data I pushed through it. Otherwise, threading just isn't worth it.
Upvotes: 2