Reputation: 1091
I am trying to implement matrix multiplication with multiple threads. Everything seems to work correctly, however, it work much slower than the usual algorithm. Here is my code
public class Main {
private static int nRows = 500; //number of rows and columns in matrices
private static int[][] matrix1 = new int[nRows][nRows]; //first matrix for multiplication
private static int[][] matrix2 = new int[nRows][nRows]; //second matrix for multiplication
private static int[][] result1 = new int[nRows][nRows]; //result from linear matrix multiplication
private static int[][] result2 = new int[nRows][nRows]; //result from parallel matrix multiplication
private static Thread[][] pool = new Thread[nRows][nRows]; //array of threads
//method used for transposing a matrix to get its column easily
public static int[][] transpose(int[][] matrix) {
int[][] newMatrix = new int[matrix[0].length][matrix.length];
for (int i = 0; i < matrix[0].length; i++) {
for (int j = 0; j < matrix.length; j++) {
newMatrix[i][j] = matrix[j][i];
}
}
return newMatrix;
}
public static void main(String[] args) {
//initializing input matrices (setting all elements = 1)
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
matrix1[i][j] = 1;
matrix2[i][j] = 1;
}
}
long start;
long end;
System.out.println("Linear algorithm");
start = System.currentTimeMillis();
//linear multiplication algorithm
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
int temp = 0;
for (int k = 0; k < nRows; k++) {
temp += matrix1[i][k] * matrix2[k][j];
}
result1[i][j] = temp;
}
}
//show result
// for(int i=0;i<nRows;i++){
// for(int j=0;j<nRows;j++){
// System.out.print(result1[i][j] + " ");
// }
// System.out.println();
// }
end = System.currentTimeMillis();
System.out.println("Time with linear algorithm: " + (end - start));
//--------------------
System.out.println("Parallel algorithm");
start = System.currentTimeMillis();
int[][] matrix3 = transpose(matrix2); //get a transpose copy of second matrix
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
pool[i][j] = new myThread(matrix1[i], matrix3[j], i, j); //creating a thread for each element
pool[i][j].start(); //starting a thread
}
}
for (int i = 0; i < nRows; i++) {
for (int j = 0; j < nRows; j++) {
try {
pool[i][j].join(); //waiting for the thread to finish its job
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
//show the result
// for(int i=0;i<nRows;i++){
// for(int j=0;j<nRows;j++){
// System.out.print(result2[i][j] + " ");
// }
// System.out.println();
// }
end = System.currentTimeMillis();
System.out.println("Time with parallel algorithm: " + (end - start));
}
//class, where parallel multiplication is implemented
private static class myThread extends Thread {
private int[] row = new int[nRows]; //row for multiplication
private int[] col = new int[nRows]; //column for multiplication
private int i; //row index of the element in resulting matrix
private int j; //column index of the element in resulting matrix
//constructor
public myThread(int[] r, int[] c, int i, int j) {
row = r;
col = c;
this.i = i;
this.j = j;
}
public void run() {
int temp = 0;
for (int k = 0; k < nRows; k++) {
temp += row[k] * col[k]; //getting the element by multiplying row and column of two matrices
}
result2[i][j] = temp; //writing the resulting element to the resulting matrix
}
}
}
Here, I create a new thread for each element in the resulting matrix. I than write these threads to an array, start them and, finally, wait for them to finish working. I've seen some realizations, where the whole input matrix (both of them) would be given as parameters to the thread. My task is, however, to come up with an algorithm, where only one row and one column (that are necessary for this particular element) are given.
After measuring the time elapsed I get following results
Linear algorithm
Time with linear algorithm: 557
Parallel algorithm
Time with parallel algorithm: 38262
What am I doing wrong? Thanks in advance!
Upvotes: 0
Views: 1347
Reputation: 1904
Parallel processing only works well with copious processors. If you don't have sufficient processors to split the work into, and sufficient processors to handle the load, then parallelizing is not going to knock your socks off. Parallelizing may make processing slower.
If you have P processors and 8(P) concurrent requests, then using one thread per request is often more efficient for throughput. The break-even point for decomposition is somewhere just greater than 8(P), depending on the application. The logic here is simple. If you have P processors available and you split your work accordingly but there are hundreds of other tasks ahead of you, then what's the point of splitting? It may be faster to process each request sequentially.
Matrix multiplication is a real memory hog. Without sufficient memory, parallelization may make processing slower.
That said a good way to split and join is too lengthy to duplicate here. I maintain an open-source divide-and-conquer product that has matrix multiply as one of its built-in functions. Have a look and do what you may.
Upvotes: 0
Reputation: 74485
The code you've written will work fine on a GPU where the concept of threads is very different and the overhead is basically zero. On CPU-based systems, spawning threads is an exceptionally slow operation and it only makes sense if you can amortise this overhead over a lot of computational work.
Here is some general advice that will help you write better parallel algorithms for CPUs:
result2
fall in the same cache line (cache lines on x86 and ARM are 64 bytes long, int
is 4 bytes) and are written by 16 different threads. The use of a temporary summation variable alleviates this problem somehow--it is much more severe when false sharing happens repeatedly in an inner(-most) loop.In summary, start as many threads as physical cores and have them work on big contiguous chunks of the input matrices.
Upvotes: 0