Parallel matrix multiplication in java

Question

I am trying to implement matrix multiplication with multiple threads. Everything seems to work correctly, however, it work much slower than the usual algorithm. Here is my code

public class Main {
    private static int nRows = 500; //number of rows and columns in matrices
    private static int[][] matrix1 = new int[nRows][nRows]; //first matrix for multiplication
    private static int[][] matrix2 = new int[nRows][nRows]; //second matrix for multiplication
    private static int[][] result1 = new int[nRows][nRows]; //result from linear matrix multiplication
    private static int[][] result2 = new int[nRows][nRows]; //result from parallel matrix multiplication

    private static Thread[][] pool = new Thread[nRows][nRows]; //array of threads

    //method used for transposing a matrix to get its column easily
    public static int[][] transpose(int[][] matrix) {
        int[][] newMatrix = new int[matrix[0].length][matrix.length];
        for (int i = 0; i < matrix[0].length; i++) {
            for (int j = 0; j < matrix.length; j++) {
                newMatrix[i][j] = matrix[j][i];
            }
        }
        return newMatrix;
    }

    public static void main(String[] args) {
        //initializing input matrices (setting all elements = 1)
        for (int i = 0; i < nRows; i++) {
            for (int j = 0; j < nRows; j++) {
                matrix1[i][j] = 1;
                matrix2[i][j] = 1;
            }
        }

        long start;
        long end;

        System.out.println("Linear algorithm");
        start = System.currentTimeMillis();

        //linear multiplication algorithm
        for (int i = 0; i < nRows; i++) {
            for (int j = 0; j < nRows; j++) {
                int temp = 0;
                for (int k = 0; k < nRows; k++) {
                    temp += matrix1[i][k] * matrix2[k][j];
                }
                result1[i][j] = temp;
            }
        }

        //show result
//        for(int i=0;i


Here, I create a new thread for each element in the resulting matrix. I than write these threads to an array, start them and, finally, wait for them to finish working. I've seen some realizations, where the whole input matrix (both of them) would be given as parameters to the thread. My task is, however, to come up with an algorithm, where only one row and one column (that are necessary for this particular element) are given.
After measuring the time elapsed I get following results
Linear algorithm
Time with linear algorithm: 557
Parallel algorithm
Time with parallel algorithm: 38262

What am I doing wrong? Thanks in advance!

Hristo Iliev · Accepted Answer

The code you've written will work fine on a GPU where the concept of threads is very different and the overhead is basically zero. On CPU-based systems, spawning threads is an exceptionally slow operation and it only makes sense if you can amortise this overhead over a lot of computational work.

Here is some general advice that will help you write better parallel algorithms for CPUs:

With computationally heavy tasks, use as many threads as there are physical execution units (cores). SMT techniques such as HyperThreading do not help much unless there is a lot of memory latency. For small matrices that fit in the L1 and L2 CPU caches, the latency is very low and there is nothing to gain from SMT. When more than one thread share the same core, the OS has to context switch between the two, which adds overhead and may trash the cache.
Keep parallelisation granularity as coarse as possible so to maximise the work per thread. Instead of having one row x column operation per thread, have each thread operate on contiguous blocks of rows / columns. You can try and only parallelise the outer loop, i.e., only over the rows of the first matrix.
Keep the number of threads dependent on the hardware properties (number of cores) and independent of the problem size. Spawning a separate thread for each row and column scales the overhead linearly with the problem size, which is really bad from performance point of view.
Avoid false sharing. This happens when two or more threads running on different cores write to memory locations that fall in the same cache line. When one thread updates the cache of its core, the change propagates and invalidates the caches of the other cores that have the same cache line, forcing them to refetch the data. In your case, 16 consecutive values of result2 fall in the same cache line (cache lines on x86 and ARM are 64 bytes long, int is 4 bytes) and are written by 16 different threads. The use of a temporary summation variable alleviates this problem somehow--it is much more severe when false sharing happens repeatedly in an inner(-most) loop.
Use thread pools for repeated tasks when the number of work items exceeds the number of threads and each thread will get work multiple times. In your case, you give each thread a single work item, so this is not really pooling.

In summary, start as many threads as physical cores and have them work on big contiguous chunks of the input matrices.

Parallel matrix multiplication in java

Answers (2)

Related Questions