Multi threaded matrix multiplication performance issue

Question

I am using java for multi threaded multiplication. I am practicing multi threaded programming. Following is the code that I took from another post of stackoverflow.

public class MatMulConcur {

private final static int NUM_OF_THREAD =1 ;
private static Mat matC;

public static Mat matmul(Mat matA, Mat matB) {
matC = new Mat(matA.getNRows(),matB.getNColumns());
return mul(matA,matB);
}

private static Mat mul(Mat matA,Mat matB) {

int numRowForThread;
int numRowA = matA.getNRows();
int startRow = 0;

Worker[] myWorker = new Worker[NUM_OF_THREAD];

for (int j = 0; j < NUM_OF_THREAD; j++) {
    if (j



I ran this program for 1,10,20,...,100 threads but performance is decreasing instead. Following is the time table


Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds 
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds 
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds 
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds


Any Idea?

Stephen C · Accepted Answer

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so¹.

There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.

A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.

There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.

There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1 2 and 5 ... depending on how many cores are used, how big the CPUs memory caches are, how big the matrix is, and other factors.

^{1 - Indeed, if this was true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.}

Multi threaded matrix multiplication performance issue

Answers (2)

Related Questions