Reputation: 79
I try to implement the Cannon's algorithm of matrix multiplication. I read description on the wikipedia that provides next pseudocode:
row i of matrix a is circularly shifted by i elements to the left.
col j of matrix b is circularly shifted by j elements up.
Repeat n times:
p[i][j] multiplies its two entries and adds to running total.
circular shift each row of a 1 element left
circular shift each col of b 1 element up
and I implemented it on the C# next way:
public static void ShiftLeft(int[][] matrix, int i, int count)
{
int ind = 0;
while (ind < count)
{
int temp = matrix[i][0];
int indl = matrix[i].Length - 1;
for (int j = 0; j < indl; j++)
matrix[i][j] = matrix[i][j + 1];
matrix[i][indl] = temp;
ind++;
}
}
public static void ShiftUp(int[][] matrix, int j, int count)
{
int ind = 0;
while (ind < count)
{
int temp = matrix[0][j];
int indl = matrix.Length - 1;
for (int i = 0; i < indl; i++)
matrix[i][j] = matrix[i + 1][j];
matrix[indl][j] = temp;
ind++;
}
}
public static int[][] Cannon(int[][] A, int[][] B)
{
int[][] C = new int[A.Length][];
for (int i = 0; i < C.Length; i++)
C[i] = new int[A.Length];
for (int i = 0; i < A.Length; i++)
ShiftLeft(A, i, i);
for (int i = 0; i < B.Length; i++)
ShiftUp(B, i, i);
for (int k = 0; k < A.Length; k++)
{
for (int i = 0; i < A.Length; i++)
{
for (int j = 0; j < B.Length; j++)
{
var m = (i + j + k) % A.Length;
C[i][j] += A[i][m] * B[m][j];
ShiftLeft(A, i, 1);
ShiftUp(B, j, 1);
}
}
};
return C;
}
this code return correct result, but do it very slowly. Much slowly even than naive algorithm of matrix multiplication.
For matrix 200x200 I got that result:
00:00:00.0490432 //naive algorithm
00:00:07.1397479 //Cannon's algorithm
What I am doing wrong?
Thanks SergeySlepov, it was bad attempt to do it parallel. When I back to sequential implementation I got next result:
Count Naive Cannon's
200 00:00:00.0492098 00:00:08.0465076
250 00:00:00.0908136 00:00:22.3891375
300 00:00:00.1477764 00:00:58.0640621
350 00:00:00.2639114 00:01:51.5545524
400 00:00:00.4323984 00:04:50.7260942
okay, it's not a parallel implementation, but how can I do it correctly?
Upvotes: 1
Views: 2905
Reputation: 2111
Cannon's algorithm was built for a 'Distributed Memory Machine' (a grid of processors, each with its own memory). This is very different to the hardware you're running it on (a few processors with shared memory) and that is why you're not seeing any increase in performance.
The 'circular shifts' in the pseudocode that you quoted actually mimic data transfers between processors. After the initial matrix 'skewing', each processor in the grid keeps track of three numbers (a, b and c) and executes pseudocode similar to this:
c += a * b;
pass 'a' to the processor to your left (wrapping around)
pass 'b' to the processor to 'above' you (wrapping around)
wait for the next iteration of k
We could mimic this behaviour on a PC using NxN threads but the overhead of context switching (or spawning Task
s) would kill all the joy. To make the most of a PC's 4 (or so) CPUs we could make the loop over i
parallel. The loop over k
needs to be sequential (unlike your solution), otherwise you might face racing conditions as each iteration of k
modifies the matrices A, B and C. In a 'distributed memory machine' race conditions are not a problem as processors do not share any memory.
Upvotes: 1