kdduha

Reputation: 11

Why is Strassen's algorithm slower than the usual matrix multiplication?

I'm trying to figure out how to multiply matrices really fast in Python without using NumPy. For this reason, I've implemented the Strassen algorithm and compared it with the standard loop-based multiplication. I only compare square matrices of size NxN where N is 2^k. Surprisingly, my Strassen implementation is about 3.5 times slower than the standard one (I use perf_counter from the time module for timing). As an example, the average values are approximately as follows:

matrix size    Strassen        standard mult
16x16          0.006 sec       0.002 sec
32x32          0.036 sec       0.013 sec
64x64          0.26 sec        0.07 sec
128x128        1.69 sec        0.49 sec
1024x1024      771.42 sec      221.09 sec

I generated the matrix elements with randint(1, 9), but in any case I started timing only after creating the test matrices and stopped before printing the result matrix with loops, so I only timed my functions. And yes, I've seen other posts about the same problem, like this one:

Why is Strassen matrix multiplication so much slower than standard matrix multiplication?

but honestly I can't say it was very helpful for me. Also, it would be really nice to optimize my current algorithm by changing something in it rather than writing a new one.

Strassen algorithm

from time import perf_counter

# n, matrix1 and matrix2 are assumed to be created beforehand
t_start = perf_counter()

def submatrices(n, matrix):  # split the matrix into four quadrant blocks
    h = n // 2
    A = [row[:h] for row in matrix[:h]]
    B = [row[h:] for row in matrix[:h]]
    C = [row[:h] for row in matrix[h:]]
    D = [row[h:] for row in matrix[h:]]
    return [A, B, C, D]


def addition(n, matrix1, matrix2):  # just addition
    res = [[matrix1[i][j] + matrix2[i][j] for j in range(n)] for i in range(n)]
    return res


def subtraction(n, matrix1, matrix2):   # just subtraction
    res = [[matrix1[i][j] - matrix2[i][j] for j in range(n)] for i in range(n)]
    return res


def strassen(n, matrix1, matrix2):

    if n == 2:  # base case: plain standard multiplication
        xy = [[0] * n for i in range(n)]
        for i in range(n):
            for j in range(n):
                for x in range(n):
                    xy[i][j] += matrix1[i][x] * matrix2[x][j]
    else:
        A, B, C, D = submatrices(n, matrix1)    # divide the original matrix1
        E, F, G, H = submatrices(n, matrix2)    # divide the original matrix2

        n //= 2  # the block size halves at each level

        p1 = strassen(n, A, subtraction(n, F, H))
        p2 = strassen(n, addition(n, A, B), H)
        p3 = strassen(n, addition(n, C, D), E)
        p4 = strassen(n, D, subtraction(n, G, E))
        p5 = strassen(n, addition(n, A, D), addition(n, E, H))
        p6 = strassen(n, subtraction(n, B, D), addition(n, G, H))
        p7 = strassen(n, subtraction(n, A, C), addition(n, E, F))

        xy1 = addition(n, addition(n, p5, p6), subtraction(n, p4, p2))  # making new blocks of matrix
        xy2 = addition(n, p1, p2)
        xy3 = addition(n, p3, p4)
        xy4 = subtraction(n, addition(n, p1, p5), addition(n, p3, p7))

        xy = [xy1[i] + xy2[i] for i in range(n)] + [xy3[i] + xy4[i] for i in range(n)]  # assembling a matrix of blocks
    return xy

result = strassen(n, matrix1, matrix2)

print(f'Time: {perf_counter() - t_start} sec')

# printing result
for row in result:
    print(*row)

Standard multiplication

from time import perf_counter

t_start = perf_counter()


def multiply(n, matrix1, matrix2):
    res = [[0]*n for i in range(n)]
    for i in range(n):
        for j in range(n):
            for x in range(n):
                res[i][j] += matrix1[i][x] * matrix2[x][j]

    return res

result = multiply(n, matrix1, matrix2)

print(f'Time: {perf_counter() - t_start} sec')

for row in result:
    print(*row)

Upvotes: 1

Views: 994

Answers (1)

Jérôme Richard

Reputation: 50816

Both are horribly inefficient (certainly at least 3-4 orders of magnitude slower than optimized implementations). The standard implementation of Python is the CPython interpreter, which is clearly not designed for this kind of computation. It is mainly meant to execute glue code calling C functions, like the ones of BLAS libraries. In practice, accessing a list with lst[i][j] causes many function calls, memory indirections, object allocations/destructions, etc. All these overheads are huge compared to the same operations in a natively compiled language (like C/C++), and they are also hard to track without a good understanding of the interpreter (another Python implementation will certainly give completely different results). One issue with the Strassen implementation is recursion: recursion is pretty slow in CPython (you can write a naive recursive Fibonacci implementation to easily check that). Another big issue is the creation of the many temporary sub-lists of lists, which should be very slow too (a lot of objects need to be allocated).
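The recursion overhead is easy to observe with exactly that Fibonacci check. A minimal sketch (the function and the timed argument are just illustrative):

```python
from time import perf_counter

def fib(n):
    # Naive recursion: every call pays for interpreter frame creation,
    # argument handling and integer object management.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

t0 = perf_counter()
fib(25)  # roughly 240k calls doing almost no arithmetic
print(f'fib(25): {perf_counter() - t0:.3f} sec')
```

The call overhead, not the additions, dominates the runtime here, which is the same effect hitting the deep Strassen recursion.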

If you want to compare such computationally intensive algorithms, please use a natively compiled language. The outcome of such a comparison in Python will be irrelevant in other languages and very dependent on the details of the target Python implementation. Note that Strassen tends to be efficient only for huge matrices (generally much bigger than 1024x1024).
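The same crossover logic applies inside the algorithm itself: practical Strassen implementations stop recursing well above 2x2 and fall back to the naive product, which also removes most of the recursion and temporary-list cost in your version. A self-contained sketch under that assumption (the CUTOFF value is a guess to tune empirically, and the helper names are illustrative):

```python
CUTOFF = 64  # below this size the plain triple loop tends to win

def multiply(n, m1, m2):
    # schoolbook O(n^3) product used below the cutoff
    res = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for x in range(n):
                s += m1[i][x] * m2[x][j]
            res[i][j] = s
    return res

def add(n, a, b):
    return [[a[i][j] + b[i][j] for j in range(n)] for i in range(n)]

def sub(n, a, b):
    return [[a[i][j] - b[i][j] for j in range(n)] for i in range(n)]

def split(n, m):
    h = n // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def strassen(n, m1, m2):
    if n <= CUTOFF:                 # was: if n == 2
        return multiply(n, m1, m2)
    h = n // 2
    A, B, C, D = split(n, m1)
    E, F, G, H = split(n, m2)
    p1 = strassen(h, A, sub(h, F, H))
    p2 = strassen(h, add(h, A, B), H)
    p3 = strassen(h, add(h, C, D), E)
    p4 = strassen(h, D, sub(h, G, E))
    p5 = strassen(h, add(h, A, D), add(h, E, H))
    p6 = strassen(h, sub(h, B, D), add(h, G, H))
    p7 = strassen(h, sub(h, A, C), add(h, E, F))
    # assemble the four result blocks row by row
    top = [l + r for l, r in zip(add(h, add(h, p5, p6), sub(h, p4, p2)),
                                 add(h, p1, p2))]
    bot = [l + r for l, r in zip(add(h, p3, p4),
                                 sub(h, add(h, p1, p5), add(h, p3, p7)))]
    return top + bot
```

This only mitigates the constant factors; the interpreter overhead discussed above remains.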

Besides, a first step could be to use NumPy to create views of the sub-matrices with a much smaller overhead. The operations on matrices should also be far faster, and even simpler to implement.
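A sketch of what that could look like (the function name and the cutoff are illustrative choices, and since the @ operator already dispatches to an optimized BLAS, this demonstrates views rather than a way to beat np.matmul):

```python
import numpy as np

def strassen_np(m1, m2):
    n = m1.shape[0]
    if n <= 64:          # fall back to the built-in (BLAS-backed) product
        return m1 @ m2
    h = n // 2
    # Basic slicing returns views, not copies, so splitting is nearly free.
    A, B = m1[:h, :h], m1[:h, h:]
    C, D = m1[h:, :h], m1[h:, h:]
    E, F = m2[:h, :h], m2[:h, h:]
    G, H = m2[h:, :h], m2[h:, h:]
    p1 = strassen_np(A, F - H)
    p2 = strassen_np(A + B, H)
    p3 = strassen_np(C + D, E)
    p4 = strassen_np(D, G - E)
    p5 = strassen_np(A + D, E + H)
    p6 = strassen_np(B - D, G + H)
    p7 = strassen_np(A - C, E + F)
    out = np.empty((n, n), dtype=m1.dtype)
    out[:h, :h] = p5 + p6 + p4 - p2
    out[:h, h:] = p1 + p2
    out[h:, :h] = p3 + p4
    out[h:, h:] = p1 + p5 - p3 - p7
    return out
```

The whole-array additions and subtractions run in C, so the per-element interpreter overhead of the list-of-lists version disappears.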

Upvotes: 1
