anonymous1a

Reputation: 1270

How do I parallelize a set of matrix multiplications

Consider the following operation, where I take 20 x 20 slices of a larger matrix and take the dot product of a 10 x 20 matrix with each slice:

import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

ans_list = []

for i in range(981):  # 1000 - 20 + 1 sliding windows
    ans_list.append(
        np.dot(a, b[:, i:i+20])
    )

I know that NumPy parallelizes the actual matrix multiplication, but how do I parallelize the outer for loop so that the individual multiplications are run at the same time instead of sequentially?

Additionally, how would I go about it if I wanted to do the same using a GPU? Obviously, I'll use CuPy instead of NumPy, but how do I submit the multiple matrix multiplications to the GPU either simultaneously or asynchronously?
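For the GPU case, here is roughly the kind of batched call I have in mind (just a sketch assuming cupy.matmul broadcasts over a stacked batch the way NumPy's matmul does; I haven't confirmed this is the recommended way to submit the work):

import cupy as cp

a_gpu = cp.asarray(a)   # copy the host arrays to the GPU once
b_gpu = cp.asarray(b)

# Stack the 20 x 20 windows into a single (981, 20, 20) batch and issue one
# broadcasted matmul; CuPy kernel launches are asynchronous with respect to
# the host, so this queues all the work without blocking Python.
windows_gpu = cp.stack([b_gpu[:, i:i + 20] for i in range(981)])
ans_gpu = cp.matmul(a_gpu, windows_gpu)   # shape (981, 10, 20)

ans = cp.asnumpy(ans_gpu)   # copy back only when the result is needed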


PS: Please note that the sliding windows above are just an example that generates multiple matmuls. I know one solution in this particular case (shown below) is to use NumPy's built-in sliding-window functionality, but I'm interested in the optimal way to run an arbitrary set of matmuls in parallel (optionally on a GPU), not just a faster solution for this particular example.

windows = np.lib.stride_tricks.sliding_window_view(b, (20, 20)).squeeze()  # (981, 20, 20)
ans_list = np.matmul(a, windows)  # matmul broadcasts a over the window axis -> (981, 10, 20)

Upvotes: -2

Views: 125

Answers (2)

chengyongru

Reputation: 21

CPU:

import numpy as np
from numpy.lib.stride_tricks import as_strided


def timer(func):
    def func_wrapper(*args, **kwargs):
        from time import time
        time_start = time()
        result = func(*args, **kwargs)
        time_end = time()
        time_spend = time_end - time_start
        print('%s cost time: %.3f s' % (func.__name__, time_spend))
        return result

    return func_wrapper


@timer
def parallel_version(a, b):
    strides = (b.strides[1],) + b.strides

    b_block = as_strided(b, shape=(b.shape[-1] - 20 + 1, 20, 20), strides=strides)
    a = a[np.newaxis, :, :].repeat(b.shape[-1] - 20 + 1, axis=0)

    return np.matmul(a, b_block)


@timer
def sequential_version(a, b):
    ans_list = []
    for i in range(b.shape[-1] - 20 + 1):
        ans_list.append(
            np.dot(a, b[:, i:i + 20])
        )
    return np.array(ans_list)


if __name__ == '__main__':
    a = np.random.rand(10, 20)
    b = np.random.rand(20, 1000)
    x = parallel_version(a, b)
    y = sequential_version(a, b)
    print(np.allclose(x, y))

output:

parallel_version cost time: 0.000 s
sequential_version cost time: 0.002 s
True

The code is simple; as_strided is the key: it builds a (981, 20, 20) view of b without copying, so a single matmul call covers every window.
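As a side note (not part of the timing above, and assuming NumPy's standard matmul broadcasting rules), the explicit repeat of a is not strictly needed, since matmul broadcasts a 2-D operand against a 3-D batch; this variant can be dropped in alongside the functions above:

@timer
def parallel_version_broadcast(a, b):
    # Same windowed view as above, but let matmul broadcast `a`
    # against the (981, 20, 20) batch instead of repeating it.
    strides = (b.strides[1],) + b.strides
    b_block = as_strided(b, shape=(b.shape[-1] - 20 + 1, 20, 20), strides=strides)
    return np.matmul(a, b_block)  # shape (981, 10, 20)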

Upvotes: 0

Onyambu

Reputation: 79188

I would propose using sliding_window_view:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def timer(func):
    def func_wrapper(*args, **kwargs):
        from time import time
        time_start = time()
        result = func(*args, **kwargs)
        time_end = time()
        time_spend = time_end - time_start
        print('%s cost time: %.3f s' % (func.__name__, time_spend))
        return result

    return func_wrapper


@timer
def sequential_version(a, b):
    ans_list = []
    n = b.shape[0]
    for i in range(b.shape[1] - n + 1):
        ans_list.append(
            np.dot(a, b[:, i:i+n])
        )
    return np.array(ans_list)
  
  
@timer
def parallel2(a, b):
    # a @ b[:, i:i+n] equals columns i..i+n-1 of a @ b, so compute the full
    # product once and gather it with sliding windows of column indices.
    i1 = sliding_window_view(np.arange(b.shape[1]), b.shape[0])
    return (a @ b)[:, i1].transpose(1, 0, 2)


a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)
np.isclose(sequential_version(a,b), parallel2(a,b), 1e-10, 1e-10).all()

output:

sequential_version cost time: 0.007 s
parallel2 cost time: 0.000 s
True

Note that the two methods produce the same results.
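The reason the two agree (just a quick check of the identity parallel2 relies on, not part of the original timing):

import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)
i, n = 3, 20
# Slicing columns of b commutes with left-multiplication by a,
# so each windowed product is just a window of the full product.
print(np.allclose(a @ b[:, i:i+n], (a @ b)[:, i:i+n]))  # True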

Try bigger matrices:

a = np.random.rand(100, 200)
b = np.random.rand(200, 1000)
np.isclose(sequential_version(a,b), parallel2(a,b), 1e-10, 1e-10).all()

output:

sequential_version cost time: 0.259 s
parallel2 cost time: 0.031 s
True

Upvotes: -1
