Reputation: 1270
Consider the following operation, where I take 20 x 20 slices of a larger matrix and compute the dot product of a 10 x 20 matrix with each slice:
import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

ans_list = []
for i in range(980):
    ans_list.append(np.dot(a, b[:, i:i+20]))
I know that NumPy parallelizes the actual matrix multiplication, but how do I parallelize the outer for loop so that the individual multiplications are run at the same time instead of sequentially?
Additionally, how would I go about it if I wanted to do the same using a GPU? Obviously, I'll use CuPy instead of NumPy, but how do I submit the multiple matrix multiplications to the GPU either simultaneously or asynchronously?
PS: Please note that the sliding windows above are just an example that generates multiple matmuls. I know one solution (shown below) in this particular case is to use NumPy's built-in sliding-window functionality, but I'm interested in the optimal way to run an arbitrary set of matmuls in parallel (optionally on a GPU), not just a faster solution for this particular example.
windows = np.lib.stride_tricks.sliding_window_view(b, (20, 20)).squeeze()
ans_list = np.dot(a, windows)
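For the general case of an arbitrary set of matmuls (not necessarily sliding windows), one sketch, assuming all right-hand operands share a shape, is to stack them into a single 3-D array and let NumPy's broadcasting batched matmul do the work in one call; the shapes below mirror the loop above:

```python
import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

# Stack the right-hand operands into one (980, 20, 20) batch array.
rhs = np.stack([b[:, i:i + 20] for i in range(980)])

# One batched call: a (10, 20) broadcasts against rhs (980, 20, 20),
# producing a (980, 10, 20) result with no Python-level loop over matmuls.
ans = a @ rhs

assert ans.shape == (980, 10, 20)
```

This dispatches the whole batch to the underlying BLAS in a single `matmul`, which is typically much faster than appending per-slice `np.dot` results in a loop.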
Upvotes: -2
Views: 125
Reputation: 21
CPU:
import numpy as np
from numpy.lib.stride_tricks import as_strided
from time import time

def timer(func):
    def func_wrapper(*args, **kwargs):
        time_start = time()
        result = func(*args, **kwargs)
        time_end = time()
        print('%s cost time: %.3f s' % (func.__name__, time_end - time_start))
        return result
    return func_wrapper

@timer
def parallel_version(a, b):
    # View b as overlapping (20, 20) windows, each starting one column
    # later than the previous one -- no data is copied.
    strides = (b.strides[1],) + b.strides
    b_block = as_strided(b, shape=(b.shape[-1] - 20 + 1, 20, 20), strides=strides)
    # Repeat a along a new batch axis so one matmul computes every product.
    a = a[np.newaxis, :, :].repeat(b.shape[-1] - 20 + 1, axis=0)
    return np.matmul(a, b_block)

@timer
def sequential_version(a, b):
    ans_list = []
    for i in range(b.shape[-1] - 20 + 1):
        ans_list.append(np.dot(a, b[:, i:i + 20]))
    return np.array(ans_list)

if __name__ == '__main__':
    a = np.random.rand(10, 20)
    b = np.random.rand(20, 1000)
    x = parallel_version(a, b)
    y = sequential_version(a, b)
    print(np.allclose(x, y))
output:
parallel_version cost time: 0.000 s
sequential_version cost time: 0.002 s
True
The code is simple; as_strided, which builds the overlapping windows as views of b without copying, is the key.
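To see what as_strided does here, a minimal 1-D example (separate from the answer's code) makes the overlapping-window view concrete:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(6)  # [0 1 2 3 4 5]

# Overlapping windows of length 3: each row starts one element later,
# so both strides equal the element stride and no data is copied.
windows = as_strided(x, shape=(4, 3), strides=(x.strides[0], x.strides[0]))
print(windows)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]
```

Note that as_strided performs no bounds checking, so a wrong shape/strides pair silently reads out-of-bounds memory; sliding_window_view is the safer wrapper for this pattern.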
Upvotes: 0
Reputation: 79188
I would propose using sliding_window_view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from time import time

def timer(func):
    def func_wrapper(*args, **kwargs):
        time_start = time()
        result = func(*args, **kwargs)
        time_end = time()
        print('%s cost time: %.3f s' % (func.__name__, time_end - time_start))
        return result
    return func_wrapper

@timer
def sequential_version(a, b):
    ans_list = []
    n = b.shape[0]
    for i in range(b.shape[1] - n + 1):
        ans_list.append(np.dot(a, b[:, i:i+n]))
    return np.array(ans_list)

@timer
def parallel2(a, b):
    # Multiply once, then gather the sliding column windows of the product:
    # a @ b[:, i:i+n] equals (a @ b)[:, i:i+n], so a single matmul suffices.
    i1 = sliding_window_view(np.arange(b.shape[1]), b.shape[0])
    return (a @ b)[:, i1].transpose(1, 0, 2)

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)
np.isclose(sequential_version(a, b), parallel2(a, b), 1e-10, 1e-10).all()
sequential_version cost time: 0.007 s
parallel2 cost time: 0.000 s
True
Note that the two methods produce the same results.
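The reason they agree is the identity exploited by parallel2: slicing columns commutes with left-multiplication, so the product of a with a window of b is just the corresponding window of a @ b. A minimal check (with hypothetical shapes matching the question):

```python
import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

# Taking columns i..i+19 of b before or after multiplying by a
# yields the same block, so one matmul covers every window.
i = 37
lhs = a @ b[:, i:i + 20]
rhs = (a @ b)[:, i:i + 20]
assert np.allclose(lhs, rhs)
```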
Try bigger matrices:
a = np.random.rand(100, 200)
b = np.random.rand(200, 1000)
np.isclose(sequential_version(a,b), parallel2(a,b), 1e-10, 1e-10).all()
sequential_version cost time: 0.259 s
parallel2 cost time: 0.031 s
True
Upvotes: -1