Reputation: 1267
Hello Stack Overflow community,
I'm working with NumPy for matrix operations and I have a question regarding how NumPy handles matrix multiplication, especially when dealing with non-continuous slices of matrices.
Consider a scenario where we have a large matrix, say of size [1000, 1000], and we need to perform a matrix multiplication on a sliced version of this matrix with steps, such as [::10, ::10]. I understand that NumPy likely uses optimized BLAS routines like GEMM
for matrix multiplication under the hood. However, BLAS routines generally require contiguous memory layouts to function efficiently.
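For context, a stepped slice really is a non-contiguous view, which is easy to confirm from its flags and strides:
import numpy as np

arr = np.zeros((1000, 1000))
view = arr[::10, ::10]             # strided view, no copy is made
print(view.flags['C_CONTIGUOUS'])  # False
print(arr.strides, view.strides)   # (8000, 8) vs (80000, 80) for float64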
My question is: How does NumPy internally handle such scenarios where the input matrices for multiplication are non-contiguous due to slicing with steps? Specifically, I'm interested in understanding whether NumPy:
- copies the non-contiguous data into a contiguous buffer before calling GEMM, or
- falls back to a slower, non-BLAS multiplication routine.
This information will help me better understand the performance implications of using slices with steps in matrix multiplications in NumPy.
Thank you in advance for your insights!
Upvotes: 1
Views: 133
Reputation: 16204
np.matmul does quite a bit of work trying to figure out when it can pass off work to BLAS. The main source file implementing it is numpy/_core/src/umath/matmul.c.src; in particular, have a look at @TYPE@_matmul() and is_blasable2d().
Specifically, the comment on is_blasable2d() describes the conditions it checks:
- Strides must not alias or overlap
- The faster (second) axis must be contiguous
- The slower (first) axis stride, in unit steps, must be larger than the faster axis dimension
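A rough Python sketch of those conditions (a simplified analogue for illustration; the authoritative logic is the C code in is_blasable2d(), which also handles details like negative strides) might look like:
import numpy as np

def looks_blasable_2d(a):
    # Sketch of the checks above for a single 2-D operand.
    if a.ndim != 2:
        return False
    # Faster (second) axis must be contiguous: stride of one item.
    if a.strides[1] != a.itemsize:
        return False
    # Slower (first) axis stride, in units of items, must be at least
    # the faster axis length, so rows do not alias or overlap.
    if a.strides[0] % a.itemsize != 0:
        return False
    return a.strides[0] // a.itemsize >= a.shape[1]

arr = np.zeros((1000, 1000))
print(looks_blasable_2d(arr[::2, :500]))  # True: rows strided, but each row contiguous
print(looks_blasable_2d(arr[::2, ::2]))   # False: second axis is strided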
So your example should use the slower _noblas variants due to that second constraint, namely that the second axis is not contiguous.
As a sanity check, we can see whether the runtimes are consistent with the above observations:
import numpy as np
arr = np.zeros((1000, 1000))
%timeit arr[::2, ::2] @ arr[::2, ::2]      # takes ~300 ms
%timeit arr[::2, :500] @ arr[::2, :500]    # takes ~4 ms
%timeit arr[:500, :500] @ arr[:500, :500]  # takes ~4 ms
# as pointed out by hpaulj, the following takes ~5 ms
%timeit np.ascontiguousarray(arr[::2, ::2]) @ np.ascontiguousarray(arr[::2, ::2])
This seems consistent: the first variant has a non-contiguous second axis and is much slower, presumably because it is not using BLAS. The other variants are presumably faster because they are passed to BLAS. Making a contiguous copy takes some time, but the resulting runtime is still far lower, so copying looks worthwhile when necessary.
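If you hit this pattern often, a small helper (hypothetical, named here just for illustration) could make the copy only when an operand's fast axis is non-contiguous:
import numpy as np

def matmul_blas_friendly(a, b):
    # Copy an operand only if its last axis is not contiguous, so that
    # np.matmul has a chance to dispatch to BLAS GEMM.
    if a.strides[-1] != a.itemsize:
        a = np.ascontiguousarray(a)
    if b.strides[-1] != b.itemsize:
        b = np.ascontiguousarray(b)
    return a @ b

arr = np.zeros((1000, 1000))
out = matmul_blas_friendly(arr[::2, ::2], arr[::2, ::2])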
Upvotes: 1