musako

Reputation: 1267

How Does NumPy Internally Handle Matrix Multiplication with Non-contiguous Slices?

Hello Stack Overflow community,

I'm working with NumPy for matrix operations, and I have a question about how NumPy handles matrix multiplication, especially when dealing with non-contiguous slices of matrices.

Consider a scenario where we have a large matrix, say of size [1000, 1000], and we need to perform a matrix multiplication on a strided slice of this matrix, such as [::10, ::10]. I understand that NumPy likely uses optimized BLAS routines like GEMM for matrix multiplication under the hood. However, BLAS routines generally require contiguous memory layouts to operate efficiently.

My question is: how does NumPy internally handle cases where the input matrices for multiplication are non-contiguous due to slicing with steps? Specifically, I'm interested in understanding whether NumPy:

  1. Automatically reallocates these slices to a new contiguous memory block and then performs GEMM.
  2. Has an optimized way to handle non-contiguous slices without reallocating memory.
  3. Uses any specific variant of BLAS routines or NumPy's own implementation to handle such cases.

This information will help me better understand the performance implications of using slices with steps in matrix multiplications in NumPy.
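For context, I can confirm that such a slice is a non-contiguous view (not a copy) by inspecting its flags and strides; the exact stride values below assume a float64 array:

```python
import numpy as np

arr = np.zeros((1000, 1000))   # float64: row stride 8000 bytes, element stride 8
view = arr[::10, ::10]         # strided view, no data is copied

# The view is not C-contiguous: consecutive row elements are 80 bytes apart
print(view.flags['C_CONTIGUOUS'])  # False
print(view.strides)                # (80000, 80)
```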

Thank you in advance for your insights!

Upvotes: 1

Views: 133

Answers (1)

Sam Mason

Reputation: 16204

np.matmul does quite a bit of work trying to figure out when it can pass off work to BLAS. The main source file implementing it is numpy/_core/src/umath/matmul.c.src, specifically have a look at @TYPE@_matmul() and is_blasable2d().

Per the comment on is_blasable2d, the function checks that:

  1. The strides must not alias or overlap
  2. The faster (second) axis must be contiguous
  3. The slower (first) axis stride, in unit steps, must be at least as large as the faster axis dimension
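Those conditions can be sketched in Python against an array's strides. This is only an illustrative approximation (the function name is mine), not NumPy's actual C implementation, and it omits the aliasing check and BLAS's integer-size limit:

```python
import numpy as np

def looks_blasable_2d(a):
    # Rough sketch of the checks described above; NumPy's real
    # is_blasable2d is in C and also rejects aliased/overlapping strides.
    row_stride, col_stride = a.strides
    return (
        col_stride == a.itemsize                    # faster axis contiguous
        and row_stride % a.itemsize == 0
        and row_stride // a.itemsize >= a.shape[1]  # slower-axis stride spans a full row
    )

arr = np.zeros((1000, 1000))
print(looks_blasable_2d(arr[::2, ::2]))   # False: second axis is strided
print(looks_blasable_2d(arr[::2, :500])) # True: each row is contiguous
print(looks_blasable_2d(arr))            # True
```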

So your example should fall back to the slower _noblas variants due to the second constraint: its second axis is not contiguous.

As a sanity check, let's see whether the runtimes are consistent with the above (run in IPython, hence the %timeit magics):

import numpy as np

arr = np.zeros((1000, 1000))

%timeit arr[::2,::2] @ arr[::2,::2]     # takes ~300ms
%timeit arr[::2,:500] @ arr[::2,:500]   # takes ~  4ms
%timeit arr[:500,:500] @ arr[:500,:500] # takes ~  4ms

# as pointed out by hpaulj, the following takes ~  5ms
%timeit np.ascontiguousarray(arr[::2,::2]) @ np.ascontiguousarray(arr[::2,::2])

This seems consistent: the first variant has a non-contiguous second axis and is much slower, presumably because it can't use BLAS. The other variants are presumably faster because the work is handed off to BLAS. Making a contiguous copy takes some time, but the overall runtime is still much lower, so copying looks worthwhile when necessary.

Upvotes: 1
