Reputation: 11
We have two arrays A and B, where
A.shape  # (T, N)
B.shape  # (N, N, T)
and we are interested in calculating the value mult, which I am currently computing with a for loop:
mult = 0
for t in range(T):
    mult += np.matmul(np.matmul(A[t, :].T, B[:, :, t]), A[t, :])
I have already found a similar problem that I am confident can help in finding the solution (Is there a numpy/scipy dot product, calculating only the diagonal entries of the result?), but I am unable to come up with a clean solution without using the for loop.
Is there a way of using only basic numpy operations to get the desired result?
Upvotes: 1
Views: 64
Reputation: 35094
It looks like you can get away with a call to np.einsum:
mult = np.einsum('ij, jki, ik ->', A, B, A)
What this does is compute the summation
sum_{i, j, k} A[i, j] * B[j, k, i] * A[i, k]
which is what you need.
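To sanity-check that the einsum expression matches the loop, here is a quick sketch with small random arrays (the sizes are arbitrary, chosen only for testing):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 4, 3  # arbitrary small sizes for checking
A = rng.random((T, N))
B = rng.random((N, N, T))

# loop version from the question
loop_result = 0.0
for t in range(T):
    loop_result += np.matmul(np.matmul(A[t, :], B[:, :, t]), A[t, :])

# einsum version: sums A[i, j] * B[j, k, i] * A[i, k] over all i, j, k
einsum_result = np.einsum('ij, jki, ik ->', A, B, A)

assert np.allclose(loop_result, einsum_result)
```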
Note that in your original code A[t, :].T is a 1d array, so transposition is a no-op; it's just A[t, :].
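For example (a quick illustration, not in the original code):

```python
import numpy as np

v = np.arange(3)                 # 1d array, shape (3,)
assert v.T.shape == v.shape      # transposing a 1d array changes nothing
assert np.shares_memory(v, v.T)  # same data, not even a copy
```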
Also note that np.einsum might be concise, but it's often slower than the corresponding calls to np.dot or elementwise operations. So you might get other answers that end up being faster; I haven't timed this approach.
Case in point: there's a direct multiplication alternative making use of array broadcasting:
mult = (A[:, None, :] * B.T * A[..., None]).sum()
Now, there's a lot of leeway here: we could choose to transpose A instead, or only reorder some of the indices of B, and so on. The corresponding performance will depend on your use case, because memory access ends up being highly non-trivial (which impacts caching, which is mostly where vectorisation speedup comes from).
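To see why the broadcast product has the right shape, a quick check (array sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 4, 3
A = rng.random((T, N))
B = rng.random((N, N, T))

# the three factors broadcast to a common (T, N, N) shape
print(A[:, None, :].shape)  # (4, 1, 3)
print(B.T.shape)            # (4, 3, 3)
print(A[..., None].shape)   # (4, 3, 1)

mult = (A[:, None, :] * B.T * A[..., None]).sum()
assert np.allclose(mult, np.einsum('ij, jki, ik ->', A, B, A))
```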
For example, for T = 10 and N = 100 and uniform random matrices we get these timings:
>>> %timeit loopy(A, B)
... %timeit einsummed(A, B)
... %timeit mulled(A, B)
179 µs ± 591 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
304 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
309 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
(loopy() is your code, einsummed() is my first solution and mulled() is my second.) So both einsum and the direct summation end up being almost twice as slow. This is because your loop goes over a "small" index of size 10, where looping often outperforms vectorised solutions.
In contrast, if we choose T = 100 and N = 10 we get
>>> %timeit loopy(A, B)
... %timeit einsummed(A, B)
... %timeit mulled(A, B)
231 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
37.9 µs ± 265 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
38.1 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Now my two approaches are almost six times faster.
Whether either of these is faster than your loop depends on your real problem. Time them and see.
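If you want to reproduce the comparison on your own data, a minimal timing harness might look like the following (the function names loopy, einsummed and mulled follow the answer; the timeit setup is a sketch, and the sizes below match the second benchmark):

```python
import timeit
import numpy as np

def loopy(A, B):
    # the question's original loop
    mult = 0.0
    for t in range(A.shape[0]):
        mult += np.matmul(np.matmul(A[t, :], B[:, :, t]), A[t, :])
    return mult

def einsummed(A, B):
    # first answer: single einsum contraction
    return np.einsum('ij, jki, ik ->', A, B, A)

def mulled(A, B):
    # second answer: broadcasting plus a full sum
    return (A[:, None, :] * B.T * A[..., None]).sum()

T, N = 100, 10
rng = np.random.default_rng(42)
A = rng.random((T, N))
B = rng.random((N, N, T))

for fn in (loopy, einsummed, mulled):
    elapsed = timeit.timeit(lambda: fn(A, B), number=100)
    print(f'{fn.__name__}: {elapsed / 100 * 1e6:.1f} us per call')
```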
Upvotes: 3