Alexis Rosuel

Reputation: 593

numpy - einsum vs naive implementation runtime performance

I have a two-dimensional array Y of shape (N, M), for instance:

N, M = 200, 100
Y = np.random.normal(0,1,(N,M))

For each of the N rows, I want to compute the product of the (M,1) column vector with its transpose, which yields an (M,M) matrix. One inefficient way to do it is:

Y = Y[:,:,np.newaxis]
[Y[i,:,:] @ Y[i,:,:].T for i in range(N)]

which is quite slow: timeit on the second line returns

11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) 

I thought a much better way to do it would be to use numpy's einsum function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):

np.einsum('ijk,imk->ijm', Y, Y, optimize=True)

(which means: for each row i, create a (j,m) matrix whose elements result from the dot product over the last axis k)
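As a quick sanity check that the two approaches agree (a minimal sketch with fresh names Y2/Y3, since Y was reassigned above):

import numpy as np

N, M = 200, 100
Y2 = np.random.normal(0, 1, (N, M))   # fresh 2-D array, rerunning the setup
Y3 = Y2[:, :, np.newaxis]
loop_res = np.stack([Y3[i, :, :] @ Y3[i, :, :].T for i in range(N)])
einsum_res = np.einsum('ijk,imk->ijm', Y3, Y3, optimize=True)
print(np.allclose(loop_res, einsum_res))  # True: both give (N, M, M)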

The two methods do return exactly the same result, but the runtime of this new version is disappointing (only a bit more than twice as fast):

3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 

One would expect a much bigger improvement from the vectorized einsum function, since the first method is very inefficient... Do you have an explanation for this? Is there a better way to do this calculation?

Upvotes: 0

Views: 345

Answers (2)

FBruzzesi

Reputation: 6485

This is an old post, yet it covers the subject in great detail: efficient outer product.

In particular, if you don't mind adding a numba dependency, that may be your fastest option.

Updating part of the numba code from the original post and adding the multi outer product:

import numpy as np
from numba import jit
from numba.typed import List

@jit(nopython=True)
def outer_numba(a, b):
    # outer product of two 1-D vectors: result[i, j] = a[i] * b[j]
    m = a.shape[0]
    n = b.shape[0]
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i] * b[j]
    return result

@jit(nopython=True)
def multi_outer_numba(Y):
    # outer product of each row of Y with itself, collected in a numba typed list
    all_result = List()
    for k in range(Y.shape[0]):
        y = Y[k]
        n = y.shape[0]
        tmp_res = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                tmp_res[i, j] = y[i] * y[j]
        all_result.append(tmp_res)

    return all_result

r = [outer_numba(Y[i], Y[i]) for i in range(N)]  # one jitted call per row, loop in Python
r = multi_outer_numba(Y)                         # single jitted call, loop compiled by numba
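Note that the first call pays the JIT compilation cost, so warm up before timing. A minimal timing sketch (assuming the 2-D Y from the question):

import timeit

multi_outer_numba(Y)  # warm-up call: triggers JIT compilation

t = timeit.timeit(lambda: multi_outer_numba(Y), number=100) / 100
print(f"{t * 1e3:.2f} ms per call")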

Upvotes: 0

hpaulj

Reputation: 231395

In [60]: N, M = 200, 100
    ...: Y = np.random.normal(0,1,(N,M))

In [61]: Y1 = Y[:,:,None]

Your iteration, 200 steps to produce (100,100) arrays:

In [62]: timeit [Y1[i,:,:]@Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

einsum is only modestly faster:

In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

but you could apply the @ in full 'batch' mode with:

In [65]: timeit Y[:,:,None]@Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But as Divakar notes, the sum axis is size 1, so you can use a plain broadcasted multiply. This is an outer product, not a matrix product.

In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
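Since the summed axis has size 1, each "matrix product" reduces to a single multiplication, so the two forms give identical results (a quick check, reusing the 2-D Y from above):

batch = Y[:, :, None] @ Y[:, None, :]   # batched matmul, shape (N, M, M)
outer = Y[:, :, None] * Y[:, None, :]   # broadcasted multiply, same shape
print(np.allclose(batch, outer))        # True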

'Vectorizing' gives big gains when doing many iterations of a simple operation. For fewer iterations of a more complex operation, the gain isn't as great.

Upvotes: 1
