atbug

Reputation: 838

performance loss after vectorization in numpy

I am writing a time-consuming program. To reduce the runtime, I have tried my best to use numpy.dot instead of for loops.

However, I found the vectorized program to have much worse performance than the for-loop version:

import numpy as np
import datetime
kpt_list = np.zeros((10000,20),dtype='float')
rpt_list = np.zeros((1000,20),dtype='float')
h_r = np.zeros((20,20,1000),dtype='complex')
r_ndegen = np.zeros(1000,dtype='float')
r_ndegen.fill(1)
# setup completed
# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))/r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 19.302483
# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T))/r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 7.74583

What is happening here?

Upvotes: 6

Views: 411

Answers (2)

ali_m

Reputation: 74172

The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:

def setup(n1=10000, n2=1000, n3=20, seed=None):

    gen = np.random.RandomState(seed)
    kpt_list = gen.randn(n1, n3).astype(np.float64)
    rpt_list = gen.randn(n2, n3).astype(np.float64)
    h_r = (gen.randn(n3, n3, n2) + 1j*gen.randn(n3, n3, n2)).astype(np.complex128)
    r_ndegen = gen.randn(n2).astype(np.float64)

    return kpt_list, rpt_list, h_r, r_ndegen


def original_vec(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)

    return kpt_data


def original_loop(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)

    return kpt_data

I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.
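
To see how zero-filled inputs can mask a real bug, consider a pair of hypothetical phase functions that differ only by a sign error in the exponent. On all-zero inputs they agree exactly, but random inputs expose the mistake immediately:

def phase_ok(kpt_list, rpt_list, r_ndegen):
    return np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]

def phase_buggy(kpt_list, rpt_list, r_ndegen):
    # hypothetical bug: wrong sign in the exponent
    return np.exp(-1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]

zeros = np.zeros((10, 5)), np.zeros((8, 5)), np.ones(8)
rand = np.random.randn(10, 5), np.random.randn(8, 5), np.random.rand(8) + 1
print(np.allclose(phase_ok(*zeros), phase_buggy(*zeros)))  # True - bug hidden
print(np.allclose(phase_ok(*rand), phase_buggy(*rand)))    # False - bug exposed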


Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:

In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s

Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           def original_vec(*args, **kwargs):
    13                                           
    14         1        86498  86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15                                           
    16         1        69700  69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947 1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0     93.7      kpt_data = h_r.dot(phase)
    19                                           
    20         1            4      4.0      0.0      return kpt_data

You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
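
As a quick sanity check of that einsum notation (with small stand-in arrays so it runs instantly):

gen = np.random.RandomState(0)
a = gen.randn(4, 4, 50)   # stands in for h_r
b = gen.randn(50, 30)     # stands in for phase

# dot sums over the last axis of a and the first axis of b
print(np.allclose(a.dot(b), np.einsum('ijk,kl->ijl', a, b)))  # True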


The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:

In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop

In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop

I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, suggesting that it might not be calling multithreaded BLAS routines.
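
If you want to check which BLAS library your numpy build is linked against (and hence whether a multithreaded *gemm is even available), you can inspect the build configuration:

import numpy as np

# prints the BLAS/LAPACK libraries numpy was built against
# (e.g. OpenBLAS or MKL); a reference BLAS won't multithread
np.show_config()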


Here's a vectorized version that incorporates the reshaping operation:

def new_vec(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
    kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)

    return kpt_data.reshape(h_r.shape[:2] + (-1,))

The -1 indices tell numpy to infer the size of those dimensions according to the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
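
If the broadcasting step is unfamiliar: dividing by r_ndegen[:, None] gives exactly the same result as dividing by the tiled array, without ever materializing the (1000, 10000) divisor. A small demonstration:

gen = np.random.RandomState(0)
x = gen.randn(1000, 10000)
r_ndegen = gen.rand(1000) + 1   # offset to avoid division by zero

tiled = x / np.tile(r_ndegen.reshape(1000, 1), 10000)  # allocates a (1000, 10000) divisor
broadcast = x / r_ndegen[:, None]                      # divisor stays (1000, 1)
print(np.allclose(tiled, broadcast))  # True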

By using the same random input data, we can check that the new version gives the same result as the original:

In [4]: ans1 = original_loop(seed=0)

In [5]: ans2 = new_vec(seed=0)

In [6]: np.allclose(ans1, ans2)
Out[6]: True

Some performance benchmarks:

In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop

In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop

In [9]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop

Update:

I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact, there's an open feature request for this dating back to 2012, if any numpy devs are reading...

In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
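
Until something like that lands in numpy itself, the workaround is easy to wrap up yourself. Here's a minimal sketch (my own helper, not a numpy function) that collapses the leading axes, runs a single 2D *gemm, and restores the shape:

def dot_as_gemm(a, b):
    # a.dot(b) for an n-D `a` and a 2D `b`: flatten the leading axes of
    # `a` so the product goes through BLAS, then reshape the result
    out = a.reshape(-1, a.shape[-1]).dot(b)
    return out.reshape(a.shape[:-1] + (b.shape[-1],))

gen = np.random.RandomState(0)
a = gen.randn(4, 5, 60)
b = gen.randn(60, 7)
print(np.allclose(dot_as_gemm(a, b), a.dot(b)))  # True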


Update 2:

I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:

In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
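
Here axes=1 tells tensordot to contract the last axis of h_r with the first axis of phase, giving the (20, 20, 10000) result directly. A quick check with small stand-ins:

gen = np.random.RandomState(0)
h = gen.randn(4, 4, 50) + 1j * gen.randn(4, 4, 50)
p = gen.randn(50, 30)

td = np.tensordot(h, p, axes=1)
print(td.shape)                                                     # (4, 4, 30)
print(np.allclose(td, h.reshape(-1, 50).dot(p).reshape(4, 4, 30)))  # True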

Upvotes: 8

hashmuke

Reputation: 3325

I suspect the first operation is hitting a resource limit such as memory. You may benefit from these two questions: Efficient dot products of large memory-mapped arrays and Dot product of huge arrays in numpy.
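
The usual trick from those threads is to compute the product in blocks, so that only a slice of the big operand is resident in memory at once. A minimal sketch, assuming a 2D (possibly memory-mapped) left operand:

import numpy as np

def chunked_dot(a, b, chunk=1000):
    # a.dot(b) computed one row-block of `a` at a time, so a huge
    # (possibly memory-mapped) `a` never has to be loaded whole;
    # a sketch of the approach, not a drop-in library routine
    out = np.empty((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    for start in range(0, a.shape[0], chunk):
        out[start:start + chunk] = a[start:start + chunk].dot(b)
    return out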

Upvotes: 0
