maplemaple

Reputation: 1715

Cannot understand why more vectorization is slower than less vectorization in this case?

I ran into a practical problem in which more vectorization is slower than less vectorization. I have simplified it into the following toy model. In the code below, I use two different methods to implement the same functionality: f1 uses one loop, f2 uses two loops. Naively I would expect f1 to be faster than f2, but f2 is faster than f1. I can't understand why this happens.

import numpy as np
def time_function(f, *args):
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

def f2(X, Y):
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            d[i,j] = np.sum(np.square(X[i] - Y[j])) 
    return d

def f1(X, Y):
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        d[i] = np.sum(np.square(Y - X[i]), axis=1) 
    return d

X = np.ones((500, 3072))
Y = np.ones((5000, 3072))

time2 = time_function(f2,X,Y)
print('Two loop version took %f seconds' % time2)

time1 = time_function(f1,X,Y)
print('One loop version took %f seconds' % time1)
Two loop version took 24.691270 seconds
One loop version took 31.785896 seconds

Upvotes: 1

Views: 395

Answers (3)

hpaulj

Reputation: 231510

On my machine, f1 is slightly faster. But a fully vectorized version

D3=(np.square(Y[None,:,:]-X[:,None,:])).sum(axis=2)

gives a memory error because it has to create a (500, 5000, 3072) array (roughly 57 GiB).

There is a trade-off between iterations and memory management. Often a few iterations on a relatively complex task are faster than one step that requires allocating a larger array. In your case there's the Y - X difference, but it then has to square that (same large size) before the sum reduces the dimensions.
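One way to see that trade-off is a chunked middle ground (a sketch only, not something timed here; the chunk size of 10 is an arbitrary choice):

import numpy as np

def f_chunked(X, Y, chunk=10):
    # process X in blocks: the broadcast temporary is (chunk, 5000, 3072),
    # about 1.1 GiB for chunk=10, instead of the ~57 GiB needed by D3 above,
    # and the Python-level loop runs 50 times instead of 500
    d = np.empty((X.shape[0], Y.shape[0]))
    for start in range(0, X.shape[0], chunk):
        stop = start + chunk
        diff = Y[None, :, :] - X[start:stop, None, :]
        d[start:stop] = np.square(diff).sum(axis=2)
    return d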

In [23]: timeit -n1 -r1 f2(X,Y)                                                                
1min 21s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [24]: timeit -n1 -r1 f1(X,Y)                                                                
1min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

A variation on f1 that iterates over the 5000 rows of Y (instead of the 500 of X) times at 1 min 25 s. A few iterations are generally better than many, provided memory-management issues don't penalize you.
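That variation wasn't shown; reconstructed, it is presumably just f1 with the loop moved to the rows of Y:

def f1_over_Y(X, Y):
    # same arithmetic as f1, but 5000 Python-level iterations instead of 500,
    # each broadcasting against the smaller (500, 3072) X
    d = np.zeros((X.shape[0], Y.shape[0]))
    for j in range(Y.shape[0]):
        d[:, j] = np.sum(np.square(X - Y[j]), axis=1)
    return d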

numba

An iterative but compiled version using numba is noticeably better:

In [32]: import numba 
    ...: @numba.jit(nopython=True) 
    ...: def f3(X, Y): 
    ...:     d = np.zeros((X.shape[0], Y.shape[0])) 
    ...:     for i in range(X.shape[0]): 
    ...:         for j in range(Y.shape[0]): 
    ...:             d[i,j] = np.sum(np.square(X[i] - Y[j]))  
    ...:     return d 
    ...:                                                                                       
In [33]: D3=f3(X,Y)                                                                            
In [34]: np.allclose(D,D3)                                                                     
Out[34]: True
In [35]: timeit -n1 -r1 f3(X,Y)                                                                
20 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

and making the sum iteration explicit shaves off some more time:

In [36]: @numba.jit(nopython=True) 
    ...: def f4(X, Y): 
    ...:     d = np.zeros((X.shape[0], Y.shape[0])) 
    ...:     for i in range(X.shape[0]): 
    ...:         for j in range(Y.shape[0]): 
    ...:             for k in range(X.shape[1]): 
    ...:                 d[i,j] += (X[i,k] - Y[j,k])**2  
    ...:     return d 
    ...:      
    ...:                                                                                       
In [37]: D4 = f4(X,Y)                                                                          
In [38]: np.allclose(D,D4)                                                                     
Out[38]: True
In [39]: timeit -n1 -r1 f4(X,Y)                                                                
10.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Upvotes: 2

CaptainTrunky

Reputation: 1707

I suspect the culprit hides in f1:

d[i] = np.sum(np.square(Y - X[i]), axis=1) 

You subtract X[i] from the whole Y array on each iteration, which triggers broadcasting that internally has to walk all 5000 rows of Y, while each inner step of f2 subtracts just two individual rows from one another.
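To make the sizes concrete (the shapes follow from the arrays in the question; byte counts are just shape times 8 for float64):

diff = Y - X[0]                  # (5000, 3072) temporary, ~117 MiB, built 500 times in f1
sq = np.square(diff)             # another (5000, 3072) temporary
row = np.sum(sq, axis=1)         # reduced to (5000,)

small = np.square(X[0] - Y[0])   # (3072,) temporary, ~24 KiB, built 2,500,000 times in f2
val = np.sum(small)              # scalar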

Upvotes: 3

Paritosh Singh

Reputation: 6246

I see iteration in both definitions. Vectorization only pays off on large amounts of data. You're effectively vectorizing only one row's worth of values at a time; Python looping is more than capable of handling that, and the overhead of setting up the vectorized calculation is too large in comparison here. Your f1 isn't fully vectorized because of the iteration involved, so this comparison isn't fair to vectorization when iteration would suffice.

tl;dr: have enough values that the vectorization "setup" cost is offset by the gains of vectorization itself.
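A quick way to check how much of the per-row cost is fixed call overhead (my own sketch, results not shown): compare summing one 3072-element row at a time against a single vectorized call over all 5000 rows.

import time
import numpy as np

Y = np.ones((5000, 3072))

tic = time.time()
for j in range(Y.shape[0]):
    np.sum(np.square(Y[j]))      # 5000 separate small calls, each paying setup overhead
print('per-row calls:', time.time() - tic)

tic = time.time()
np.sum(np.square(Y), axis=1)     # one large call, setup overhead paid once
print('one big call: ', time.time() - tic)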

Upvotes: 0
