Reputation: 1715
I ran into a practical problem in which more vectorization is slower than less vectorization. I've simplified it into the following toy model.
In the code below, I use two different methods to implement the same functionality: f1 uses one loop, f2 uses two loops. Naively we would expect f1 to be faster than f2. However, f2 is faster than f1. I can't understand why this happens.
import numpy as np

def time_function(f, *args):
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

def f2(X, Y):
    # two loops: one scalar distance per (i, j) pair
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            d[i, j] = np.sum(np.square(X[i] - Y[j]))
    return d

def f1(X, Y):
    # one loop: a whole row of distances per i, via broadcasting
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        d[i] = np.sum(np.square(Y - X[i]), axis=1)
    return d

X = np.ones((500, 3072))
Y = np.ones((5000, 3072))

time2 = time_function(f2, X, Y)
print('Two loop version took %f seconds' % time2)
time1 = time_function(f1, X, Y)
print('One loop version took %f seconds' % time1)
Two loop version took 24.691270 seconds
One loop version took 31.785896 seconds
Upvotes: 1
Views: 395
Reputation: 231510
On my machine, f1 is slightly faster. But a fully vectorized version,
D3 = np.square(Y[None,:,:] - X[:,None,:]).sum(axis=2)
gives a memory error, because it has to create a (500, 5000, 3072) array (57 GB).
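One way to get full vectorization without that allocation (my addition, not part of the original answer) is to expand the square, ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, so the largest temporary is the (500, 5000) matrix product itself:

def f_dot(X, Y):
    # ||x - y||**2 = ||x||**2 - 2*x.y + ||y||**2; the big temporary is
    # X @ Y.T with shape (500, 5000), not (500, 5000, 3072)
    return (np.square(X).sum(axis=1)[:, None]
            - 2 * (X @ Y.T)
            + np.square(Y).sum(axis=1))

Because the arithmetic is reassociated, the result can differ from f1/f2 in the last few bits, so compare with np.allclose rather than exact equality.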
There is a trade-off between iteration and memory management. Often a few iterations over a relatively complex task are faster than one step that requires allocating a larger array. In your case there's the Y - X difference, but then it also has to do the square (at the same large size) before reducing the dimensions with the sum.
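To make that trade-off concrete, here's a sketch (mine, not from the answer) that broadcasts over a block of X rows at a time, so each step allocates a (block, 5000, 3072) temporary (about 1.1 GiB for block=10) instead of the full 57 GB:

def f_chunked(X, Y, block=10):
    d = np.empty((X.shape[0], Y.shape[0]))
    for start in range(0, X.shape[0], block):
        stop = start + block
        # (block, 5000, 3072) temporary: large, but far below 57 GB
        d[start:stop] = np.square(Y[None, :, :] - X[start:stop, None, :]).sum(axis=2)
    return d

block trades memory for loop count: block=1 reproduces f1, and block=500 reproduces the fully vectorized version.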
In [23]: timeit -n1 -r1 f2(X,Y)
1min 21s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [24]: timeit -n1 -r1 f1(X,Y)
1min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
A variation on f1 that iterates over the 5000 rows of Y (instead of the 500 rows of X) times at 1min 25s. A few iterations are generally better than many, provided memory management issues don't penalize you.
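For reference, that variation might look like this (my reconstruction; the answer only reports its timing):

def f1_yloop(X, Y):
    d = np.zeros((X.shape[0], Y.shape[0]))
    for j in range(Y.shape[0]):          # 5000 iterations instead of 500
        d[:, j] = np.sum(np.square(X - Y[j]), axis=1)
    return d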
An iterative, but compiled, version using numba is noticeably better:
In [32]: import numba
    ...: @numba.jit(nopython=True)
    ...: def f3(X, Y):
    ...:     d = np.zeros((X.shape[0], Y.shape[0]))
    ...:     for i in range(X.shape[0]):
    ...:         for j in range(Y.shape[0]):
    ...:             d[i,j] = np.sum(np.square(X[i] - Y[j]))
    ...:     return d
    ...:
In [33]: D3=f3(X,Y)
In [34]: np.allclose(D,D3)
Out[34]: True
In [35]: timeit -n1 -r1 f3(X,Y)
20 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
and making the sum iteration explicit (so the compiled loop accumulates scalars instead of allocating a temporary difference array for every (i, j) pair) shaves off some more time:
In [36]: @numba.jit(nopython=True)
    ...: def f4(X, Y):
    ...:     d = np.zeros((X.shape[0], Y.shape[0]))
    ...:     for i in range(X.shape[0]):
    ...:         for j in range(Y.shape[0]):
    ...:             for k in range(X.shape[1]):
    ...:                 d[i,j] += (X[i,k] - Y[j,k])**2
    ...:     return d
    ...:
In [37]: D4 = f4(X,Y)
In [38]: np.allclose(D,D4)
Out[38]: True
In [39]: timeit -n1 -r1 f4(X,Y)
10.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Upvotes: 2
Reputation: 1707
I guess the culprit hides in f1:
d[i] = np.sum(np.square(Y - X[i]), axis=1)
You subtract X[i] from the whole Y array on each iteration, which causes intense broadcasting that iterates over all 5000 rows of Y, while f2 subtracts two distinct rows at a time, with the outer loop bounded by 0 <= i < 500.
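A quick way to see the difference in per-iteration work (my check, not part of this answer) is to compare the shapes of the temporaries each version creates:

X = np.ones((500, 3072))
Y = np.ones((5000, 3072))

print((Y - X[0]).shape)     # (5000, 3072): f1 builds a Y-sized temporary per i
print((X[0] - Y[0]).shape)  # (3072,): f2 touches only two rows per (i, j)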
Upvotes: 3
Reputation: 6246
I see iteration in both definitions. Vectorization only pays off on large quantities! You're effectively vectorizing only one row's worth of values at a time, and a Python loop is more than capable of handling that: the overhead of setting up the vectorized calculation is too large by comparison here. Your f1 isn't fully vectorized because of the iteration involved, so this comparison isn't fair to vectorization when iteration would suffice.
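A minimal sketch of that fixed per-call cost (my illustration, not from the answer; exact numbers are machine dependent):

import timeit
import numpy as np

tiny = np.ones(3)
row = np.ones(3072)   # one row's worth of values, as in f1 and f2

# Both calls pay roughly the same NumPy dispatch overhead; for the tiny
# array that overhead dominates, so the two times are far closer than the
# 1000x difference in element count would suggest.
print(timeit.timeit(lambda: np.square(tiny).sum(), number=100_000))
print(timeit.timeit(lambda: np.square(row).sum(), number=100_000))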
tl;dr: have enough values that the vectorization "setup" cost is offset by the gains of vectorization itself.
Upvotes: 0