Reputation: 185
I was implementing a weighting scheme called TF-IDF on a set of 42,000 images, each consisting of 784 pixels. This is basically a 42,000 by 784 matrix.
The first method I attempted made use of explicit loops and took more than 2 hours.
def tfidf(color, img_pix, img_total):
    if img_pix == 0:
        return 0
    else:
        return color * np.log(img_total / img_pix)
...
result = np.array([])
for img_vec in data_matrix:
    double_vec = zip(img_vec, img_pix_vec)
    result_row = np.array([tfidf(x[0], x[1], img_total) for x in double_vec])
    try:
        result = np.vstack((result, result_row))
    except ValueError:
        # the first iteration raises a ValueError, since vstack
        # requires rows of the same length
        result = result_row
The second method I attempted used numpy matrices and took less than 5 minutes. Note that `data_matrix` and `img_pix_mat` are both 42,000 by 784 matrices, while `img_total` is a scalar.
result = data_matrix * np.log(np.divide(img_total,img_pix_mat))
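(One caveat I noticed with this one-liner: it divides by zero wherever `img_pix_mat` contains a 0, unlike the scalar `tfidf`, which returns 0 in that case. A sketch using `np.where` on hypothetical small matrices reproduces the zero branch:)

```python
import numpy as np

# Hypothetical small stand-ins for the 42,000 x 784 matrices in the question.
data_matrix = np.array([[3.0, 0.0], [1.0, 2.0]])
img_pix_mat = np.array([[2.0, 0.0], [2.0, 1.0]])
img_total = 4.0

# Substitute 1 where the count is 0 so the division never produces inf,
# then zero those entries out, mirroring the "return 0" branch of tfidf().
safe_pix = np.where(img_pix_mat == 0, 1.0, img_pix_mat)
result = np.where(img_pix_mat == 0,
                  0.0,
                  data_matrix * np.log(img_total / safe_pix))
```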
I was hoping someone could explain the immense difference in speed.
The authors of the paper "The NumPy array: a structure for efficient numerical computation" (http://arxiv.org/pdf/1102.1523.pdf) state at the top of page 4 that they observed a 500-fold speed increase due to vectorized computation. I'm presuming much of the speedup I'm seeing comes from this. However, I would like to go a step further and ask: why are `numpy` vectorized computations that much faster than standard Python loops?
Also, perhaps you might know of other reasons why the first method is slow. Do try/except structures have high overhead? Or does forming a new `np.array` on each iteration take a long time?
Thanks.
Upvotes: 15
Views: 11447
Reputation: 2309
It's due to the internal workings of numpy: as far as I know, numpy is implemented in C internally, so everything you push down into numpy is actually much faster, because it executes in compiled C rather than in the Python interpreter.
Edit:
Taking out the `zip` and replacing it with a `vstack` should make it faster too (`zip` tends to be slow when its arguments are very large, and `vstack` is faster for that; additionally, `vstack` is numpy (thus C), while `zip` is Python).
And yes, if I understood correctly (not sure about that), you are running a try/except block 42k times. That should definitely be bad for speed.
Test:
T = numpy.ndarray((5, 10))
for t in T:
    print t.shape
which prints (10,) for each of the 5 rows.
This means that yes, if your matrices are 42k by 784, you are running a try/except block 42k times, as well as doing a `zip` each iteration. I assume that affects the computation times, but I'm not certain it is the main cause.
(So each of your 42k iterations takes about 0.17 s on average. I am quite certain a try/except block alone doesn't take 0.17 seconds, but perhaps the overhead it causes contributes to it?)
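A rough way to check that guess (a sketch using `timeit`; absolute numbers are machine-dependent):

```python
import timeit

# Time a trivial statement with and without a try/except wrapper that
# never actually raises; the difference is the per-iteration overhead.
plain = timeit.timeit('x = 1 + 1', number=1_000_000)
wrapped = timeit.timeit('''
try:
    x = 1 + 1
except ValueError:
    pass
''', number=1_000_000)
```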
Try changing the following:
double_vec = zip(img_vec, img_pix_vec)
result_row = np.array([tfidf(x[0], x[1], img_total) for x in double_vec])
to
result_row = np.array([tfidf(img_vec[i], img_pix_vec[i], img_total)
                       for i in xrange(len(img_vec))])
That, at least, gets rid of the `zip` call. I'm not sure whether `zip` is costing you one minute or nearly two hours (I know `zip` is slow compared to `numpy.vstack`, but I have no clue whether it would account for a two-hour gain).
Upvotes: 9
Reputation: 282026
The difference you're seeing isn't due to anything fancy like SSE vectorization. There are two primary reasons. The first is that NumPy is written in C, and the C implementation doesn't have to go through the tons of runtime method dispatch and exception checking and so on that a Python implementation goes through.
The second reason is that, even as Python code goes, your loop-based implementation is inefficient: you're calling `vstack` in a loop, and every time you call `vstack`, it has to completely copy all of the arrays you've passed to it. That adds an extra factor of `len(data_matrix)` to your asymptotic complexity.
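A small back-of-the-envelope sketch of that extra factor (hypothetical sizes; the arithmetic just counts element copies):

```python
import numpy as np

n, m = 1000, 10
rows = [np.ones(m) for _ in range(n)]

# Growing an array with vstack copies everything accumulated so far on
# every call, so the k-th call copies about m * k elements; the total is
# m * n * (n + 1) / 2, i.e. quadratic in the number of rows.
copied = sum(m * k for k in range(1, n + 1))

# Collecting the rows in a list and stacking once copies each element
# only once, which is linear in the number of rows.
stacked = np.vstack(rows)
```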
Upvotes: 8