I use strings in a np.array, the computation is O(N²), and for 10k rows and 10 columns it needs more than an hour. That's too slow, but I can't figure out how to accelerate it.
There is a 2D array like:
X = np.array([['asd', 'qwe'], ['asd', 'rty']])
but much bigger.
What is the fastest way to compute the pairwise distances between all its rows?
Now I use:
import numpy as np
from Levenshtein import ratio

# ratio = normalized Levenshtein similarity in [0, 1]:
# (len(s1) + len(s2) - edit_distance) / (len(s1) + len(s2))
vectorRatio = np.vectorize(ratio)

Q = np.zeros((len(X), len(X)))
for i in range(len(X)):
    for j in range(len(X)):
        Q[i][j] = 10 - sum(vectorRatio(X[i], X[j]))
Please note that np.vectorize is not true vectorization, and is not meant for performance. It is basically a for loop. It is still faster than a pure Python for loop, because the looping itself is done in C, but its body is still just a call of a pure Python function for each element.
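As a minimal illustration (with a hypothetical toy function, shout, that is not from the question), np.vectorize does roughly the same Python-level work as a comprehension, just with nicer broadcasting semantics:

import numpy as np

def shout(s):
    # plain Python function; it is called once per element either way
    return s.upper()

words = np.array(['asd', 'qwe', 'rty'])

vec_shout = np.vectorize(shout)
print(vec_shout(words))                     # ['ASD' 'QWE' 'RTY']
print(np.array([shout(w) for w in words])) # same result, similar cost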
That being said, you are not even using the (not so big) advantage of your np.vectorize fully. You have 3 nested loops (2 nested loops on rows, and 1 on columns), and only the innermost one is done by your "vectorized" function. The two others could be as well.
The inner loop of
for i in range(len(X)):
    for j in range(len(X)):
        Q[i][j] = 10 - sum(vectorRatio(X[i], X[j]))
could, for example, easily be
for i in range(len(X)):
    Q[i, :] = 10 - vectorRatio(X[i], X).sum(axis=1)
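This works because X[i] has shape (n_cols,) and broadcasts against X of shape (n_rows, n_cols). A quick shape check with a stand-in two-argument function (hypothetical, just so the sketch runs without the Levenshtein dependency; ratio broadcasts exactly the same way):

import numpy as np

same = np.vectorize(lambda a, b: float(a == b))

X = np.array([['asd', 'qwe'], ['asd', 'rty']])
out = same(X[0], X)     # (2,) broadcast against (2, 2) -> (2, 2)
print(out.shape)        # (2, 2)
print(out.sum(axis=1))  # one value per row: [2. 1.]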
And, less obviously, with some broadcasting, you can also let vectorRatio handle the outer loop:

Q = 10 - vectorRatio(X[None, ...], X[:, None, :]).sum(axis=2)
X[None, ...] is an array containing a single 2D array, your original X: its shape is (1, n_rows, n_cols). And X[:, None, :] is an array of len(X) arrays, each containing a single row: its shape is (n_rows, 1, n_cols).
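In shapes, for the small example above:

import numpy as np

X = np.array([['asd', 'qwe'], ['asd', 'rty']])
print(X.shape)              # (2, 2): 2 rows, 2 columns
print(X[None, ...].shape)   # (1, 2, 2): the whole array, wrapped once
print(X[:, None, :].shape)  # (2, 1, 2): each row wrapped on its own
# (1, n, c) against (n, 1, c) broadcasts to (n, n, c): every row is
# paired with every row, column by column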
So broadcasting forces NumPy to combine every row with every row. A little bit like np.arange(10).reshape(-1, 1) + np.arange(10, 20).reshape(1, -1) produces a 2D array of all sums of a number in [0, 10) with a number in [10, 20).
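That analogy as a runnable sketch:

import numpy as np

a = np.arange(10).reshape(-1, 1)      # column: numbers in [0, 10)
b = np.arange(10, 20).reshape(1, -1)  # row: numbers in [10, 20)
grid = a + b                          # (10, 1) + (1, 10) -> (10, 10)
print(grid.shape)   # (10, 10)
print(grid[0, 0])   # 0 + 10 = 10
print(grid[9, 9])   # 9 + 19 = 28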
So, no for loop left.
But again, vectorRatio in reality just calls ratio for each pair of strings. So in this example, 2×2×2 = 8 times. In your real case, 10000×10000×10 = 1 billion times. The vectorization of np.vectorize only accelerates the for loop itself (the iteration bookkeeping), not the billion calls to ratio.
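Putting the pieces together, a minimal end-to-end sketch (the constant 10 matches the 10 columns of your real data; it is kept here on the toy 2-column example only so both versions stay comparable):

import numpy as np
from Levenshtein import ratio

X = np.array([['asd', 'qwe'], ['asd', 'rty']])
vectorRatio = np.vectorize(ratio)

# fully broadcast version: (1, n, c) vs (n, 1, c) -> (n, n, c) ratios,
# then sum the per-column ratios for each pair of rows
Q = 10 - vectorRatio(X[None, ...], X[:, None, :]).sum(axis=2)

# sanity check against the original double loop
Q_ref = np.zeros((len(X), len(X)))
for i in range(len(X)):
    for j in range(len(X)):
        Q_ref[i][j] = 10 - sum(vectorRatio(X[i], X[j]))
assert np.allclose(Q, Q_ref)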
Case | Your code | No for loop
---|---|---
2 rows, 2 cols | 162 μs | 51 μs
10 rows, 10 cols | 4.25 ms | 0.70 ms
100 rows, 10 cols | 425 ms | 65 ms
1000 rows, 10 cols | 43 s | 6.5 s
So it is safe to say that (unsurprisingly) both versions are O(n²), and for your 10k-rows/10-cols case it would take about 4300 seconds with your code vs 650 with mine.
Not a huge gain factor (compared to the factor well over 1000 that we usually get from true vectorization), because, again, this is not true vectorization. But that is still 1 hour of waiting spared on a 1-hour-10-minute computation.