anon.for
anon.for

Reputation: 57

Fastest creating a pair-wise distance matrix between rows of a 2D-array

I use string in np.array and it has N^2 complexity and for 10k rows and 10 columns it needs more than 1 hour, and it's too slow, but i cant imagine how to accelerate it

There is 2D array like:

X = np.array([['asd', 'qwe'], ['asd', 'rty']])

but a few bigger

which is the fastest way to create pairwise distance between all its rows?

Now i use

from Levenshtein import ratio 
#normalized levenshtein distance = levenshtein/(len(1st word)+len(2nd word))


vectorRatio = np.vectorize(ratio) 
Q = np.zeros((len(X), len(X)))
for i in range(len(X)):
    for j in range(len(X)):
        Q[i][j] = 10 - sum(vectorRatio(X[i],X[j]))

Upvotes: 0

Views: 181

Answers (1)

chrslg
chrslg

Reputation: 13346

Please, note that np.vectorize is not true vectorization, and is not meant for performance. It is basically a for loop. It is still faster than a pure for loop, because the for loop itself is done in C, but its contents is still just calls of pure python functions.

That being said, you are not using the (not so big) advantage of your np.vectorize. You have 3 nested loops (2 nested loops on rows, and 1 on columns), and only the last one is done by your "vectorized" functions. The two others could also be.

The inner loop of

for i in range(len(X)):
    for j in range(len(X)):
        Q[i][j] = 10 - sum(vectorRatio(X[i],X[j]))

could, for example, easily be

for i in range(len(X)):
    Q[i,:] = 10 - vectorRatio(X[i], X).sum(axis=1)

And, less obvious, but with some broadcasting, you call also let the outer loop to vectorRatio

Q=10 - vectorRatio(X[None,...], X[:,None,:]).sum(axis=2)

X[None,...] is an array containing a single 2d-array that is your original X. And X[:,None,:] is an array of len(X) arrays of 1xlen(X) arrays for each line. So, broadcasting forces to duplicate combine all lines with all lines. A little bit like np.arange(10).reshape(-1,1) + np.arange(10,20).reshape(1,-1) produces a 2d-array of all sums of a pair of a number in [0,10) and a number in [10,20)

So, no for loop left.

But again, vectorRatio in reality just calls ratio for each pair of strings. So in this example, 2×2×2 = 8 times. In your real case 10000×10000×10 = 1 billion times. The vectorization of np.vectorize just accelerate the for itself (the counting), but not the billion calls to ratio.

Timing

Case Your code No for loop
2 rows, 2 cols 162 μs 51 μs
10 rows 10 cols 4.25 ms 0.70 ms
100 rows 10 cols 425 ms 65 ms
1000 rows 10 cols 43 s 6.5 s

So, it is safe to say that (unsurprisingly) it is a O(n²), and for your 10k/10 case, it would take 4300 seconds for your code, vs 650 for mine.

Not a huge gain factor (compared to the factor well over 1000 that we usually get when we vectorize code). Because, again, not a true vectorization. But well, that is still 1 hour of waiting spared on a 1 hour and 10 minutes computation.

Upvotes: 1

Related Questions