Reputation: 85
I am working on a KNN algorithm for a university assignment. At the moment I'm finding the Euclidean distance between each of the training vectors, stored as a SciPy lil_matrix (due to the sparseness of the values in the vectors), and a testing vector, stored as a 1 x n lil_matrix for the same reason.
To work out the Euclidean distance I'm doing the following:
for positiveIndex, positivesComparison in enumerate(positives):
    result.append((spatial.distance.euclidean(positivesComparison.todense(), sentenceVector.todense()), positiveIndex, 1))
where sentenceVector is a lil_matrix with 1 row, and positives is a lil_matrix of size n x m.
I want to work out something faster than going through the positives matrix row by row and evaluating the Euclidean distance every time. Ideally I'd compute the Euclidean distance between the whole positives matrix and the sentenceVector vector in one operation, and get back a 1 x m matrix of Euclidean distances. The reason is that the current approach is relatively slow, essentially O(n·m), and I need to run it for more than one test sentence. Is this possible, and if so, how would I do it?
Note: the task is to evaluate performance using different K values for the KNN algorithm, not the actual implementation of KNN (although we are not allowed to use KNN libraries for the task).
Upvotes: 2
Views: 4836
Reputation: 17797
You can compute batch Euclidean distances pretty easily:
In [10]: a = np.random.random(size=(4,5))
In [11]: b = np.random.random(size=(1,5))
In [12]: from scipy.spatial.distance import euclidean
In [13]: [euclidean(aa, b) for aa in a]
Out[13]: [1.1430615949614429, 0.568517046878056, 1.3302284168375587, 1.0581730230363529]
In [14]: np.sqrt(np.sum((a - b)**2, axis=1))
Out[14]: array([ 1.1431, 0.5685, 1.3302, 1.0582])
But we want to use sparse matrices, which makes things a little more difficult:
In [22]: import scipy.sparse as ss
In [23]: sa = ss.lil_matrix(a)
In [24]: sb = ss.lil_matrix(b)
In [25]: np.sqrt(np.sum((sa - sb)**2, axis=1)) # <-- ValueError: inconsistent shapes
It's possible to do, but you'll need to use some tricks.
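One such trick (a sketch, not the only option): expand the squared distance as ||x - y||^2 = ||x||^2 - 2 x·y + ||y||^2, so you only ever need sparse products and row sums and never a broadcast sparse subtraction:

```python
import numpy as np
import scipy.sparse as ss

rng = np.random.default_rng(0)
a = rng.random((4, 5))
b = rng.random((1, 5))
sa = ss.csr_matrix(a)  # CSR rather than LIL, per the advice below
sb = ss.csr_matrix(b)

# Row-wise squared norms and cross terms, all computed sparsely.
sq_a = np.asarray(sa.multiply(sa).sum(axis=1)).ravel()   # ||a_i||^2 per row
sq_b = np.asarray(sb.multiply(sb).sum(axis=1)).ravel()   # ||b||^2
cross = np.asarray(sa.dot(sb.T).todense()).ravel()       # a_i . b per row
# Clip tiny negative values from floating-point round-off before sqrt.
dists = np.sqrt(np.maximum(sq_a - 2 * cross + sq_b, 0))
```

This gives one distance per row of `sa`, matching the dense `np.sqrt(np.sum((a - b)**2, axis=1))` result.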
More importantly, you should look at how big (and how sparse) your vectors really are. You might be faster leaving everything dense, and it will certainly save you some headache.
Finally, I'd avoid using LIL-format matrices, as they're one of the slowest formats available. For your case, look into CSR format.
EDIT: I forgot the simplest solution: use scikit-learn!
In [36]: from sklearn.metrics import pairwise_distances
In [37]: pairwise_distances(a, b)
Out[37]:
array([[ 1.1431],
[ 0.5685],
[ 1.3302],
[ 1.0582]])
In [38]: pairwise_distances(sa, sb)
Out[38]:
array([[ 1.1431],
[ 0.5685],
[ 1.3302],
[ 1.0582]])
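And since this is for KNN, here is a sketch of how the batch distances plug into a k-nearest-neighbours vote (the `train`/`labels` data here are made up for illustration, not the asker's):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
train = rng.random((6, 5))
labels = np.array([1, 0, 1, 1, 0, 0])  # hypothetical class labels
test = rng.random((1, 5))

k = 3
d = pairwise_distances(train, test).ravel()     # one distance per training row
nearest = np.argsort(d)[:k]                     # indices of the k smallest
prediction = np.bincount(labels[nearest]).argmax()  # majority vote
```

`pairwise_distances` accepts sparse input as well, so the same code works with CSR training data.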
Upvotes: 4