Reputation: 803
For an assignment I have to create my own kNN classifier without using for loops. I've managed to use scipy.spatial.KDTree to find the nearest neighbour(s) of each vector in the testing set, and then I use scipy.stats.mode to return the list of predicted classes. However, this takes extremely long when the set is very large. For example, I created the following example, inspired by this page:
import numpy as np
from sklearn.model_selection import train_test_split
from scipy import spatial
from scipy.stats import mode
def predict(X_test):
    # look up the k nearest training points for every test point,
    # then take the majority class along the neighbour axis
    X = Y_train[tree.query(X_test, k=k)[1]]
    Y = mode(X, axis=-1)[0].T[0]
    return Y
def load_data():
    # two 2-D Gaussian clusters of 100 points each
    x1 = 1.5 * np.random.randn(100) + 1
    y1 = 1.5 * np.random.randn(100) + 2
    x2 = 1.5 * np.random.randn(100) + 3
    y2 = 1.5 * np.random.randn(100) + 4
    X = np.vstack((np.hstack((x1, x2)), np.hstack((y1, y2)))).T
    y = 1.0 * np.hstack((np.zeros(100), np.ones(100)))
    return X, y
if __name__ == '__main__':
    X, y = load_data()
    X_train, X_test, Y_train, Y_test = train_test_split(X, y)
    k = 7
    tree = spatial.KDTree(X_train)  # KD-tree over the training points, queried in predict()
    # grid covering the data, used to plot the decision boundary
    xx, yy = np.meshgrid(np.arange(X[:, 0].min() - 1, X[:, 0].max() + 1, 0.02),
                         np.arange(X[:, 1].min() - 1, X[:, 1].max() + 1, 0.02))
    Z = predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
This takes very long (40-60 seconds!) because of the X = Y_train[tree.query(X_test, k=k)[1]] part. Is there any way I can improve the speed of this specific implementation, or should I think of another way to do it? For example, sklearn's implementation only takes 0.4 seconds, which is insanely fast compared to mine.
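For reference, the sklearn baseline I'm comparing against looks roughly like this (a minimal sketch on small synthetic data here, not my actual benchmark grid):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.standard_normal((150, 2))
Y_train = rng.integers(0, 2, 150)
X_test = rng.standard_normal((50, 2))

# algorithm='kd_tree' makes sklearn use a KD-tree, like my implementation
clf = KNeighborsClassifier(n_neighbors=7, algorithm='kd_tree')
clf.fit(X_train, Y_train)
Z = clf.predict(X_test)
```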
Upvotes: 1
Views: 872
Reputation: 2041
Had to read your code a couple of times, but then I saw you're using KDTree
and not cKDTree
. The latter is implemented in Cython (instead of plain Python and NumPy) and should give you a decent speedup.
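It's a drop-in replacement for the query you're already doing; something like this sketch (synthetic data for illustration, and I've passed keepdims=True to mode to be explicit about the output shape across SciPy versions):

```python
import numpy as np
from scipy import spatial
from scipy.stats import mode

rng = np.random.default_rng(0)
X_train = rng.standard_normal((150, 2))
Y_train = rng.integers(0, 2, 150).astype(float)
X_test = rng.standard_normal((50, 2))
k = 7

# cKDTree has the same query API as KDTree, but runs in Cython
tree = spatial.cKDTree(X_train)
neighbour_labels = Y_train[tree.query(X_test, k=k)[1]]  # shape (n_test, k)
Z = mode(neighbour_labels, axis=1, keepdims=True)[0].ravel()  # majority vote per row
```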
Upvotes: 3