user5368737

Reputation: 803

Improving performance (speed) of kNN classifier

As an assignment I have to create my own kNN classifier without using for loops. I've managed to use scipy.spatial.KDTree to find the nearest neighbour(s) of each vector in the test set, and I then use scipy.stats.mode to return the list of predicted classes. However, this takes extremely long when the size of the set is very large. For example, I created the following example, inspired by this page

import numpy as np
from sklearn.model_selection import train_test_split
from scipy import spatial
from scipy.stats import mode

def predict(X_test):
    # look up the k nearest training points of each test point,
    # then take the most common class label among those neighbours
    X = Y_train[tree.query(X_test, k=k)[1]]
    Y = mode(X, axis=-1)[0].T[0]
    return Y

def load_data():
    x1 = 1.5 * np.random.randn(100) + 1
    y1 = 1.5 * np.random.randn(100) + 2
    x2 = 1.5 * np.random.randn(100) + 3
    y2 = 1.5 * np.random.randn(100) + 4
    X  = np.vstack((np.hstack((x1,x2)),np.hstack((y1,y2)))).T
    y  = 1.0*np.hstack((np.zeros(100), np.ones(100)))
    return X, y

if __name__ == '__main__':
    X, y = load_data()
    X_train, X_test, Y_train, Y_test = train_test_split(X, y)

    k = 7
    tree = spatial.KDTree(X_train)

    # dense grid over the plane, on which the classifier is evaluated
    xx, yy = np.meshgrid(np.arange(-4, 8, 0.01), np.arange(-4, 8, 0.01))
    Z = predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

This takes very long (40-60 seconds!) because of the X = Y_train[tree.query(X_test, k=k)[1]] part. Is there any way I can improve the speed of this specific implementation, or should I think of another way to do it? For example, sklearn's implementation only takes 0.4 seconds, which is insanely fast compared to my implementation.
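For reference, the sklearn baseline I'm comparing against looks roughly like this (a minimal sketch, assuming `KNeighborsClassifier` with the same `k` and its default algorithm selection; the data here is random dummy data, not my actual set):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 2))   # dummy 2-D training points
Y_train = rng.integers(0, 2, size=150)  # dummy binary class labels
X_test = rng.normal(size=(50, 2))

# sklearn picks a fast neighbour-search structure internally
clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)  # one predicted label per test point
```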

Upvotes: 1

Views: 872

Answers (1)

user7138814

Reputation: 2041

Had to read your code a couple of times, but then I saw you're using KDTree and not cKDTree. The latter is implemented in Cython (instead of plain Python and NumPy) and should give you a decent speedup.
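The change is a one-liner in your setup code, something like this (a minimal sketch with random dummy data; `cKDTree.query` has the same return convention as `KDTree.query`, a `(distances, indices)` pair):

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))
X_test = rng.normal(size=(200, 2))

# cKDTree is the Cython implementation: same API as KDTree,
# but construction and queries are much faster
tree = spatial.cKDTree(X_train)
dist, idx = tree.query(X_test, k=7)  # both arrays have shape (200, 7)
```

Note that on recent SciPy versions `KDTree` has become a thin subclass of `cKDTree`, so the gap only exists on older releases.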

Upvotes: 3
