Eoin Ó Coinnigh

Reputation: 317

Faster kNN Classification Algorithm in Python

I want to code my own kNN algorithm from scratch because I need to weight the features. The problem is that my program is still really slow despite removing for loops and using built-in NumPy functionality.

Can anyone suggest a way to speed this up? I don't use np.sqrt for the L2 distance because it's unnecessary and actually slows everything down quite a bit.
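Since the square root is monotonic, it only rescales the distances and never changes which neighbours come out nearest, e.g.:

import numpy as np

d2 = np.random.rand(10)  # squared L2 distances to 10 training points
# ranking by squared distance equals ranking by true distance
assert (np.argsort(d2) == np.argsort(np.sqrt(d2))).all()

Here is my implementation: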

import numpy as np


class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """

        # evaluated only for its side effect: __helper() appends each prediction
        np.fromiter((self.__helper(qc) for qc in testing_data), float)
        return self.predictions


    def __helper(self, qc):
        neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
        neighbours = np.array([neighbours]).T 
        indexes = np.array([range(len(self.X_train))]).T
        neighbours = np.append(indexes, neighbours, axis=1)

        # Sort by second column - distances
        neighbours = neighbours[neighbours[:,1].argsort()]  
        k_cases = neighbours[ :self.k]
        indexes = [x[0] for x in k_cases]

        y_answers = [self.y_train[int(x)] for x in indexes]
        answer = max(set(y_answers), key=y_answers.count)  # most common label
        self.predictions.append(answer)
        return answer  # np.fromiter() in predict() needs a number, not None


    def __weighted_euclidean(self, qc, other):
        """
        Custom weighted euclidean distance

        returns: floating point number
        """

        return np.sum( ((qc - other)**2) * self.weights )

Upvotes: 10

Views: 21998

Answers (3)

Sandipan Dey

Reputation: 23101

Modifying your class to use a BallTree data structure (build time O(n (log n)^2); see https://arxiv.org/ftp/arxiv/papers/1210/1210.6122.pdf) with a custom DistanceMetric (callable functions are NOT supported as the metric for KDTree, as noted in the scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html), you can use the following code, which also removes the explicit loop over queries in predict():

from sklearn.neighbors import BallTree
from sklearn.neighbors import DistanceMetric
from scipy.stats import mode

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.tree = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights
        self.tree = BallTree(X_train,
                             metric=DistanceMetric.get_metric('wminkowski', p=2, w=weights))

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """
        indexes = self.tree.query(testing_data, self.k, return_distance=False)
        y_answers = self.y_train[indexes]
        self.predictions = np.apply_along_axis(lambda x: mode(x)[0], 1, y_answers)
        return self.predictions

Training:

from time import time
import numpy as np
n, d = 10000, 2
begin = time()
cls = GlobalWeightedKNN()
X_train = np.random.rand(n,d)
y_train = np.random.choice(2,n, replace=True)
cls.fit(X_train, y_train, k=3, weights=np.random.rand(d))
end = time()
print('time taken to train {} instances = {} s'.format(n, end - begin))
# time taken to train 10000 instances = 0.01998615264892578 s

Testing / prediction:

begin = time()
X_test = np.random.rand(n,d)
cls.predict(X_test)
end = time()
print('time taken to predict {} instances = {} s'.format(n, end - begin))
# time taken to predict 10000 instances = 3.732935905456543 s

Upvotes: 2

amirhe

Reputation: 2341

You can take a look at this great article introducing faiss:
"Make kNN 300 times faster than Scikit-learn's in 20 lines!"
It supports the GPU and is developed in C++ behind the scenes.

import numpy as np
import faiss


class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        # IndexFlatL2 is an exact (brute-force) L2 index; faiss expects float32
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y

    def predict(self, X):
        # search() returns the k nearest training indices for every query row
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self.y[indices]
        # majority vote among the k neighbours' labels
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions
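A minimal usage sketch on synthetic data (assuming faiss is installed, e.g. the faiss-cpu package, and that the labels are non-negative integers so np.bincount works). IndexFlatL2 has no weight parameter, but the feature weighting from the question can be folded into the data by pre-scaling every column by sqrt(w), since sum(w * (x - y)**2) is the plain squared L2 distance between sqrt(w) * x and sqrt(w) * y:

import numpy as np

n, d = 10000, 2
weights = np.random.rand(d)
X_train = np.random.rand(n, d)
y_train = np.random.choice(2, n)
X_test = np.random.rand(1000, d)

scale = np.sqrt(weights)  # fold the feature weights into the coordinates
knn = FaissKNeighbors(k=3)
knn.fit(X_train * scale, y_train)
y_pred = knn.predict(X_test * scale)

Note that IndexFlatL2 is an exact brute-force index; the speed-up comes from heavily optimised C++ distance computations rather than from a different algorithm.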

Upvotes: 9

jakevdp

Reputation: 86330

Scikit-learn uses a KD Tree or Ball Tree to compute nearest neighbors in O[N log(N)] time. Your algorithm is a direct approach that requires O[N^2] time, and also uses nested for-loops within Python generator expressions which will add significant computational overhead compared to optimized code.
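For illustration, here is a vectorized brute-force sketch (a hypothetical helper, not scikit-learn code, and assuming the full n_test x n_train distance matrix fits in memory) that removes the Python-level loops while keeping the O[N^2] algorithm:

import numpy as np

def weighted_knn_predict(X_train, y_train, X_test, k, weights):
    # pairwise weighted squared L2 distances via broadcasting:
    # diff has shape (n_test, n_train, d); d2 has shape (n_test, n_train)
    diff = X_test[:, None, :] - X_train[None, :, :]
    d2 = np.einsum('ijk,k,ijk->ij', diff, weights, diff)
    # unordered indices of the k smallest distances in each query row
    nearest = np.argpartition(d2, k, axis=1)[:, :k]
    votes = y_train[nearest]
    # majority vote per row (assumes non-negative integer labels)
    return np.array([np.bincount(row).argmax() for row in votes])

This pushes all of the arithmetic into compiled NumPy routines, which is where the overhead of the generator expressions disappears.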

If you'd like to compute weighted k-neighbors classification using a fast O[N log(N)] implementation, you can use sklearn.neighbors.KNeighborsClassifier with the weighted minkowski metric, setting p=2 (for euclidean distance) and setting w to your desired weights. For example:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
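A complete run on synthetic data might look like this (hypothetical shapes; note that the question's k is passed as n_neighbors, which defaults to 5):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

n, d = 10000, 2
X_train = np.random.rand(n, d)
y_train = np.random.choice(2, n)
X_test = np.random.rand(n, d)
weights = np.random.rand(d)

model = KNeighborsClassifier(n_neighbors=3, metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)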

Upvotes: 13
