Reputation: 3603
KNN is such a straightforward algorithm that it's easy to implement:
# for each test datapoint in X_test:
#     calculate its distance to every point in X_train
#     find the k closest points
#     take a majority vote of those k neighbors and use it as the prediction for this test datapoint
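In code, a minimal brute-force version (just a sketch, assuming NumPy arrays for X_train/X_test and an integer label array y_train) would look something like this:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    predictions = []
    for x in X_test:
        # distance from this test point to every training point
        distances = np.linalg.norm(X_train - x, axis=1)
        # indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # majority vote among the k neighbors
        predictions.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return predictions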
Yet I think the time complexity is not good enough. How is the algorithm optimized when it is implemented in practice? (What tricks or data structures does it use?)
Upvotes: 0
Views: 933
Reputation: 1889
What you describe is the brute-force kNN calculation, which costs O(size(X_test) * size(X_train) * d), where d is the number of dimensions of the feature vectors.
More efficient solutions use spatial indexing to put an index on the X_train data. This typically reduces each individual lookup to O(log(size(X_train)) * d) or even O(log(size(X_train)) + d).
Common spatial indexes are k-d trees, R-trees (and variants such as the R*-tree), quadtrees/octrees, and ball trees.
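As an illustrative sketch (assuming SciPy is available; any equivalent k-d tree implementation would do), the index is built once over X_train and each query then avoids scanning the full training set:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 3))        # 10,000 training points in 3 dimensions
X_test = rng.random((5, 3))             # 5 query points

tree = cKDTree(X_train)                 # build the index once
dists, idx = tree.query(X_test, k=5)    # 5 nearest neighbors for each query point
print(idx.shape)                        # (5, 5): neighbor indices into X_train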
There is also the class of 'approximate' NN searches/queries. These trade correctness for speed: they may skip a few of the closest neighbors. You can find a performance comparison and numerous Python implementations here.
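As a sketch of what using such a library looks like (Annoy is picked here only as one example of an approximate index, not a specific recommendation), you trade a little recall for much faster queries:

from annoy import AnnoyIndex

d = 3                                   # dimensionality of the feature vectors
index = AnnoyIndex(d, 'euclidean')
for i, vector in enumerate(X_train):    # X_train as in the snippet above
    index.add_item(i, vector)
index.build(10)                         # build 10 random-projection trees
neighbors = index.get_nns_by_vector(X_test[0], 5)   # approximate 5-NN of one query point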
If you are looking for Java implementations of some of the spatial indexes above, have a look at my implementations.
Upvotes: 1
Reputation: 45
The k-nearest-neighbor algorithm differs from other learning methods in that no model is induced from the training examples. The data remain as they are; they are simply stored in memory.
Genetic algorithms have been combined with k-NN to improve performance. Another successful technique, known as instance selection, addresses both the storage cost and the noise sensitivity of k-NN at the same time. You can try this: when a new instance has to be classified, instead of involving all training instances in the retrieval of the k neighbors (which increases computing time), first select a smaller subset of instances and search only within it.
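As a very rough sketch of that idea (plain random sampling is used here only as a stand-in; real instance-selection methods such as condensed nearest neighbor or GA-based selection choose the kept instances far more carefully):

import numpy as np

def select_subset(X_train, y_train, fraction=0.2, seed=0):
    # naive instance selection: keep a random fraction of the training data
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(X_train), size=int(len(X_train) * fraction), replace=False)
    return X_train[keep], y_train[keep]

# X_small, y_small = select_subset(X_train, y_train)
# then run the usual kNN search against X_small instead of all of X_train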
You can also try:
Upvotes: 1