Kid_Learning_C

Reputation: 3603

How to optimize the K-nearest neighbor (KNN) algorithm?

KNN is a straightforward algorithm that is easy to implement:

# for each test data point in X_test:
#     calculate its distance from every point in X_train
#     find the k closest points
#     take a majority vote of the k neighbors and use that as the prediction for this test data point

Yet the time complexity of this seems poor. How is the algorithm optimized when it is implemented in practice? (For example, what tricks or data structures are used?)
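
For reference, here is a minimal runnable sketch of that brute-force loop, assuming X_train, y_train and X_test are NumPy arrays (the helper name knn_predict is just illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, X_test, k=3):
        # brute force: O(len(X_test) * len(X_train) * d)
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
            nearest = np.argsort(dists)[:k]               # indices of the k closest points
            preds.append(Counter(y_train[nearest]).most_common(1)[0][0])  # majority vote
        return np.array(preds)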

Upvotes: 0

Views: 933

Answers (2)

TilmannZ

Reputation: 1889

What you describe is the brute-force kNN calculation, which costs O(size(X_test) * size(X_train) * d), where d is the number of dimensions of the feature vectors.

More efficient solutions use spatial indexing to put an index on the X_train data. This typically reduces individual lookups to O(log(size(X_train)) * d) or even O(log(size(X_train)) + d); see the sketch after the list below.

Common spatial indexes are:

  • kD-Trees (they are often used, but scale badly with 'd')
  • R-Trees, such as the RStarTree
  • Quadtrees (usually not efficient for large 'd'; the PH-Tree, for example, works well with d=1000 and has excellent removal/insertion times; disclaimer: this is my own work)
  • BallTrees (I don't really know much about them)
  • CoverTrees (very fast lookup for high 'd', but long build-up times)
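
For illustration, a short sketch of a tree-backed lookup in Python; using scikit-learn's KDTree is an assumption here (this answer only names the structures, not a particular library), and BallTree exposes the same query interface:

    import numpy as np
    from sklearn.neighbors import KDTree   # BallTree can be swapped in here

    tree = KDTree(X_train)                # build the index once over the training data
    dist, ind = tree.query(X_test, k=3)   # per-query cost is roughly logarithmic for moderate d
    # majority vote over the neighbor labels (assumes integer class labels)
    y_pred = np.array([np.bincount(y_train[row]).argmax() for row in ind])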

There is also a class of 'approximate' NN searches/queries. These trade correctness for speed; they may skip a few of the closest neighbors. You can find a performance comparison and numerous implementations in Python here.
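
As one concrete possibility (not named in this answer, so treat it as an assumption), an approximate index such as Annoy could be used like this; the number of trees and k are illustrative only:

    import numpy as np
    from annoy import AnnoyIndex

    d = X_train.shape[1]
    index = AnnoyIndex(d, 'euclidean')
    for i, x in enumerate(X_train):
        index.add_item(i, x)          # register every training vector
    index.build(10)                   # 10 trees: more trees = better recall, slower build

    # approximate k nearest neighbors of one test point, then majority vote
    neighbors = index.get_nns_by_vector(X_test[0], 3)
    y_pred = np.bincount(y_train[neighbors]).argmax()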

If you are looking for Java implementations of some of the spatial indexes above, have a look at my implementations.

Upvotes: 1

Bruce Truth

Reputation: 45

The k-nearest neighbor algorithm differs from other learning methods because no model is induced from the training examples. The data remain as they are; they are simply stored in memory.

A genetic algorithm can be combined with k-NN to improve performance. Another successful technique, known as instance selection, has also been proposed to address both the storage cost and the noise sensitivity of k-NN simultaneously. You can try this: when a new instance has to be classified, instead of involving all training instances to retrieve the k neighbors (which increases the computing time), first select a smaller subset of instances and search for the neighbors only within it.
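
A minimal sketch of that idea; the per-class random sub-sampling used for the selection here is only an illustrative stand-in, not the specific selection method meant above:

    import numpy as np

    def select_subset(X_train, y_train, per_class=100, seed=0):
        # keep at most `per_class` examples of every class (illustrative selection rule)
        rng = np.random.default_rng(seed)
        keep = []
        for cls in np.unique(y_train):
            idx = np.flatnonzero(y_train == cls)
            keep.extend(rng.choice(idx, size=min(per_class, idx.size), replace=False))
        keep = np.array(keep)
        return X_train[keep], y_train[keep]

    # kNN then searches only the reduced set, which lowers the per-query cost
    X_small, y_small = select_subset(X_train, y_train)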

You can also try:

  1. Improving k-NN speed by reducing the number of training documents
  2. Improving k-NN by neighborhood size and similarity function
  3. Improving k-NN by advanced storage structures

Upvotes: 1
