Reputation: 540
I am doing a research on approximate nearest neighbor algorithms.
I recently found the Annoy Library which does an amazing job in finding KNN in reasonable speed. For deeper analysis, you can skim the meetup slides.
After reading the slides and inspecting the source code, I cannot see a reason why this algorithm works with high dimensional data better than KD-Tree.
KD-Tree is a great NN algorithm however in high dimensions it achieves a running time which is almost identical to a brute force search [O(n^2)] and thus it requires to check many neighboring leafs.
The reason for that is that in high dimensions, the volume of a unit sphere gets way smaller (you can take a look here).
The only difference I found in the Annoy library is that the partitioning of the space is done by using hyperplanes rather than partition one dimension at a time.
Did some ever analyzed this algorithm and can explain why it is so much faster than KD-Tree?
Upvotes: 3
Views: 2931
Reputation: 73366
Read this section of Annoy:
How does it work:
Using random projections and by building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.
We do this k times so that we get a forest of trees. k has to be tuned to your need, by looking at what tradeoff you have between precision and performance.
The key here is the forest of trees, I suppose. You are comparing versus a KD-tree, which is a rather old data structure, and as you said, suffers from the curse of dimensionality.
I think that using a forest of Randomized KD trees would be a good match here.
For example, my kd-GeRaF offers a good explanation on this (see the paper). If however the number of dimensions is relatively small, i.e. around 100, then FLANN should be interesting too. FLANN is older than kd-GeRaF, but inspired me much.
I do not see how LSH is used in Annoy, as suggested in the comments, but if it is, then it's no problem for an RKD forest.
Upvotes: 2