user271077

Reputation: 1006

When will k-d tree search for KNN not work?

I've been exploring and learning about k-d trees for KNN (the k-nearest-neighbors problem). When would the search not work, or not be worth it over the naive search? Are there any drawbacks to this approach?

Upvotes: 2

Views: 1973

Answers (3)

Ravi

Reputation: 3217

Time complexity for a KNN query is O(k * lg(n)) in the good case,

where k is the number of nearest neighbours and lg(n) is the height of a (balanced) kd-tree.

kd-trees will not work well if the dimensionality of the data set is high, because the volume of the search space grows exponentially with the number of dimensions.

Let's say you have many points around the origin; for simplicity, consider the 2-D case.

[figure: points clustered tightly around the origin in 2-D]

If you want to find the k nearest neighbours of any point, you end up searching along all 4 directions, because all the points are close to one another, and that forces backtracking into the other branches of the kd-tree.

Likewise, for a 3-dimensional space we have to search along 8 directions.

To generalize: for d-dimensional data it is 2^d directions,

so the time complexity becomes O(2^d * lg(n)) (see the sketch below).
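To see this effect empirically, here is a minimal sketch (my own, not from the answer above) that times kd-tree queries against brute force as the dimensionality grows, using scipy's cKDTree; the point counts and dimensions are arbitrary choices:

```python
# Sketch: kd-tree vs. brute-force query time as dimensionality grows.
# cKDTree and cdist are real scipy APIs; n, d, and k are arbitrary picks.
import time
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, k = 20_000, 5

for d in (2, 8, 32, 128):
    data = rng.random((n, d))
    queries = rng.random((100, d))

    tree = cKDTree(data)
    t0 = time.perf_counter()
    tree.query(queries, k=k)                  # kd-tree search (with backtracking)
    t_tree = time.perf_counter() - t0

    t0 = time.perf_counter()
    dists = cdist(queries, data)              # full distance matrix
    np.argpartition(dists, k, axis=1)[:, :k]  # brute-force top-k
    t_brute = time.perf_counter() - t0

    print(f"d={d:4d}  kd-tree: {t_tree:.3f}s  brute: {t_brute:.3f}s")
```

At low d the tree is far ahead; as d grows, the backtracking touches more and more branches and brute force catches up.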

Upvotes: 1

Danica

Reputation: 28846

K-d trees don't work too well in high dimensions (where you have to visit lots and lots of tree branches). One rule of thumb is that if your data dimensionality is k, a k-d tree is only going to be any good if you have many more than 2^k data points.
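As a rough way to test that rule of thumb yourself, here is a sketch (using scikit-learn's NearestNeighbors, my choice of tool rather than anything from this answer): fix the dimension and compare kd-tree against brute-force query times as n passes 2^k.

```python
# Sketch of the "many more than 2^k points" rule of thumb.
# d = 16, so 2**16 = 65,536; the dataset sizes are arbitrary picks.
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
d = 16
queries = rng.random((200, d))

for n in (4_096, 65_536, 524_288):
    data = rng.random((n, d))
    for algo in ("kd_tree", "brute"):
        nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(data)
        t0 = time.perf_counter()
        nn.kneighbors(queries)               # query time only, build excluded
        t1 = time.perf_counter()
        print(f"n={n:>8,}  {algo:7s}  {t1 - t0:.3f}s")
```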

In high dimensions, you'll generally want to switch to approximate nearest-neighbor searches instead. If you haven't run across it already, FLANN (on GitHub) is a very useful library for this (with C, C++, Python, and MATLAB APIs); it has good implementations of k-d trees, brute-force search, and several approximate techniques, and it helps you automatically tune their parameters and switch between them easily.
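For illustration, a minimal sketch of querying FLANN through its Python bindings (pyflann); the parameter names follow the FLANN manual, but treat the exact signature as an assumption and check the docs for your installed version:

```python
# Sketch of approximate NN search with pyflann (signature per the FLANN
# manual; verify against your version).
import numpy as np
from pyflann import FLANN

dataset = np.random.rand(10_000, 128).astype(np.float32)
queries = np.random.rand(100, 128).astype(np.float32)

flann = FLANN()
# "autotuned" lets FLANN pick an algorithm (kd-trees, k-means tree, or
# brute force) and tune its parameters for the requested precision.
indices, dists = flann.nn(dataset, queries, num_neighbors=5,
                          algorithm="autotuned", target_precision=0.9)
```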

Upvotes: 4

Has QUIT--Anony-Mousse

Reputation: 77454

It depends on your distance function.

You can't use k-d-trees with arbitrary distance functions. Minkowski norms should be fine though. But in a lot of applications, you will want to use more advanced distance functions.
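As a small illustration (my example, using scikit-learn's KDTree rather than anything from this answer): Minkowski norms are accepted, but the set of usable metrics is fixed, so an arbitrary Python distance function won't work.

```python
# Sketch: k-d trees support Minkowski norms but not arbitrary metrics.
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.rand(1_000, 3)

# Minkowski norms are fine (p=1 Manhattan, p=2 Euclidean, ...):
tree = KDTree(X, metric="minkowski", p=1)
dist, idx = tree.query(X[:5], k=3)

# The list of supported metrics is fixed; arbitrary callables are not in it:
print(KDTree.valid_metrics)
```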

Plus, with increasing dimensionality, k-d-trees work much less well.

The reason is simple: k-d-trees avoid looking at points where the one-dimensional distance to the splitting boundary is already larger than the distance to the best candidate found so far, i.e. for Euclidean distance (where z is the nearest border and y the closest known point):

|x_j - z_j|      <=>   sqrt(sum_i((x_i - y_i)^2))
equivalently, but cheaper:
(x_j - z_j)^2    <=>   sum_i((x_i - y_i)^2)

You can imagine that the chance of this pruning rule holding decreases drastically with the number of dimensions. If you have 100 dimensions, there is next to no chance that a single dimension's squared difference to the boundary will be larger than the full sum of squared differences.
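A quick Monte Carlo sketch of that claim (my own construction: uniform random points stand in for the query, the current best candidate, and the splitting plane), estimating how often the one-dimensional test can prune as d grows:

```python
# Sketch: how often (x_j - z_j)^2 > sum_i((x_i - y_i)^2) holds, estimated
# with uniform random points (an artificial stand-in for real data).
import numpy as np

rng = np.random.default_rng(42)
trials = 100_000

for d in (2, 10, 100):
    x = rng.random((trials, d))         # query points
    y = rng.random((trials, d))         # current best neighbours
    z = rng.random(trials)              # splitting-plane coordinates
    gap2 = (x[:, 0] - z) ** 2           # squared 1-D distance to the boundary
    dist2 = ((x - y) ** 2).sum(axis=1)  # squared distance to the best neighbour
    print(f"d={d:4d}  pruning possible in {np.mean(gap2 > dist2):.4%} of trials")
```

At d=2 the rule fires often; by d=100 it essentially never does, which is why the search degenerates toward a near-linear scan.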

Upvotes: 2
