user3734029
user3734029

Reputation: 85

C++ Data Structure for k Nearest Neighbour Search in D dimension using only point cloud as query points

I have a point cloud of N points in D-dimensional space with periodic boundary conditions, where N can range from 500 to 10^8 and D can range from 1 to 20. The distribution of points varies wildly, from completely uniform to very clumped together. For each point in the point cloud I need to find the k nearest neighbours to that point. I also need to find how many points exist within a distance of each point, specifically the maxnorm distance. I don't need to know which points are within the radius, just how many, but it would be a nice addition.

I've tried kd-trees, but they don't handle the wrapping boundaries, and for the larger trees, duplication is not feasible. Additionally, it gets slow at higher dimensions.

I've just come across Vantage Point Trees, and tried out some code, but it is slower than the kd-tree. Although the code I found uses a recursive search method, with no batching. One the positive side, it can natively handle the wrapping conditions, and as such doesn't require duplication.

I'm about to see if I can squeeze some more performance out of the VP tree by converting to an iterative approach and seeing if I could batch search, but I had a thought. All these data structures work for finding nearest neighbours to arbitrary query points, while my query points are restricted to being points in the point cloud. I figure this restriction might allow for some more performant structure (maybe a nav-mesh of sorts?). I tried searching for structures that would handle this, but my google-fu is failing me. So just wondering if anyone knows of a data structure that can handle the following:

Thanks

Upvotes: 1

Views: 2468

Answers (2)

TilmannZ
TilmannZ

Reputation: 1889

If Java is an option (performance is similar to C++ nowadays), have a look at the ELKI library. It provides implementations of numerous multi-dimensional indexes, including approaches to dimensionality reduction and space-filling curves. It provides also numerous algorithms for kNN (euclidic/non-euclidic), cluster-detection, range-queries, and so on (you can usually define your own query filter with custom distance metric). For kNN I can specifically recommend the CoverTree and (a bit slower, but more general purpose) the PH-Tree, I tested both with up to 27 dimensions. The PH-Tree is especially suitable to highly clustered and large datasets (I tested over 100,000,000 points). (Disclaimer: The PH-Tree is based on my own research, but I think your use case is an excellent fit.)

However to my knowledge, none of these approaches allows special optimisations as you proposed.

Upvotes: 0

SpamBot
SpamBot

Reputation: 1458

I doubt that there is a complete and definite answer to your very complex problem, so I just share my thoughts. Your problem specification combines a number of things that don't work well together (high dimension, non-Euclidean metric, completely different types of queries). If an algorithm has to assume the generic case, it is necessarily slow.

Let's first sort out the special cases where good data structures are known.

  • If your dimension is 1, use a sorted map.
  • If your dimension is 2-3 (perhaps even 4), sorted lookups and geographic databases should be optimal. https://en.wikipedia.org/wiki/R-tree
  • If your points have a higher dimension but very strong correlation, dimensionality reduction can possibly map your point cloud to one that has such low dimension and reduce the problem to an easy one. https://en.wikipedia.org/wiki/Dimensionality_reduction
  • If your number of points is lower than 10^6, brute force is cheapest. Just calculate the distance with your metric for all points, then do a partial sort for k results. These simple cache-coherent calculations are faster than using tree structures. http://en.cppreference.com/w/cpp/algorithm/partial_sort
  • If your k is bounded, say k <= 20, and you optimize for query times, precompute a table with all results.
  • If only few of your dimensions are periodic, I think you should adapt the kd-tree algorithm to handle them (adding more complex comparison nodes for those dimensions similar to those in Vantage Point Trees).

If all this does not apply (if you have a practical application in mind, please share with us), you case is very generic.

In addition to the algorithms you mentioned, you should also try out Geometric Near-Neighbor Access trees (GNAT). http://infolab.stanford.edu/~sergey/near.html They apply to generic metrics (including yours) and also handle non-uniform distributions.

Also, I think your expectations are very high. You could compare to a good kd-tree implementation (for instance, https://github.com/mariusmuja/flann) that solves the problem with a Euclidean metric only. If that takes long, you should not expect more general metrics to solve faster.

Admittedly, more generic method cannot use your constraint that queries are points in the cloud. I would be very interested in if there is any such solution.

Upvotes: 3

Related Questions