C++ Data Structure for k Nearest Neighbour Search in D dimension using only point cloud as query points

Question

I have a point cloud of N points in D-dimensional space with periodic boundary conditions, where N can range from 500 to 10^8 and D can range from 1 to 20. The distribution of points varies wildly, from completely uniform to very clumped together. For each point in the point cloud I need to find the k nearest neighbours to that point. I also need to find how many points exist within a distance of each point, specifically the maxnorm distance. I don't need to know which points are within the radius, just how many, but it would be a nice addition.

I've tried kd-trees, but they don't handle the wrapping boundaries, and for the larger trees, duplication is not feasible. Additionally, it gets slow at higher dimensions.

I've just come across Vantage Point Trees, and tried out some code, but it is slower than the kd-tree. Although the code I found uses a recursive search method, with no batching. One the positive side, it can natively handle the wrapping conditions, and as such doesn't require duplication.

I'm about to see if I can squeeze some more performance out of the VP tree by converting to an iterative approach and seeing if I could batch search, but I had a thought. All these data structures work for finding nearest neighbours to arbitrary query points, while my query points are restricted to being points in the point cloud. I figure this restriction might allow for some more performant structure (maybe a nav-mesh of sorts?). I tried searching for structures that would handle this, but my google-fu is failing me. So just wondering if anyone knows of a data structure that can handle the following:

Handle a small and large number of points, i.e 500-10^8 points
Handle up to 20 dimensions
Work with periodic boundaries (i.e., a flat torus)
Work with maxnorm distance (soft requirement. Euclidean can give me a potential list which I can manually cull, but maxnorm would be preferred)
Can find k-NN to query point as well as find how many points exist with distance to query point
Query points are only points in the structure, not arbitrary points
Queries can be batched. i.e I need to find the k-th NN for every point in the point cloud. I also need to find how many points exist within d[i] for each point i. That is each point has a different search radius.
Doesn't need to support insert or deletion.

Thanks

SpamBot · Accepted Answer

I doubt that there is a complete and definite answer to your very complex problem, so I just share my thoughts. Your problem specification combines a number of things that don't work well together (high dimension, non-Euclidean metric, completely different types of queries). If an algorithm has to assume the generic case, it is necessarily slow.

Let's first sort out the special cases where good data structures are known.

If your dimension is 1, use a sorted map.
If your dimension is 2-3 (perhaps even 4), sorted lookups and geographic databases should be optimal. https://en.wikipedia.org/wiki/R-tree
If your points have a higher dimension but very strong correlation, dimensionality reduction can possibly map your point cloud to one that has such low dimension and reduce the problem to an easy one. https://en.wikipedia.org/wiki/Dimensionality_reduction
If your number of points is lower than 10^6, brute force is cheapest. Just calculate the distance with your metric for all points, then do a partial sort for k results. These simple cache-coherent calculations are faster than using tree structures. http://en.cppreference.com/w/cpp/algorithm/partial_sort
If your k is bounded, say k <= 20, and you optimize for query times, precompute a table with all results.
If only few of your dimensions are periodic, I think you should adapt the kd-tree algorithm to handle them (adding more complex comparison nodes for those dimensions similar to those in Vantage Point Trees).

If all this does not apply (if you have a practical application in mind, please share with us), you case is very generic.

In addition to the algorithms you mentioned, you should also try out Geometric Near-Neighbor Access trees (GNAT). http://infolab.stanford.edu/~sergey/near.html They apply to generic metrics (including yours) and also handle non-uniform distributions.

Also, I think your expectations are very high. You could compare to a good kd-tree implementation (for instance, https://github.com/mariusmuja/flann) that solves the problem with a Euclidean metric only. If that takes long, you should not expect more general metrics to solve faster.

Admittedly, more generic method cannot use your constraint that queries are points in the cloud. I would be very interested in if there is any such solution.

C++ Data Structure for k Nearest Neighbour Search in D dimension using only point cloud as query points

Answers (2)

Related Questions