Reputation: 11
I am interested in finding the distribution of nearest-neighbor cluster distances in a spatial data set (lon, lat). My clustering criterion is simple: two points that are next to each other belong to the same cluster, and the minimum number of points in a cluster is one. To do this I am using sklearn.cluster.DBSCAN. After clustering, I want to find, for each cluster, the distance to its closest cluster, and that's where I am having problems. Everything I have found calculates the nearest-neighbor distance between the centroids of the clusters, and I want to use the boundaries instead.
At the moment I take all the points from one cluster, calculate the distance from every point of this cluster to every point of the remaining clusters, and finally take the minimum distance. However, as you can imagine, this is very inefficient and the calculation takes forever.
Does anyone know how to do this properly?
Upvotes: 1
Views: 1060
Reputation: 77454
Use the nearest-neighbor classifier.
But on all points, not the cluster centers!
Sklearn has utility functions that can make finding the nearest neighbor faster than computing all distances, for example using a kd-tree.
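As a rough sketch of that idea (not a tuned implementation): for each cluster, build a kd-tree over all points outside the cluster, query it with the cluster's own points, and keep the smallest distance. The helper name `nearest_cluster_distances`, the toy coordinates, and the `eps` value are made up for illustration. It also assumes the coordinates are already projected to a plane, since `KDTree` uses Euclidean distance here; for raw lon/lat, `sklearn.neighbors.BallTree` with `metric="haversine"` (inputs in radians, ordered lat, lon) would be the closer fit.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KDTree


def nearest_cluster_distances(points, labels):
    """For each cluster label, return the smallest point-to-point distance
    to any point that belongs to a different cluster (hypothetical helper)."""
    result = {}
    for label in np.unique(labels):
        inside = labels == label
        if inside.all():
            # Only one cluster exists, so there is nothing to measure against.
            result[int(label)] = float("inf")
            continue
        # Tree over every point that is NOT in the current cluster.
        tree = KDTree(points[~inside])
        # Nearest outside point for each member of the cluster;
        # the minimum over members is the boundary-to-boundary gap.
        dist, _ = tree.query(points[inside], k=1)
        result[int(label)] = float(dist.min())
    return result


# Toy data in planar (already projected) coordinates; the values are made up.
points = np.array([
    [0.0, 0.0], [0.0, 1.0],   # first group
    [5.0, 0.0], [5.0, 1.0],   # second group
    [0.0, 10.0],              # isolated point, its own cluster
])
labels = DBSCAN(eps=1.5, min_samples=1).fit_predict(points)
gaps = nearest_cluster_distances(points, labels)
```

This still builds one tree per cluster, so with very many small clusters the tree construction itself can dominate; it mainly avoids the full all-pairs distance computation for each cluster.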
Upvotes: 0