Reputation: 137
I have a Pandas dataframe of towers, like:
site      lat      lon
18ALOP01  11.1278  14.3578
18ALOP02  11.1278  14.3578
18ALOP12  11.1288  14.3575
18PENO01  11.1580  14.2898
And I need to group them if they are too close (within 50 m). So I wrote a script that performs a "self cross join", computes the distance between every pair of sites, and assigns the same id to those whose distance is below the threshold. With n sites it evaluates n^2 - n combinations, so it is a poor algorithm. Is there a better way of doing this?
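The cross-join approach described above can be sketched like this (the equirectangular distance formula is an approximation, but it is fine at a 50 m threshold; column names match the sample dataframe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "site": ["18ALOP01", "18ALOP02", "18ALOP12", "18PENO01"],
    "lat":  [11.1278, 11.1278, 11.1288, 11.1580],
    "lon":  [14.3578, 14.3578, 14.3575, 14.2898],
})

# Self cross join: every site paired with every other site (n^2 rows)
pairs = df.merge(df, how="cross", suffixes=("_a", "_b"))
# Keep each unordered pair only once
pairs = pairs[pairs["site_a"] < pairs["site_b"]].copy()

# Approximate distance in meters (equirectangular approximation)
R = 6371000.0  # Earth radius in meters
lat0 = np.radians(df["lat"].mean())
dx = np.radians(pairs["lon_b"] - pairs["lon_a"]) * np.cos(lat0) * R
dy = np.radians(pairs["lat_b"] - pairs["lat_a"]) * R
pairs["dist_m"] = np.hypot(dx, dy)

close = pairs[pairs["dist_m"] < 50]
print(close[["site_a", "site_b", "dist_m"]])
```

On the sample data only 18ALOP01/18ALOP02 fall within 50 m, but the `pairs` frame itself is the quadratic blow-up the question is about.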
Upvotes: 2
Views: 1020
Reputation: 49768
Assuming the number and the "true" locations of the sites are unknown, you could try the MeanShift clustering algorithm. While it is a general-purpose algorithm and not highly scalable, it will be faster than implementing your own clustering algorithm in Python, and you could experiment with bin_seeding=True
as an optimization, if binning datapoints into a grid is an acceptable shortcut to prune the starting seeds. (Note: if binning datapoints into a grid, rather than computing Euclidean distances between points, is an acceptable "full" solution, that would likely be the fastest approach to your problem.)
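If grid binning is acceptable as the full solution, a minimal sketch of that idea (cell size of 50 m is an assumption; note the caveat that two points just either side of a cell boundary end up in different groups):

```python
import numpy as np

# Points in meters; the two left points and the two right points
# are each within 50 m of one another
X = np.array([[0, 1], [10, 1], [100, 1], [110, 1]], dtype=float)

cell = 50.0  # grid resolution in meters
# Assign each point to a grid cell; points in the same cell share a label
bins = np.floor(X / cell).astype(int)

# Label cells in order of first appearance
seen = {}
labels = [seen.setdefault(tuple(b), len(seen)) for b in bins]
print(labels)  # -> [0, 0, 1, 1]
```

This is O(n), which is why it is the fastest option, at the cost of the boundary effect mentioned above (e.g. points at x=49 and x=51 would be split).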
Here's an example using scikit-learn's implementation of MeanShift, where the x/y coordinates are in meters and the algorithm creates clusters with a radius of 50 m.
In [2]: from sklearn.cluster import MeanShift
In [3]: import numpy as np
In [4]: X = np.array([
...: [0, 1], [51, 1], [100, 1], [151, 1],
...: ])
In [5]: clustering = MeanShift(bandwidth=50).fit(X) # OR speed up with bin_seeding=True
In [6]: print(clustering.labels_)
[1 0 0 2]
In [7]: print(clustering.cluster_centers_)
[[ 75.5 1. ]
[ 0. 1. ]
[151. 1. ]]
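To apply this to the dataframe in the question, the lat/lon degrees first need to be converted to approximate meters; one way to do that is an equirectangular projection (the projection choice and column names are assumptions, but any locally metric projection works at this scale):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift

df = pd.DataFrame({
    "site": ["18ALOP01", "18ALOP02", "18ALOP12", "18PENO01"],
    "lat":  [11.1278, 11.1278, 11.1288, 11.1580],
    "lon":  [14.3578, 14.3578, 14.3575, 14.2898],
})

# Project degrees to meters (equirectangular approximation,
# accurate enough over a ~50 m scale)
R = 6371000.0  # Earth radius in meters
lat0 = np.radians(df["lat"].mean())
X = np.column_stack([
    np.radians(df["lon"]) * np.cos(lat0) * R,  # x in meters
    np.radians(df["lat"]) * R,                 # y in meters
])

df["group"] = MeanShift(bandwidth=50).fit(X).labels_
print(df)
```

Here 18ALOP01 and 18ALOP02 (identical coordinates) share a group, while 18ALOP12 (roughly 116 m away) and 18PENO01 each get their own.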
Upvotes: 2