Krogiar

Reputation: 137

Group GPS points with Pandas

I have a Pandas dataframe of towers, like:

site       lat      lon
18ALOP01   11.1278  14.3578
18ALOP02   11.1278  14.3578
18ALOP12   11.1288  14.3575
18PENO01   11.1580  14.2898

I need to group them if they are too close together (within 50 m). I wrote a script that performs a "self cross join", calculates the distance between every combination of sites, and assigns the same id to those whose distance is below a threshold. With n sites, it calculates n^2 - n combinations, so the algorithm scales poorly. Is there a better way of doing this?
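For reference, the pairwise approach described above can be sketched roughly as follows. This is only an illustration, not the original script: the haversine helper, the union-find grouping, and the names `haversine_m`, `find`, and `THRESHOLD_M` are assumptions. Iterating over unordered pairs halves the work compared with a full cross join, but the cost is still quadratic in the number of sites.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (assumed constant)
THRESHOLD_M = 50.0          # grouping distance from the question

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Sample rows from the question: (site, lat, lon)
sites = [
    ("18ALOP01", 11.1278, 14.3578),
    ("18ALOP02", 11.1278, 14.3578),
    ("18ALOP12", 11.1288, 14.3575),
    ("18PENO01", 11.1580, 14.2898),
]

# Union-find: every site starts in its own group.
parent = list(range(len(sites)))

def find(i):
    """Return the representative index of the group containing i."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

# Compare every unordered pair: n * (n - 1) / 2 distance computations.
for i, j in combinations(range(len(sites)), 2):
    _, lat_i, lon_i = sites[i]
    _, lat_j, lon_j = sites[j]
    if haversine_m(lat_i, lon_i, lat_j, lon_j) < THRESHOLD_M:
        parent[find(i)] = find(j)  # merge the two groups

labels = [find(i) for i in range(len(sites))]
```

Sites that share a label belong to the same group.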

Upvotes: 2

Views: 1020

Answers (1)

Garrett

Reputation: 49768

Assuming the number and the "true" location of sites are unknown, you could try the MeanShift clustering algorithm. It is a general-purpose algorithm and not highly scalable, but it will be faster than implementing your own clustering algorithm in Python. You could also experiment with bin_seeding=True as an optimization, if binning data points into a grid is an acceptable shortcut to prune the starting seeds. (Note: if binning data points to a grid, rather than computing the Euclidean distance between points, is an acceptable "full" solution, that seems like it would be the fastest approach to your problem.)
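To make the grid-binning idea concrete, here is a minimal sketch. It assumes the coordinates are already in meters and that a 50 m cell is an acceptable stand-in for the 50 m threshold; the sample points and the name `CELL_M` are hypothetical.

```python
from collections import defaultdict
from math import floor

CELL_M = 50.0  # grid cell size in meters (assumed equal to the threshold)

# Hypothetical points, already projected to meters as (x, y).
points = [(10.0, 10.0), (20.0, 30.0), (120.0, 15.0), (49.0, 10.0), (51.0, 10.0)]

# Map each point to the integer grid cell that contains it.
cells = defaultdict(list)
for idx, (x, y) in enumerate(points):
    key = (floor(x / CELL_M), floor(y / CELL_M))
    cells[key].append(idx)

# Each cell's list of point indices is one group.
groups = list(cells.values())
print(groups)  # [[0, 1, 3], [2], [4]]
```

This is a single pass over the data, but it is only an approximation: the last two points are 2 m apart and still land in different cells because they straddle a cell boundary.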

Here's an example using scikit-learn's implementation of MeanShift, where the x/y coordinates are in meters and the algorithm creates clusters with a radius of 50 m.

In [2]: from sklearn.cluster import MeanShift

In [3]: import numpy as np

In [4]: X = np.array([
   ...:     [0, 1], [51, 1], [100, 1], [151, 1],
   ...: ])

In [5]: clustering = MeanShift(bandwidth=50).fit(X)  # OR speed up with bin_seeding=True

In [6]: print(clustering.labels_)
[1 0 0 2]

In [7]: print(clustering.cluster_centers_)
[[ 75.5   1. ]
 [  0.    1. ]
 [151.    1. ]]
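The example above assumes coordinates in meters, while the question's data is in lat/lon degrees. One possible way to bridge that gap is an equirectangular projection around the mean location, which should be a reasonable approximation over an area a few kilometers across. This is a sketch under that assumption; `to_local_meters` and `EARTH_RADIUS_M` are hypothetical names, not part of scikit-learn or pandas.

```python
from math import cos, radians

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (assumed constant)

def to_local_meters(latlon):
    """Project (lat, lon) pairs in degrees to [x, y] in meters.

    Uses an equirectangular approximation centered on the mean
    latitude and longitude of the input points.
    """
    lat0 = sum(lat for lat, _ in latlon) / len(latlon)
    lon0 = sum(lon for _, lon in latlon) / len(latlon)
    k = cos(radians(lat0))  # shrink east-west distances by latitude
    return [
        [EARTH_RADIUS_M * radians(lon - lon0) * k,
         EARTH_RADIUS_M * radians(lat - lat0)]
        for lat, lon in latlon
    ]

# (lat, lon) rows from the question
sites = [
    (11.1278, 14.3578),
    (11.1278, 14.3578),
    (11.1288, 14.3575),
    (11.1580, 14.2898),
]

X = to_local_meters(sites)
```

The resulting `X` is a list of `[x, y]` rows in meters and can be passed to `MeanShift(bandwidth=50).fit(X)` in place of the hand-written array.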

Upvotes: 2
