Reputation: 21
I have more than 400 thousand car GPS locations, like:
[ 25.41452217, 37.94879532],
[ 25.33231735, 37.93455887],
[ 25.44327736, 37.96868896],
...
I need to do spatial clustering with a distance between points of at most 3 meters.
I tried to use DBSCAN, but it does not seem to work for geographic (longitude, latitude) coordinates.
Also, I do not know the number of clusters.
Upvotes: 2
Views: 6330
Reputation: 111
You can use pairwise_distances to compute geographic distances from latitude/longitude, then pass the resulting distance matrix to DBSCAN by specifying metric='precomputed'.
To calculate the distance matrix:
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import DBSCAN
from geopy.distance import vincenty  # in geopy >= 2.0 use geopy.distance.geodesic instead

def distance_in_meters(x, y):
    # x and y are (latitude, longitude) pairs; return the geodesic distance in meters
    return vincenty((x[0], x[1]), (y[0], y[1])).m

# sample is the (n_points, 2) array of coordinates
distance_matrix = pairwise_distances(sample, metric=distance_in_meters)
To run DBSCAN using the matrix:
# eps=3 means 3 meters here, because the precomputed distances are in meters
dbscan = DBSCAN(metric='precomputed', eps=3, min_samples=10)
dbscan.fit(distance_matrix)
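The cluster assignment for each point is then available on the fitted estimator, for example:

labels = dbscan.labels_  # one cluster id per point; noise points are labelled -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)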
Hope this helps.
Gengyu
Upvotes: 4
Reputation: 8725
DBSCAN is a reasonable choice, but you may get better results with a hierarchical clustering algorithm such as OPTICS or HDBSCAN*.
I did a blog post some time ago on clustering 23 million Tweet locations:
http://www.vitavonni.de/blog/201410/2014102301-clustering-23-mio-tweet-locations.html
Here is another blog about clustering GPS points. The author uses a very similar approach and goes into much more detail:
https://doublebyteblog.wordpress.com/
In essence, OPTICS works well for such data, and you really need to use an index such as the R*-tree or Cover tree in ELKI. Both work with Haversine distance and are really fast.
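ELKI itself is a Java tool; as a rough Python sketch of the same idea (index-accelerated density clustering with the Haversine distance), you could use scikit-learn's OPTICS with a ball-tree index. The array name coords, the min_samples value, and the mean Earth radius used to turn the 3 meter threshold into an angle are assumptions here, not part of the original answer:

import numpy as np
from sklearn.cluster import OPTICS

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius, used to convert meters to radians

# coords is assumed to be an (n_points, 2) array of [latitude, longitude] in degrees
coords_rad = np.radians(coords)

# scikit-learn's haversine metric works on radians and returns angular distances,
# so express the 3 meter threshold as an angle
max_eps = 3.0 / EARTH_RADIUS_M

optics = OPTICS(min_samples=10, max_eps=max_eps,
                metric='haversine', algorithm='ball_tree')  # ball tree supports haversine
labels = optics.fit_predict(coords_rad)  # -1 marks noise points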
Upvotes: 2