M. Smith

Reputation: 21

Clustering longitude and latitude gps data

I have more than 400 thousand GPS locations of cars, for example:

[ 25.41452217,  37.94879532],
[ 25.33231735,  37.93455887],
[ 25.44327736,  37.96868896],
... 

I need to cluster these points spatially, using a distance threshold of 3 meters between points.
I tried DBSCAN, but it does not seem to work with geographic (longitude, latitude) coordinates.

Also, I do not know the number of clusters.

Upvotes: 2

Views: 6330

Answers (2)

Gengyu Shi

Reputation: 111

You can use pairwise_distances to compute geographic distances from the latitude/longitude pairs, and then pass the resulting distance matrix to DBSCAN by specifying metric='precomputed'.

To calculate the distance matrix:

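from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import DBSCAN
# vincenty was removed in geopy 2.0; geodesic is its replacement
from geopy.distance import geodesic

def distance_in_meters(x, y):
    # geopy expects (latitude, longitude); swap the columns first
    # if the data is stored as (longitude, latitude)
    return geodesic((x[0], x[1]), (y[0], y[1])).m

# sample: array of shape (n_points, 2) holding the GPS coordinates
distance_matrix = pairwise_distances(sample, metric=distance_in_meters)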

To run DBSCAN using the matrix:

dbscan = DBSCAN(metric='precomputed', eps=3, min_samples=10)
dbscan.fit(distance_matrix)
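
For 400 thousand points a full distance matrix has n² entries (on the order of a terabyte in float64), so it may be worth skipping the precomputed matrix. The following is a minimal sketch, not part of the original answer, that uses scikit-learn's built-in 'haversine' metric with a ball tree instead. It assumes the points are given as (latitude, longitude) in degrees and approximates the Earth as a sphere of radius 6,371 km. The helper name cluster_gps and the sample coordinates are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, a spherical approximation


def cluster_gps(points_lat_lon, eps_m=3.0, min_samples=10):
    """Cluster (latitude, longitude) points given in degrees with DBSCAN.

    Hypothetical helper: scikit-learn's 'haversine' metric expects
    [latitude, longitude] in radians and returns distances in radians,
    so eps is converted from meters by dividing by the Earth radius.
    """
    coords_rad = np.radians(np.asarray(points_lat_lon, dtype=float))
    model = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",  # tree index, avoids the n x n matrix
    )
    return model.fit_predict(coords_rad)


# Illustrative data: three points about a meter apart, one far away.
points = np.array([
    [37.948795, 25.414522],
    [37.948800, 25.414525],
    [37.948790, 25.414520],
    [37.934559, 25.332317],
])
labels = cluster_gps(points, eps_m=3.0, min_samples=2)
print(labels)  # the label -1 marks noise points
```

If the data is stored as (longitude, latitude), the columns need to be swapped before calling the helper, for example with points[:, ::-1].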

Hope this helps.

Gengyu

Upvotes: 4

Erich Schubert

Reputation: 8725

DBSCAN is a reasonable choice, but you may get better results with a hierarchical clustering algorithm such as OPTICS or HDBSCAN*.

I did a blog post some time ago on clustering 23 million Tweet locations:

http://www.vitavonni.de/blog/201410/2014102301-clustering-23-mio-tweet-locations.html

There is also a blog about clustering GPS points. The author uses a very similar approach and gives much more detail:

https://doublebyteblog.wordpress.com/

In essence, OPTICS works well for such data, and you really need to use an index such as the R*-tree or Cover tree in ELKI. Both work with Haversine distance and are really fast.
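
For readers unfamiliar with the Haversine distance mentioned above, here is a minimal sketch in plain Python. It is not ELKI's implementation. It assumes a spherical Earth with a mean radius of 6,371 km, and the function name haversine_m is chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of latitude is roughly 111 km on this sphere.
one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
```

At a 3 meter threshold the spherical approximation is usually adequate, since the error of the Haversine formula relative to an ellipsoidal model is a small fraction of the distance itself.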

Upvotes: 2
