Reputation: 3345
I am trying to analyse some job latitude and longitude data. The nature of the jobs means they tend to happen at similar (although not identical) latitude/longitude locations
In order to reduce the amount of data points to display and analyse I want to cluster the jobs in similar geographical region together. To do this I was using DBSCAN to cluster the jobs and use a job close to the centre of the cluster to act as the representative point.
import pandas as pd, numpy as np
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
def cluster_with_dbscan(jobs, radius_km, min_samples):
# ignore jobs that are missing Lat Long coords for now
jobs_cluster = jobs[['Job ID', 'Lat', 'Long']].dropna()
# run dbscan
kms_per_radian = 6371.0088
epsilon = radius_km / kms_per_radian
coords = jobs_cluster[['Lat', 'Long']].values
db = DBSCAN(eps=epsilon, min_samples=min_samples, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
# appending cluster data onto original jobs, preserving jobs that never had location data
jobs_cluster['Cluster ID'] = db.labels_
jobs_with_cluster = pd.merge(jobs, jobs_cluster[['Job ID', 'Cluster ID']], how='left', on=['Job ID'])
# capture cluster data, including centroids
num_clusters = len(set(db.labels_))
clusters = pd.Series([coords[db.labels_ == n] for n in range(num_clusters)])
def get_centermost_point(cluster):
if len(cluster) == 0:
return tuple([None,None])
centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
return tuple(centermost_point)
centermost_points = clusters.map(get_centermost_point)
lats, lons = zip(*centermost_points)
clusters = pd.DataFrame({'Lat':lats, 'Long':lons}).reset_index().rename(columns={'index':'Cluster ID'})
return jobs_with_cluster,clusters
run with
radius_km = 2
min_samples = 1 //want to keep outliers
jobs,clusters = cluster_with_dbscan(jobs, radius_km , min_samples )
When running I do get clustered data, but the clusters contain jobs that are far more than 2km apart (some clusters have jobs spanning 100s of kilometres). From my understanding of DBSCAN they should only be at most 2km from the core point
Is my understanding of DBSCAN wrong? can clusters cover areas greater than equivalent epsilon value? If so is there a more appropriate cluster algorithm?
Or is my implementation of DBSCAN flawed in someway?
Upvotes: 1
Views: 30
Reputation: 3345
You understanding of DBSCAN is wrong
DBSCAN will cluster by chaining items together to form a larger continuous group where that is at least min_samples
number of points radius_km
away. Note that not all points of the cluster need to be radius_km
away from the centroid
So in your example two points from the same cluster can be 100km apart, as long as there is chain of at least 50 points each at most 2km apart between them
An algorithm that more closely fits your requirements is Canopy Clustering. Unfortunatly it is not included in most of the libraries, but a simple implementation can be found at https://gist.github.com/gdbassett/528d816d035f2deaaca1. If you want items to belong only a single cluster then set T1=T2=2km
Upvotes: 0