DBSCAN clustering geolocations beyond the epsilon value

Question

I am trying to analyse some job latitude and longitude data. The nature of the jobs means they tend to happen at similar (although not identical) latitude/longitude locations

In order to reduce the amount of data points to display and analyse I want to cluster the jobs in similar geographical region together. To do this I was using DBSCAN to cluster the jobs and use a job close to the centre of the cluster to act as the representative point.

import pandas as pd, numpy as np
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

def cluster_with_dbscan(jobs, radius_km, min_samples):

    # ignore jobs that are missing Lat Long coords for now
    jobs_cluster = jobs[['Job ID', 'Lat', 'Long']].dropna()

    # run dbscan
    kms_per_radian = 6371.0088
    epsilon = radius_km / kms_per_radian
    coords = jobs_cluster[['Lat', 'Long']].values
    db = DBSCAN(eps=epsilon, min_samples=min_samples, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

    # appending cluster data onto original jobs, preserving jobs that never had location data
    jobs_cluster['Cluster ID'] = db.labels_
    jobs_with_cluster = pd.merge(jobs, jobs_cluster[['Job ID', 'Cluster ID']], how='left', on=['Job ID'])

    # capture cluster data, including centroids
    num_clusters = len(set(db.labels_))
    clusters = pd.Series([coords[db.labels_ == n] for n in range(num_clusters)])

    def get_centermost_point(cluster):
        if len(cluster) == 0:
            return tuple([None,None])
        centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
        centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
        return tuple(centermost_point)

    centermost_points = clusters.map(get_centermost_point)

    lats, lons = zip(*centermost_points)
    clusters = pd.DataFrame({'Lat':lats, 'Long':lons}).reset_index().rename(columns={'index':'Cluster ID'})

    return jobs_with_cluster,clusters

run with

radius_km = 2
min_samples = 1 //want to keep outliers
jobs,clusters = cluster_with_dbscan(jobs, radius_km , min_samples )

When running I do get clustered data, but the clusters contain jobs that are far more than 2km apart (some clusters have jobs spanning 100s of kilometres). From my understanding of DBSCAN they should only be at most 2km from the core point

Is my understanding of DBSCAN wrong? can clusters cover areas greater than equivalent epsilon value? If so is there a more appropriate cluster algorithm?

Or is my implementation of DBSCAN flawed in someway?

Alex · Accepted Answer

You understanding of DBSCAN is wrong

DBSCAN will cluster by chaining items together to form a larger continuous group where that is at least min_samples number of points radius_km away. Note that not all points of the cluster need to be radius_km away from the centroid

So in your example two points from the same cluster can be 100km apart, as long as there is chain of at least 50 points each at most 2km apart between them

An algorithm that more closely fits your requirements is Canopy Clustering. Unfortunatly it is not included in most of the libraries, but a simple implementation can be found at https://gist.github.com/gdbassett/528d816d035f2deaaca1. If you want items to belong only a single cluster then set T1=T2=2km

DBSCAN clustering geolocations beyond the epsilon value

Answers (1)

Related Questions