Reputation: 1905
I've been looking at Geoff Boeing's excellent blog posts on DBSCAN. The page I'm most interested in is -
http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
How can I amend this approach to return the center of the largest cluster (the cluster center surrounded by the most lat/lng points)? Is there a density rating associated with the center point of each cluster?
The core DBSCAN -
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=.01, min_samples=1).fit(coordinates)
labels = db.labels_
num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
clusters = pd.Series([coordinates[labels == i] for i in range(num_clusters)])
print('Number of clusters: %d' % num_clusters)
Upvotes: 4
Views: 5979
Reputation: 77454
Unfortunately, that blog post is wrong on a number of key points.
1. Never use DBSCAN with min_samples=1. That is single-linkage clustering. If you want single-linkage, use single-linkage, not DBSCAN. Leader clustering may be a good choice here, too.
2. Choose eps wisely. In his example, he chose eps so small that he mostly removed (near-) duplicates...
3. DBSCAN clusters do not have a meaningful center, because they can be non-convex. In particular, the center needs to take haversine distance into account, which he doesn't. The first version used the mean; the new version uses the point closest to the mean (which may still be distorted, because the mean didn't take the Earth into account).
4. On latitude/longitude you should be using great-circle distance during clustering already, not only afterwards. (Fixed in the blog by now.)
Point 3 above also answers your question: DBSCAN clusters may not have a meaningful center, and the center can be outside the cluster.
Since the original post, some points (in particular #4) have been improved. DBSCAN now actually uses haversine distance and a ball tree index.
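A minimal sketch of clustering lat/lng with haversine distance and a ball tree, as described above. The sample coordinates and the 1.5 km epsilon are my own assumptions for illustration; note that scikit-learn's haversine metric expects radians, so both the points and eps must be converted:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical sample data: (lat, lon) pairs in degrees
coordinates = np.array([
    [40.7128, -74.0060], [40.7130, -74.0055], [40.7127, -74.0058],
    [34.0522, -118.2437], [34.0525, -118.2440],
])

kms_per_radian = 6371.0088  # mean Earth radius in km
epsilon_km = 1.5            # cluster radius in km (assumed value)

# haversine works on radians, so convert the points and express eps in radians;
# min_samples > 1 avoids the degenerate single-linkage behavior
db = DBSCAN(
    eps=epsilon_km / kms_per_radian,
    min_samples=2,
    algorithm='ball_tree',
    metric='haversine',
).fit(np.radians(coordinates))

labels = db.labels_
num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Number of clusters: %d' % num_clusters)
```

With these sample points the three New York coordinates and the two Los Angeles coordinates each form one cluster.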
Upvotes: 1
Reputation: 19
I'm working on a similar project and using his blog post as a guide too. The logic to return the center of the largest cluster (but be aware, the center itself may be meaningless with DBSCAN) is: sort the clusters by size, take the largest, and calculate the centroid (using the logic provided in that blog post). Then you have a choice: you can either keep that calculated centroid as the "center point", or you can find the point in the cluster nearest to that centroid (as the author of that blog post seems to do).
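The steps above can be sketched as follows. This is my own illustration, not the blog's code: it assumes `coordinates` is an (n, 2) NumPy array of (lat, lon) pairs and `labels` comes from a fitted DBSCAN, and it uses a crude Euclidean nearest-to-centroid search rather than great-circle distance:

```python
import numpy as np

def largest_cluster_center(coordinates, labels):
    """Return (nearest point to centroid, size) for the largest cluster.

    coordinates: (n, 2) array of (lat, lon) in degrees
    labels: DBSCAN cluster labels; noise (-1) is ignored
    """
    valid = labels != -1
    # pick the label with the most members
    unique, counts = np.unique(labels[valid], return_counts=True)
    biggest = unique[np.argmax(counts)]
    members = coordinates[labels == biggest]
    centroid = members.mean(axis=0)
    # member closest to the centroid (plain Euclidean on lat/lon;
    # adequate for small clusters away from the poles/antimeridian)
    nearest = members[np.argmin(((members - centroid) ** 2).sum(axis=1))]
    return nearest, len(members)

# Hypothetical usage: three clustered points plus one noise point
coords = np.array([[40.0, -74.0], [40.1, -74.1], [40.2, -74.0], [10.0, 10.0]])
labels = np.array([0, 0, 0, -1])
center, size = largest_cluster_center(coords, labels)
print(center, size)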
Contrary to another responder, that blog post is not wrong on a number of points:
Most importantly, the results turn out exactly as they're supposed to in the blog post.
Upvotes: 1
Reputation: 157
If you are interested in representing the largest cluster as a 'central' point (e.g., dimensionality reduction), I would do the following:
Find the cluster with the greatest number of classified points:
# Assumes coordinates is a DataFrame
db = DBSCAN(eps=eps, min_samples=min_samples).fit(coordinates)
df = pd.DataFrame([coordinates.x, coordinates.y, db.labels_]).T  # Add other attributes of coordinates if needed
df.columns = ['x', 'y', 'label']  # Add column names
max_label = df.label.mode()[0]
max_cluster = df[df['label'] == max_label]
You can then take the mean of each column:
max_cluster_array = max_cluster[['x', 'y']].to_numpy()  # as_matrix() was removed in newer pandas
print(max_cluster_array.mean(axis=0))  # what you are looking for
You could also look into multivariate kernel density estimation functions if you are interested in estimating a more robust 'central' point.
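One way to do that, sketched with SciPy's `gaussian_kde` (my own example data standing in for `max_cluster_array` above): evaluate the estimated density at every cluster member and take the highest-density member as the "central" point, which is less sensitive to outliers than the mean.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical cluster points (x, y); stands in for max_cluster_array above
max_cluster_array = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.05, 1.0], [3.0, 3.0],
])

# gaussian_kde expects shape (n_dims, n_points), hence the transposes
kde = gaussian_kde(max_cluster_array.T)
densities = kde(max_cluster_array.T)

# the member with the highest estimated density is a mode-like center
central_point = max_cluster_array[np.argmax(densities)]
print(central_point)
```

Here the outlier at (3.0, 3.0) pulls the mean away from the tight group, but the highest-density member stays inside the group.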
Upvotes: 1