farrellmr
farrellmr

Reputation: 1905

Using DBSCAN to find the most dense cluster?

Ive been looking at Geoff Boeing's excellent blog posts on DBSCAN. The page I'm most interested in is -

http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/

How can I amend this approach to return the center of the largest cluster(cluster center with the surrounded by the most lat/lng points)? Is there a density rating associated with the center point of each cluster?

The core dbscan -

db = DBSCAN(eps=.01, min_samples=1).fit(coordinates)
labels = db.labels_
num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
clusters = pd.Series([coordinates[labels == i] for i in xrange(num_clusters)])
print('Number of clusters: %d' % num_clusters)

Upvotes: 4

Views: 5979

Answers (3)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77454

Unfortunately, that blog post is wrong in a number of key points.

  1. Never use DBSCAN with min_samples=1. That is single-linkage clustering. If you want single-linkage, use single-linkage, not DBSCAN. Here, Leader clustering may be a good choice, too.

  2. Choose eps wisely. In his example, he chose eps so small, he mostly removed (near-) duplicates...

  3. DBSCAN clusters do not have a meaningful center. Because they can be non-convex.In particular, the center needs to take haversine distance into account, which he doesn't. The first version used the mean, the new version uses the point closest to the mean (but which may still be distorted, because the mean didn't take earth into account).

  4. On latitude, longitude you should be using great-circle distance during clustering already, not only afterwards. (Fixed in the blog by now).

Point 3 above also answers your question: DBSCAN clusters may not have a meaningful center. The center can be outside the cluster.

Since the original post, some points (in particular #4) have been improved. DBSCAN now actually uses haversine, and a ball tree index.

Upvotes: 1

Ricard D.
Ricard D.

Reputation: 19

I'm working on a similar project as well and using his blog post as a guide too. The logic to return the center of the largest cluster (but be aware, the center itself may be meaningless with DBSCAN): sort the clusters by size, take the largest, calculate the centroid (using the logic provided in that blog post). Then you have a choice. You can either keep that calculated centroid as the "center point" or you can find the point in the cluster nearest to that centroid (as the author of that blog post seems to do).

Contrary to another responder, that blog post is not wrong on a number of points:

  1. Single-linkage clustering is perfectly fine and useful in a number of contexts, including the one the blog post's author is using.
  2. His eps is appropriate and properly chosen for his use case, which he explains is explicitly meant to remove near-duplicates.
  3. The blog post isn't wrong about cluster centers - in fact it explicitly mentions non-convexity, and the code returns a point from the cluster rather than the cluster's center point itself.
  4. The code in the blog post does use great-circle distance via the haversine metric during DBSCAN clustering.

Most importantly, the results turn out exactly how they're supposed to in the blog post.

Upvotes: 1

Mtap1
Mtap1

Reputation: 157

If you are interested in representing the largest cluster as a 'central' point (e.g., dimensionality reduction), I would do the following:

Find the cluster with the greatest number of classified points:

# Assumes coordinates is a DataFrame
db = DBSCAN(eps=eps, min_samples=min_samples).fit(coordinates)
df = pd.DataFrame([coordinates.x, coordinates.y, db.labels_]).T # Add other attributes of coordinates if needed
df.columns = ['x', 'y','label']; # Add column names
max_label = df.label.mode()[0];

max_cluster = df[df['label']==max_label];

You can take the mean of each of each column

max_cluster_array = max_cluster[['x','y']].as_matrix()
print max_cluster_array.mean(axis=0) # what you are looking for

You could also looking into multivariate kernel density estimation functions if you are interested in estimating in a more robust 'central' point.

Upvotes: 1

Related Questions