skywalker

Reputation: 318

Number of clusters increases as MinPts increases in scikit-learn DBSCAN

I am using the DBSCAN implementation from the scikit-learn library and getting strange results. The estimated number of clusters sometimes increases when I increase the MinPts parameter (min_samples), and from my understanding of the algorithm this should not happen.

Here are my results:

Estimated number of clusters:34 eps=0.9 min_samples=13.0
Estimated number of clusters:35 eps=0.9 min_samples=12.0
Estimated number of clusters:42 eps=0.9 min_samples=11.0 <- strange result here
Estimated number of clusters:37 eps=0.9 min_samples=10.0   
Estimated number of clusters:53 eps=0.9 min_samples=9.0
Estimated number of clusters:63 eps=0.9 min_samples=8.0

I use scikit-learn like this:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=eps, min_samples=min_samples, algorithm='kd_tree').fit(X)

and X is an array that contains ~200k 12-dimensional points.

What can be the problem here?

Upvotes: 3

Views: 3129

Answers (1)

Fred Foo

Reputation: 363807

DBSCAN divides points/samples into three categories:

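  1. Core: lives in a dense neighborhood and may therefore give rise to a cluster. In scikit-learn's implementation, min_samples is the neighborhood density parameter: a point is core if at least min_samples points, counting the point itself, lie within distance eps of it.
  2. Density-reachable: not dense enough to be a core point itself, but close enough to a core point to be part of its cluster.
  3. Outliers: all the rest. scikit-learn gives them the label -1.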
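
To make the three categories concrete, here is a minimal sketch in plain Python. It is not scikit-learn's code: the function name classify_points and the 1-D toy data are made up for illustration, and the only thing borrowed from scikit-learn is the convention that a point counts toward its own neighborhood.

```python
def classify_points(points, eps, min_samples):
    """Label each 1-D point as 'core', 'border' or 'noise' (illustrative helper)."""
    # Each neighborhood includes the point itself, as in scikit-learn.
    neighbors = [
        [j for j, q in enumerate(points) if abs(p - q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]

    kinds = []
    for i, n in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in n):
            # Not dense enough itself, but within eps of a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Toy data: two clumps, one point between them, one far-away point.
points = [0, 1, 2, 3, 7, 11, 12, 13, 14, 30]
kinds = classify_points(points, eps=4.5, min_samples=4)
for p, k in zip(points, kinds):
    print(p, k)
```

With these values the point at 7 has only three points in its neighborhood (3, 7 and 11), so it is density-reachable but not core, and the point at 30 is an outlier.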

Now, as you raise min_samples you require a denser neighborhood for core points, so you get fewer core points. But a core point x losing its status can have one of three effects, depending on the density just outside its neighborhood:

  1. x is still density-reachable from the core points of its former cluster and the remaining core points are able to hold the cluster together. The number of clusters is unchanged.
  2. x is still density-reachable from at least two core points, but no longer acts as a density-connecting "bridge" between the core points, causing them to form separate clusters. The number of clusters increases and x is assigned to another point's cluster.
  3. Neither x nor its neighboring points are able to sustain their former cluster and it disappears, leaving x as an outlier. The number of clusters decreases.
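
The second and third effects can be reproduced on a tiny made-up dataset. The sketch below assumes NumPy and scikit-learn are installed; the data (two clumps joined by a single "bridge" point at 7), the eps value and the helper count_clusters are illustrative choices, not taken from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two clumps of four points each, joined by one "bridge" point at 7.
X = np.array([0, 1, 2, 3, 7, 11, 12, 13, 14], dtype=float).reshape(-1, 1)


def count_clusters(X, eps, min_samples):
    """Number of clusters found by DBSCAN, ignoring the noise label -1."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    return len(set(labels.tolist()) - {-1})


counts = {m: count_clusters(X, eps=4.5, min_samples=m) for m in range(3, 7)}
for m, n in counts.items():
    print(f"min_samples={m}: {n} cluster(s)")
# min_samples=3: 1 cluster(s)
# min_samples=4: 2 cluster(s)
# min_samples=5: 2 cluster(s)
# min_samples=6: 0 cluster(s)
```

At min_samples=3 the bridge point is core and ties everything into one cluster. At min_samples=4 it loses core status, the two clumps separate, and the count goes up (effect 2). At min_samples=6 no core points remain, so every point becomes an outlier (effect 3).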

Upvotes: 9
