Problems with HDBSCAN and approximate predict

I would like to use the HDBSCAN clustering technique to predict outliers. I have trained my model and tuned its parameters, but when I apply approximate_predict to new data, I get different clusters and labels than I have in my original model. I will explain the process flow here.

I have a dataset that looks like this:

enter image description here

It should be noted that this dataset has outliers that I added artificially, with the objective of optimizing the parameters. Then, I apply:

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=1,
                            cluster_selection_epsilon=0.1, allow_single_cluster=True,
                            gen_min_span_tree=True, prediction_data=True, leaf_size=30)
clusterer.fit(X_scaled)

This yields three clusters (including the outlier cluster, labeled -1):

enter image description here

Here is what the clustering looks like:

enter image description here

After this, I create a dataframe called "new_observation", which contains a few random observations taken from the original dataset, and I apply:

test_labels, strengths = hdbscan.approximate_predict(clusterer, new_observation)
test_labels

Here, my test labels look like: array([ -1, 56, 150, -1])

This means that, of these observations, it detects two outliers and assigns the other two to clusters that I do not have.
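A quick way to confirm the mismatch is to compare the predicted labels against the labels seen during fitting. The snippet below is only a minimal sketch in plain Python: `unexpected_labels` is a hypothetical helper (not part of hdbscan), and the training labels are assumed to be -1, 0 and 1, matching the three clusters described above.

```python
def unexpected_labels(train_labels, predicted_labels):
    """Return predicted labels that never occur in the training labels.

    The noise label -1 is always treated as valid.
    """
    known = {int(label) for label in train_labels} | {-1}
    return sorted({int(label) for label in predicted_labels} - known)


# Assumed training labels: two clusters (0 and 1) plus noise (-1).
train_labels = [-1, 0, 1, 0, 1]
# Labels reported by approximate_predict in the question.
predicted = [-1, 56, 150, -1]

print(unexpected_labels(train_labels, predicted))  # prints [56, 150]
```

In the real workflow, `clusterer.labels_` and `test_labels` would take the place of the two lists.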

Moreover, plotting the results with:

import matplotlib.pyplot as plt

# plt.get_cmap replaces cm.get_cmap, which was removed in recent matplotlib releases
cmap = plt.get_cmap('Set1')
plt.scatter(x='wind_speed', y='temperature', data=X_scaled, c=clusterer.labels_, cmap=cmap)
plt.scatter(x='wind_speed', y='temperature', data=new_observation, c=test_labels, cmap=cmap, s=120)
plt.show()

enter image description here

We can observe that we have outliers where we should not have.

I really do not know how approximate_predict assigns these labels, but it does not seem to be working. Could someone please help me?

Thank you!!!!

Upvotes: 4

Views: 4629

Answers (1)

Peder Ward

Reputation: 89

I had the same problem as well. Remove cluster_selection_epsilon as a parameter and only use min_samples and min_cluster_size to tune the clustering. It worked for me.

Upvotes: 1
