dre_84w934

Reputation: 738

Unsupervised learning clustering 1D array

I am faced with the following array:

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]

What I would like to do is extract the cluster with the highest scores. That would be

best_cluster = [200,297,275,243]

I have checked quite a few questions on Stack Overflow about this topic, and most of them recommend using kmeans, although a few others mention that kmeans might be overkill for clustering 1D arrays. However, kmeans is a supervised learning algorithm, hence I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one. In array y I would see 3 clusters, like so: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243]. What algorithm would best fit my needs, considering computation cost and accuracy, and how could I implement it for my problem?

Upvotes: 0

Views: 11679

Answers (3)

Has QUIT--Anony-Mousse

Reputation: 77454

Clustering is overkill here

Just compute the differences of consecutive elements, i.e. look at x[i]-x[i-1] (sort the data first if it isn't already ordered).

Choose the k largest differences as split points, or define a threshold on when to split, e.g. 20. That depends on your knowledge of the data.

This is O(n), much faster than all the others mentioned. Also very understandable and predictable.

On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
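
A minimal sketch of this idea in numpy (my illustration, not code from this answer): since y is not ordered, sort it first, then split at the k largest gaps. For this array, k = 2 split points recover the three clusters; the threshold variant works the same way.

import numpy as np

y = np.array([1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243])

# Sort so that consecutive differences correspond to gaps between clusters.
y_sorted = np.sort(y)
gaps = np.diff(y_sorted)

# Split at the k largest gaps (k = 2 splits -> 3 clusters).
# Threshold variant: split_points = np.where(gaps > threshold)[0] + 1
k = 2
split_points = np.sort(np.argsort(gaps)[-k:]) + 1
clusters = np.split(y_sorted, split_points)
print(clusters)
# [array([1, 2, 4, 4, 5, 7, 7, 9, 9]), array([54, 56, 57, 60]),
#  array([200, 243, 275, 297])]

# Clusters come out in ascending order, so the last one has the highest values.
best_cluster = clusters[-1]
print(best_cluster)  # [200 243 275 297]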

Upvotes: 5

EasonL

Reputation: 853

Try MeanShift. From the sklearn user guide on MeanShift:

The algorithm automatically sets the number of clusters, ...

Modified demo code:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))

# #############################################################################
# Compute clustering with MeanShift

# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)

ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)
print(labels)

Output:

number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]

Note that MeanShift does not scale well with the number of samples; the recommended practical upper limit is around 10,000.
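
To then pull out the best_cluster from the question, one option (my addition, continuing from the demo above) is to take every point that shares a label with the maximum value:

best_label = labels[X.argmax()]                 # label of the cluster containing the max
best_cluster = X[labels == best_label].ravel()  # all points carrying that label
print(best_cluster)                             # [200 297 275 243]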


BTW, as rahlf23 already mentioned, K-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.
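
If you do want K-means without hard-coding k, one common approach (a sketch of mine, not from this answer) is to fit several values of k and keep the one with the best silhouette score:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.reshape([1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243], (-1, 1))

# Score k = 2..5 with the silhouette coefficient and keep the best.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)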

Upvotes: 9

Edward Khachatryan

Reputation: 469

HDBSCAN is the best clustering algorithm and you should always use it.

Basically, all you need to do is provide a reasonable min_cluster_size and a valid distance metric, and you're good to go.

For min_cluster_size I suggest using 3, since a cluster of 2 is lame, and for metric the default euclidean works great, so you don't even need to specify it.

Don't forget that distance metrics apply to vectors, and here we have scalars, so some ugly reshaping is in order.

To put it all together, and assuming by "cluster with the highest scores" you mean the cluster that includes the max value, we get:

from hdbscan import HDBSCAN
import numpy as np

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))  # distance metrics want vectors, hence the reshape

# min_cluster_size=3: the smallest group we are willing to call a cluster
clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)

# Label of the cluster containing the max value, then that cluster's exemplar points
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)

The output is [297 200 275 243]. Original order is not preserved. C'est la vie.

Upvotes: -2
