ziiho_

Reputation: 43

How to perform unsupervised clustering on numbers in an array using PyTorch

I have an array, and I want to cluster/group its numbers by similar values.

An example of input array:

array([ 57,  58,  59,  60,  61,  78,  79,  80,  81,  82,  83, 101, 102, 103, 104, 105, 106])

expected result :

array([57, 58, 59, 60, 61]), array([78, 79, 80, 81, 82, 83]), array([101, 102, 103, 104, 105, 106])

I tried to use clustering, but I don't think it will work if I don't know in advance how many groups to split the array into.

true = np.where(array>=1)
-> (array([ 57,  58,  59,  60,  61,  78,  79,  80,  81,  82,  83, 101, 102,
    103, 104, 105, 106], dtype=int64),)

Upvotes: 4

Views: 367

Answers (2)

Sadra Sabouri

Reputation: 309

You can perform a kind of derivative on this array so that you can track the changes better. Assume your array is:

A = np.array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])

You can then make a difference vector by simply convolving your array with [-1, 1]:

A_ = abs(np.convolve(A, np.array([-1, 1])))

Then A_ is:

array([ 57,   1,   1,   1,   1,  17,   1,   1,   1,   1,   1,  18,   1,   1,   1,   1,   1, 106])

Now you can define a threshold (say, 5) and find the cluster boundaries:

THRESHOLD = 5
cluster_bounds = np.argwhere(A_ > THRESHOLD)

Now cluster_bounds is:

array([[ 0], [ 5], [11], [17]])
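
To get the actual groups out of those boundaries, here is a minimal sketch building on the above: the first and last entries of cluster_bounds are just edge artifacts of the convolution (|A[0]| and |A[-1]| both exceed the threshold), so only the interior indices are used as split points for np.split.

import numpy as np

A = np.array([57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83,
              101, 102, 103, 104, 105, 106])
A_ = np.abs(np.convolve(A, np.array([-1, 1])))
bounds = np.argwhere(A_ > 5).flatten()   # [0, 5, 11, 17]

#drop the leading/trailing edge artifacts; each interior index
#marks the position in A where a new cluster starts
clusters = np.split(A, bounds[1:-1])
print(clusters)
#[array([57, 58, 59, 60, 61]), array([78, 79, 80, 81, 82, 83]),
# array([101, 102, 103, 104, 105, 106])]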

Upvotes: 1

Akshay Sehgal

Reputation: 19307

Dynamic binning requires explicit criteria and is not an easy problem to automate, because each array may require a different set of thresholds to bin it efficiently.

I think a Gaussian mixture with a silhouette-score criterion is your best bet. Here is code for what you are trying to achieve. The silhouette score helps you determine the number of clusters/Gaussians to use, and it is quite accurate and interpretable for 1D data.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

#Sample data
x = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]

#Fit a model onto the data
data = np.array(x).reshape(-1,1)

#try different cluster counts and record the silhouette score for each
print('Silhouette scores')
scores = []
for n in range(2,11):
    model = GaussianMixture(n).fit(data)
    preds = model.predict(data)
    score = silhouette_score(data, preds)
    scores.append(score)
    print(n,'->',score)

n_best = np.argmax(scores)+2 #because clusters start from 2

model = GaussianMixture(n_best).fit(data) #best model fit

#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))

#Plotting
extend_window = 50  #this is for zooming into or out of the graph, higher it is , more zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis

#plot each fitted distribution (3 of them in this case)
for i in range(n_best):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))

#split the sorted data at the first index of each new cluster label
pred = model.predict(data)
output = np.split(x, np.sort(np.unique(pred, return_index=True)[1])[1:])
print(output)

Silhouette scores
2 -> 0.699444729378163
3 -> 0.8962176943475543  #<--- selected as n_best
4 -> 0.7602523591781903
5 -> 0.5835620702692205
6 -> 0.5313888070615105
7 -> 0.4457049486461251
8 -> 0.4355742296918767
9 -> 0.13725490196078433
10 -> 0.2159663865546218

This fits 3 Gaussians with the following distributions, which are used to split the data into clusters.

[Plot: the three fitted Gaussian pdfs over the data points on the x-axis]

The output arrays, finally split by similar values:

#output - 
[array([57, 58, 59, 60, 61]),
 array([78, 79, 80, 81, 82, 83]),
 array([101, 102, 103, 104, 105, 106])]

Upvotes: 1
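
Side note: since the question asks for PyTorch specifically, the gap-threshold idea from the first answer translates directly to tensor ops. A minimal sketch, assuming PyTorch >= 1.8 (where torch.diff and torch.tensor_split are available):

import torch

x = torch.tensor([57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83,
                  101, 102, 103, 104, 105, 106])

#indices where the gap to the previous value exceeds the threshold
gaps = torch.diff(x).abs()
split_points = (torch.nonzero(gaps > 5).flatten() + 1).tolist()

#tensor_split takes the split indices directly
clusters = torch.tensor_split(x, split_points)
print(clusters)
#(tensor([57, 58, 59, 60, 61]), tensor([78, 79, 80, 81, 82, 83]),
# tensor([101, 102, 103, 104, 105, 106]))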
