Reputation: 43
I got this array and I want to cluster/group the numbers into similar values.
An example of input array:
array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]
expected result :
array([57,58,59,60,61]), ([78,79,80,81,82,83]), ([101,102,103,104,105,106])
I tried to use clustering but I don't think it's gonna work if I don't know how many I'm going to split up.
true = np.where(array>=1)
-> (array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102,
103, 104, 105, 106], dtype=int64),)
Upvotes: 4
Views: 367
Reputation: 309
You can perform kind of derivation on this array so that you can track changes better, assume your array is:
A = np.array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
so you can make a derivation vector by simply convolving your vector with [-1 1]
:
A_ = abs(np.convolve(A, np.array([-1, 1])))
then A_
is:
array([57, 1, 1, 1, 1, 17, 1, 1, 1, 1, 1, 18, 2, 1, 1, 1, 106]
now you can define a threshold like 5
and find the cluster boundaries.
THRESHOLD = 5
cluster_bounds = np.argwhere(A_ > THRESHOLD)
now cluster_bounds
is:
array([[0], [5], [11], [16]], dtype=int32)
Upvotes: 1
Reputation: 19307
Dynamic binning requires explicit criteria and is not an easy problem to automate because each array may require a different set of thresholds to bin them efficiently.
I think Gaussian mixtures
with a silhouette score criteria
is the best bet you have. Here is a code for what you are trying to achieve. The silhouette scores help you determine the number of clusters/Gaussians you should use and is quite accurate and interpretable for 1D data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
#change value of clusters to check best silhoutte score
print('Silhoutte scores')
scores = []
for n in range(2,11):
model = GaussianMixture(n).fit(data)
preds = model.predict(data)
score = silhouette_score(data, preds)
scores.append(score)
print(n,'->',score)
n_best = np.argmax(scores)+2 #because clusters start from 2
model = GaussianMixture(n_best).fit(data) #best model fit
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #this is for zooming into or out of the graph, higher it is , more zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
y_values = scipy.stats.norm(mu[i], sd[i])
plt.plot(x_values, y_values.pdf(x_values))
#split data by clusters
pred = model.predict(data)
output = np.split(x, np.sort(np.unique(pred, return_index=True)[1])[1:])
print(output)
Silhoutte scores
2 -> 0.699444729378163
3 -> 0.8962176943475543 #<--- selected as nbest
4 -> 0.7602523591781903
5 -> 0.5835620702692205
6 -> 0.5313888070615105
7 -> 0.4457049486461251
8 -> 0.4355742296918767
9 -> 0.13725490196078433
10 -> 0.2159663865546218
This creates 3 gaussians with the following distributions to split the data into clusters.
Arrays output finally split by similar values
#output -
[array([57, 58, 59, 60, 61]),
array([78, 79, 80, 81, 82, 83]),
array([101, 102, 103, 104, 105, 106])]
Upvotes: 1