Reputation: 921
I am working on clustering documents by looking at their structure.
I have extracted the structure as BERT embeddings, stored in the variable X in the code below.
What I am trying:
import numpy as np
from sklearn.cluster import KMeans

# X: 2-D array of BERT embeddings, one row per document (computed earlier)

def cos_similarity(a, b):
    # cosine similarity between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for num_clusters in np.arange(2, 200):
    model = KMeans(n_clusters=num_clusters)
    model.fit(X)
    pred = model.predict(X)
    centers = model.cluster_centers_

    cluster_sum = 0
    for i, c in enumerate(centers):
        # collect the embeddings assigned to cluster i
        use = []
        for j, p in enumerate(pred):
            if p == i:
                use.append(X[j])

        # average pairwise cosine similarity within the cluster
        score = 0
        for m in range(len(use)):
            for n in range(len(use)):
                score += cos_similarity(use[m], use[n])
        score = score / (len(use) * len(use))
        cluster_sum += score

    # mean intra-cluster similarity across all clusters
    cluster_sum = cluster_sum / num_clusters
I have written this code to compute an overall similarity score for the clustering (combining the similarity scores of all the clusters). The problem I am facing: as the number of clusters increases, the score keeps increasing.
How can I find the optimum number of clusters? The plot is for the knee algorithm suggested by @Cyrus in the answers, but I am not able to see where I should draw the line.
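A minimal sketch of picking the knee point programmatically rather than by eye (assuming the third-party kneed package is installed; ks and scores are placeholders for the x and y values in the plot):

from kneed import KneeLocator

# curve/direction depend on the shape of the plotted score:
# curve="concave", direction="increasing" for a score that grows with k,
# curve="convex", direction="decreasing" for something like inertia
kl = KneeLocator(ks, scores, curve="concave", direction="increasing")
print("suggested number of clusters:", kl.knee)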
Upvotes: 2
Views: 259
Reputation: 971
There are quite a few topics that can point you in the right direction. You can look into a few, such as:
Hope this helps!
Upvotes: 3
Reputation: 699
My answer addresses the more mathematical side of your question: the implementation of sklearn's KMeans uses Euclidean distance to measure the dissimilarity between data points. However, you are evaluating the clustering quality with cosine similarity, a different measure from the one the clustering result has been optimized for. This could explain why the cluster score increases as the number of clusters increases.
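If cosine similarity is what you actually care about, one common workaround (not a built-in option of KMeans) is to L2-normalize the embeddings before clustering: for unit vectors, squared Euclidean distance equals 2 - 2*cos, so ordinary KMeans then behaves much like spherical k-means. A minimal sketch, assuming X is a 2-D NumPy array of embeddings:

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# L2-normalize each row; for unit vectors ||a - b||^2 = 2 - 2*cos(a, b),
# so Euclidean KMeans on X_norm roughly optimizes cosine similarity too
X_norm = normalize(X)          # norm="l2" by default

model = KMeans(n_clusters=10, random_state=0).fit(X_norm)
labels = model.labels_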
Have you noticed that KMeans has an inertia_ attribute? It corresponds to the sum of squared distances of samples to their closest cluster center, and since it is based on Euclidean distance it can be considered a valid cluster score for KMeans.
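A minimal sketch of using inertia_ as the score across different cluster counts (assuming X is the embedding matrix from your question):

from sklearn.cluster import KMeans

# inertia_ = sum of squared Euclidean distances of samples to their
# closest cluster center; lower means tighter clusters
inertias = []
for k in range(2, 200):
    model = KMeans(n_clusters=k).fit(X)
    inertias.append(model.inertia_)

# plot inertias against k and look for the elbow where the
# decrease starts to flatten out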
I hope this helps you!
Upvotes: 1