rahul verma

Reputation: 1

I want to cluster sentences, but I don't know how many clusters there will be

I have computed sentence embeddings with doc2vec, and I have also calculated the pairwise distances between the sentence vectors, so I now have the distances between the sentences. How can I cluster them without specifying the number of clusters? I have tried k-means and agglomerative clustering, but they are not giving me good results. Can anybody tell me the best method to determine the optimal number of clusters?

Upvotes: 0

Views: 129

Answers (1)

ASH

Reputation: 20302

Try this: affinity propagation picks the number of clusters itself, so you don't have to supply it. If it doesn't do what you want, I have a few other code samples to share. This may be the best option, although the best choice depends on the dataset that you feed into the algorithm.

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # third-party package that provides levenshtein()

# Replace this line with your own data
words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ")
words = np.asarray(words)  # so that indexing with a list will work

# Affinity propagation expects similarities, so negate the edit distances
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

# affinity="precomputed" means fit() receives the similarity matrix directly
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

# Print each cluster as "exemplar: members"
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Result: (output image not preserved)
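Since the question is about doc2vec sentence vectors rather than words, here is a rough sketch of the same idea applied to vectors. Treat it as an illustration only: the `vectors` array is synthetic stand-in data (not real doc2vec output), the negative squared Euclidean distance is just one possible similarity, and `damping=0.9`, `max_iter=1000` and `random_state=0` are my own choices that you would probably need to tune.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# ASSUMPTION: synthetic stand-in for doc2vec sentence vectors.
# Three loose groups of ten 5-dimensional vectors each.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
vectors = np.vstack([c + rng.normal(scale=0.5, size=(10, 5)) for c in centers])

# Similarity matrix: negative squared Euclidean distance between every pair.
# If you already have a distance matrix, negating it plays the same role.
diff = vectors[:, None, :] - vectors[None, :, :]
similarity = -np.sum(diff ** 2, axis=-1)

# No number of clusters is passed; the algorithm selects exemplars itself.
affprop = AffinityPropagation(
    affinity="precomputed", damping=0.9, max_iter=1000, random_state=0
)
labels = affprop.fit_predict(similarity)

n_clusters = len(affprop.cluster_centers_indices_)
print("clusters found:", n_clusters)
for cluster_id in np.unique(labels):
    members = np.nonzero(labels == cluster_id)[0]
    print(cluster_id, members.tolist())
```

If you get too many or too few clusters, the `preference` parameter of `AffinityPropagation` is the usual knob: lower values tend to give fewer clusters, higher values more.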

Upvotes: 0
