Darth Vader

Reputation: 921

Clustering documents by their structure

I am working on clustering documents by looking at their structure.

I have encoded each document's structure as BERT embeddings, stored in the variable X in the code below.

What I am trying:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# X is the array of BERT embeddings (one row per document)
scores = []
for num_clusters in range(2, 200):
    model = KMeans(n_clusters=num_clusters)
    pred = model.fit_predict(X)

    cluster_sum = 0
    for i in range(num_clusters):
        members = X[pred == i]  # embeddings assigned to cluster i
        if len(members) == 0:
            continue
        # Mean pairwise cosine similarity within the cluster
        # (includes self-pairs, as in the original double loop)
        cluster_sum += cosine_similarity(members).mean()
    cluster_sum /= num_clusters
    scores.append(cluster_sum)

I wrote this code to compute an overall similarity score for a clustering by averaging the per-cluster similarity scores. The problem I am facing: as the number of clusters increases, the score keeps increasing.

How can I find the optimum number of clusters? The plot below is for the knee algorithm suggested by @Cyrus in the answers. I am not able to see where I should draw the line.

[plot: cluster score vs. number of clusters]
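Since the bend is hard to judge by eye, the knee can also be located programmatically. Below is a minimal sketch of the "farthest point from the chord" heuristic (the idea behind the Kneedle algorithm); `knee_point` is a hypothetical helper, and the score curve here is synthetic — in practice you would pass in your own array of per-k scores:

```python
import numpy as np

def knee_point(x, y):
    """Return the x value whose point lies farthest from the straight
    line joining the first and last points of the curve (a simple
    knee heuristic in the spirit of the Kneedle algorithm)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Normalize both axes so distances along them are comparable.
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    # Distance of each point to the chord from the first to the last point.
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    dist = np.abs(dy * (xn - xn[0]) - dx * (yn - yn[0])) / np.hypot(dx, dy)
    return x[int(np.argmax(dist))]

# Synthetic example: a curve that flattens out as k grows.
ks = np.arange(2, 20)
scores = 1.0 / ks  # hypothetical score curve
print(knee_point(ks, scores))  # x value at the knee of this curve
```

The same idea is packaged in the `kneed` library as `KneeLocator`, if you prefer not to roll your own.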

Upvotes: 2

Views: 259

Answers (2)

Cyrus Dsouza

Reputation: 971

There are quite a few topics to point you in the right direction. You can look into a few, like:

  1. Elbow Method
  2. Silhouette Analysis
  3. Clustering algorithms that do not require the number of clusters upfront (such as DBSCAN)
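A minimal sketch of the first two with scikit-learn, using random blobs as a stand-in for the BERT embeddings X from the question (the data and the k range are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated blobs, so the methods have a clear answer.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 8))
               for c in (0.0, 3.0, 6.0)])

inertias, silhouettes = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_               # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, model.labels_)

# The silhouette score peaks at the best-separated clustering.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Unlike your cosine score, the silhouette score does not grow monotonically with k, so its peak is directly usable as the chosen number of clusters.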

Hope this helps!

Upvotes: 3

kampmani

Reputation: 699

My answer addresses more the mathematical side of your question:

sklearn's KMeans implementation uses Euclidean distance to measure the dissimilarity between data points. However, you are evaluating clustering quality with cosine similarity — a different measure from the one the clustering was optimized for. This could explain why the cluster score keeps increasing as the number of clusters increases.

Note that KMeans has an inertia_ attribute, which is the sum of squared distances of samples to their closest cluster center; this is a valid cluster score for KMeans, since it uses the same Euclidean distance the algorithm optimizes.
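A quick sketch of collecting inertia_ for an elbow plot, with toy data standing in for the question's embeddings X (the data and k range are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 16))  # stand-in for the BERT embeddings

inertias = []
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Sum of squared Euclidean distances of samples to their closest center
    inertias.append(model.inertia_)

# Inertia always decreases as k grows; the elbow is where it stops
# dropping sharply, not where it is smallest.
print(inertias[0], inertias[-1])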

I hope this helps you!

Upvotes: 1
