eashwar natarajan

Reputation: 71

Clustering sentences into groups and captioning each cluster with a short name

I have a series of text utterances in summary form (sentences). I am trying to cluster them by similarity of context (not of literal wording) and report each cluster with its member sentences. Below is the Python code I have written.

import time

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "My internet speed is very slow. There is no signal available from the router",
    "Close the TV and Broadband subscription. I want to move to another plan",
    "Broadband is not up to the mark. I could not work with the current speed that is offered",
    "My Bills for the current month have shot up by 200%. Can you take just 50% of it for the current month and credit the remaining as backlog for next month.",
    "I am not happy with the service. Please deactivate my subscription and I don't need the services anymore",
    "I wanted to make the bill payment, but my card is not accepted for payment through your portal. Please resolve",
    "Why is my Bill showing up extra charges of 48$ this month. I never purchased anything additional",
    "I have already made the payment for the previous month. Why is that showing up as arrears this month",
    "John is the most loyal professional footballer ever",
    "This is the most amazing place that I have ever visited!"
]

# Embeddings using Sentence Transformers - all-MiniLM-L6-v2
embeddings = model.encode(sentences, show_progress_bar=True, convert_to_tensor=True)

# Clustering process


print("Start clustering")
start_time = time.time()

# Two parameters to tune:
# min_community_size: Only consider clusters that have at least n elements
# threshold: Consider sentence pairs with a cosine similarity larger than threshold as similar
clusters = util.community_detection(embeddings, min_community_size=1, threshold=0.43)

print(f"Clustering done after {time.time() - start_time:.2f} sec")

# Print every sentence in each cluster
for i, cluster in enumerate(clusters):
    print(f"\nCluster {i + 1}, #{len(cluster)} Elements ")
    for sentence_id in cluster:
        print("\t", sentences[sentence_id])

I get the following output:

Start clustering
Clustering done after 0.00 sec

Cluster 1, #4 Elements 
     I have already made the payment for the previous month. Why is that showing up as arrears this month
     Why is my Bill showing up extra charges of 48$ this month. I never purchased anything additional
     I wanted to make the bill payment, but my card is not accepted for payment through your portal. Please resolve
     My Bills for the current month have shot up by 200%. Can you take just 50% of it for the current month and credit the remaining as backlog for next month.

Cluster 2, #2 Elements 
     My internet speed is very slow. There is no signal available from the router
     Broadband is not up to the mark. I could not work with the current speed that is offered

Cluster 3, #2 Elements 
     Close the TV and Broadband subscription. I want to move to another plan
     I am not happy with the service. Please deactivate my subscription and I don't need the services anymore

Cluster 4, #1 Elements 
     John is the most loyal professional footballer ever

Cluster 5, #1 Elements 
     This is the most amazing place that I have ever visited!

My main question: I presumed that a threshold of 0.8 or above would give me the right results, but it did not. Only when I lowered it to about 0.45 did I get the result above. This is just a simple experiment, so should I expect to re-tune this threshold for every new dataset?
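One way to investigate the threshold question is to look at which sentence pairs actually reach a given similarity. The sketch below is illustrative only: `pairs_at_or_above` is a hypothetical helper, and the 3x3 matrix holds made-up numbers rather than model output. With the real data, the matrix could presumably be obtained from `util.cos_sim(embeddings, embeddings).tolist()`.

```python
def pairs_at_or_above(sim_matrix, threshold):
    """Return (i, j, score) for every pair i < j whose similarity reaches the threshold."""
    n = len(sim_matrix)
    return [
        (i, j, sim_matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if sim_matrix[i][j] >= threshold
    ]

# Toy similarity matrix (made-up values, NOT real embeddings):
# sentences 0 and 1 are on the same topic but worded differently.
toy_matrix = [
    [1.00, 0.55, 0.10],
    [0.55, 1.00, 0.20],
    [0.10, 0.20, 1.00],
]

for threshold in (0.8, 0.45):
    print(threshold, pairs_at_or_above(toy_matrix, threshold))
```

If no pair reaches 0.8, every sentence ends up alone at that threshold, which would match the behaviour described above. Printing the pair scores for a sample of each new dataset may be a reasonable way to decide whether the threshold needs adjusting.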

I would also like to know how to group general sentences that belong to no particular topic under a 'General' category. Is there a mechanism to collect them all in one such category?
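One simple possibility, sketched here as an assumption rather than a recommended method, is to treat every cluster below a minimum size as "General" and pool its members. `split_general` is a hypothetical helper, and the index lists are the sentence positions corresponding to the five clusters printed above.

```python
def split_general(clusters, min_size=2):
    """Keep clusters with at least min_size members; pool the rest as 'General'."""
    kept = [cluster for cluster in clusters if len(cluster) >= min_size]
    general = [idx for cluster in clusters if len(cluster) < min_size for idx in cluster]
    return kept, general

# Sentence indices matching the five clusters in the output above
clusters = [[7, 6, 5, 3], [0, 2], [1, 4], [8], [9]]

kept, general = split_general(clusters)
print(kept)     # [[7, 6, 5, 3], [0, 2], [1, 4]]
print(general)  # [8, 9]
```

Here the footballer and travel sentences (indices 8 and 9) land in the pooled group. Whether a size cutoff is the right definition of "General" depends on the data; a large dataset could contain small but meaningful clusters.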

I would also like advice on how to caption these clusters with a name, such as Billing, Plan upgrade, or Network speed performance.
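A rough starting point for captions, assuming a plain word-frequency approach is acceptable, is to label each cluster with the words that occur in the most of its sentences. `caption_cluster` and the short `STOPWORDS` set are hypothetical and only for illustration; a real stopword list would be much longer, and turning a keyword such as "speed" into a polished name would still need a manual mapping or another step.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, far from complete)
STOPWORDS = {
    "a", "an", "and", "as", "by", "for", "from", "i", "is", "it", "my",
    "no", "not", "of", "that", "the", "there", "this", "to", "up", "with",
}

def caption_cluster(texts, top_n=2):
    """Label a cluster with the words that appear in the most of its sentences."""
    doc_freq = Counter()
    for text in texts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(words - STOPWORDS)
    # Highest document frequency first; ties broken alphabetically for stable output
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return " / ".join(word for word, _ in ranked[:top_n])

speed_cluster = [
    "My internet speed is very slow. There is no signal available from the router",
    "Broadband is not up to the mark. I could not work with the current speed that is offered",
]
print(caption_cluster(speed_cluster, top_n=1))  # speed
```

Counting each word once per sentence keeps one long sentence from dominating the label.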

Upvotes: 1

Views: 31

Answers (0)
