I have a series of short text utterances (each one or two sentences). I am trying to cluster them by contextual similarity (not literal word overlap) and report each cluster with its member items. Below is the Python code I have written.
import time
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "My internet speed is very slow. There is no signal available from the router",
    "Close the TV and Broadband subscription. I want to move to another plan",
    "Broadband is not up to the mark. I could not work with the current speed that is offered",
    "My Bills for the current month have shot up by 200%. Can you take just 50% of it for the current month and credit the remaining as backlog for next month.",
    "I am not happy with the service. Please deactivate my subscription and I don't need the services anymore",
    "I wanted to make the bill payment, but my card is not accepted for payment through your portal. Please resolve",
    "Why is my Bill showing up extra charges of 48$ this month. I never purchased anything additional",
    "I have already made the payment for the previous month. Why is that showing up as arrears this month",
    "John is the most loyal professional footballer ever",
    "This is the most amazing place that I have ever visited!"
]
# Embeddings using Sentence Transformers - all-MiniLM-L6-v2
embeddings = model.encode(sentences, show_progress_bar=True, convert_to_tensor=True)
# Clustering process
print("Start clustering")
start_time = time.time()
# Two parameters to tune:
# min_community_size: only keep communities that have at least n elements
# threshold: sentence pairs with a cosine similarity above this value are considered similar
clusters = util.community_detection(embeddings, min_community_size=1, threshold=0.43)
print(f"Clustering done after {time.time() - start_time:.2f} sec")
# Print all elements of each cluster
for i, cluster in enumerate(clusters):
    print(f"\nCluster {i + 1}, #{len(cluster)} Elements")
    for sentence_id in cluster:
        print("\t", sentences[sentence_id])
I get the output below:
Start clustering
Clustering done after 0.00 sec
Cluster 1, #4 Elements
    I have already made the payment for the previous month. Why is that showing up as arrears this month
    Why is my Bill showing up extra charges of 48$ this month. I never purchased anything additional
    I wanted to make the bill payment, but my card is not accepted for payment through your portal. Please resolve
    My Bills for the current month have shot up by 200%. Can you take just 50% of it for the current month and credit the remaining as backlog for next month.
Cluster 2, #2 Elements
    My internet speed is very slow. There is no signal available from the router
    Broadband is not up to the mark. I could not work with the current speed that is offered
Cluster 3, #2 Elements
    Close the TV and Broadband subscription. I want to move to another plan
    I am not happy with the service. Please deactivate my subscription and I don't need the services anymore
Cluster 4, #1 Elements
    John is the most loyal professional footballer ever
Cluster 5, #1 Elements
    This is the most amazing place that I have ever visited!
My main question: when I set the threshold to 0.8 or above, I presumed it would give me the right results, but it didn't. Only when I lowered it to ~0.45 did I get the result above. This was just a simple experiment, so should I re-tune this threshold for every new dataset?
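To get a feel for how the threshold changes the outcome, one thing I tried conceptually is sweeping the threshold and watching the cluster count. The sketch below is a simplified stand-in for util.community_detection (it just links any pair above the threshold and takes connected components over a toy precomputed similarity matrix); `clusters_at_threshold` and the `sim` matrix are made up for illustration, not part of sentence-transformers.

```python
def clusters_at_threshold(sim, threshold):
    # Union-find over a pairwise similarity matrix: any pair at or above
    # the threshold is linked, and each connected component is a cluster.
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy cosine-similarity matrix for 4 items: 0-1 are close, 2-3 are close.
sim = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
for t in (0.45, 0.75, 0.9):
    print(t, len(clusters_at_threshold(sim, t)))  # 2, then 3, then 4 clusters
```

The pattern I see is that raising the threshold fragments the data into more, smaller clusters, which may explain why 0.8 gave poor groupings on my sentences.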
Also, I would like to know how to group the general messages/sentences (like the last two) under a 'General' category. Is there a mechanism to cluster them all under 'General'?
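One idea I considered, assuming I raise min_community_size to 2 so that outliers are left unassigned: collect every sentence id that did not land in any cluster into a catch-all 'General' bucket. `general_bucket` is a hypothetical helper name of my own, not a library function.

```python
def general_bucket(clusters, num_sentences):
    # Ids already claimed by some cluster.
    clustered = {sid for cluster in clusters for sid in cluster}
    # Everything else falls into the catch-all "General" category.
    return [i for i in range(num_sentences) if i not in clustered]

# Example: 5 sentences, two clusters found; ids 2 and 4 were left out.
clusters = [[0, 1], [3]]
print(general_bucket(clusters, 5))  # → [2, 4]
```

Is something like this the right approach, or does the library offer a built-in way?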
I would also like advice on how to caption these clusters with a name, such as Billing, Plan upgrade, or Network speed performance.
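The only naive idea I have so far is to surface each cluster's most frequent content words as a rough label; `label_cluster` and the tiny stopword list below are my own inventions for illustration, not part of sentence-transformers. Would a proper keyword extractor, or picking the sentence closest to the cluster centroid, work better?

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "my", "i", "to", "for", "of", "and", "that",
             "with", "not", "it", "this", "a", "an", "me", "have"}

def label_cluster(cluster_sentences, top_n=2):
    # Count non-stopword tokens across all sentences in the cluster
    # and join the most frequent ones into a rough caption.
    words = []
    for sent in cluster_sentences:
        words += [w for w in re.findall(r"[a-z]+", sent.lower())
                  if w not in STOPWORDS]
    return " / ".join(w for w, _ in Counter(words).most_common(top_n))

cluster = [
    "Why is my Bill showing up extra charges of 48$ this month",
    "I have already made the payment for the previous month",
    "I wanted to make the bill payment, but my card is not accepted",
]
print(label_cluster(cluster))  # a label built from the dominant words, e.g. containing "bill"
```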