Reputation: 1
I want to cluster my documents using BERT embeddings from Sentence Transformers, specifically bert-base-nli-mean-tokens, and I want to cluster those embeddings with K-Means. The problem is: can K-Means clustering use cosine distance?
Is there a solution, and code, for this problem?
Upvotes: 0
Views: 725
Reputation: 1
The K-Means algorithm uses the squared Euclidean distance.
This implies that, if you normalize your input vectors to unit length, the squared Euclidean distance between two vectors is exactly double their cosine distance. See the link:
from squared euclidean to cosine distance
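A quick numerical check of this identity (a minimal sketch with NumPy, using arbitrary random vectors in place of real embeddings): for unit vectors a and b, ||a − b||² = 2·(1 − cos(a, b)).

```python
import numpy as np

# Two arbitrary vectors, normalized to unit length
rng = np.random.default_rng(0)
a = rng.standard_normal(5)
b = rng.standard_normal(5)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# Squared Euclidean distance between the unit vectors
squared_euclidean = np.sum((a - b) ** 2)

# Cosine distance = 1 - cosine similarity (dot product of unit vectors)
cosine_distance = 1.0 - np.dot(a, b)

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
print(np.isclose(squared_euclidean, 2.0 * cosine_distance))
```

So minimizing squared Euclidean distance on normalized vectors is the same as minimizing cosine distance, which is why normalizing before K-Means works.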
Upvotes: 0
Reputation: 136
Yes, you can use K-Means clustering with BERT embeddings obtained from Sentence Transformers models such as bert-base-nli-mean-tokens. However, the standard K-Means implementation in libraries like scikit-learn uses Euclidean distance, not cosine distance. The simplest workaround is to L2-normalize the embeddings first, which makes Euclidean K-Means equivalent to clustering by cosine distance:
!pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
# Load BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')
# Your document texts
documents = ["Document 1 text...", "Document 2 text...", "..."]
# Generate BERT embeddings
embeddings = model.encode(documents)
# Normalize the embeddings
normalized_embeddings = normalize(embeddings)
# Define the K-Means model
num_clusters = 5 # Adjust the number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
# Fit the model
kmeans.fit(normalized_embeddings)
# Get cluster labels
labels = kmeans.labels_
# Output the cluster labels for your documents
print(labels)
Upvotes: 0