Reputation: 1
I want to cluster my documents using BERT embeddings from Sentence Transformers, specifically bert-base-nli-mean-tokens, and I want to cluster those embeddings with K-Means. The problem is: can K-Means clustering use cosine distance?
Is there a solution, and code, for this problem?
Upvotes: 0
Views: 725
Reputation: 1
The K-Means algorithm uses the squared Euclidean distance.
This implies that, if you normalize your input vectors to unit length, the squared Euclidean distance between two vectors is exactly double their cosine distance. See the link:
from squared euclidean to cosine distance
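A quick numerical check of this identity (a minimal sketch with NumPy, using arbitrary random vectors in place of real embeddings): for unit vectors a and b, ||a − b||² = 2·(1 − cos(a, b)).

```python
import numpy as np

# Two arbitrary vectors, normalized to unit length
rng = np.random.default_rng(0)
a = rng.standard_normal(5)
b = rng.standard_normal(5)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# Squared Euclidean distance between the unit vectors
squared_euclidean = np.sum((a - b) ** 2)

# Cosine distance = 1 - cosine similarity (dot product of unit vectors)
cosine_distance = 1.0 - np.dot(a, b)

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
print(np.isclose(squared_euclidean, 2.0 * cosine_distance))
```

So minimizing squared Euclidean distance on normalized vectors is the same as minimizing cosine distance, which is why normalizing before K-Means works.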
Upvotes: 0
Reputation: 136
Yes, you can use K-Means clustering with BERT embeddings obtained from Sentence Transformers models such as bert-base-nli-mean-tokens. However, the standard K-Means implementation in libraries like scikit-learn uses Euclidean distance, not cosine distance. The simplest workaround is to L2-normalize the embeddings first, which makes Euclidean K-Means equivalent to clustering by cosine distance:
!pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
# Load BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')
# Your document texts
documents = ["Document 1 text...", "Document 2 text...", "..."]
# Generate BERT embeddings
embeddings = model.encode(documents)
# Normalize the embeddings
normalized_embeddings = normalize(embeddings)
# Define the K-Means model
num_clusters = 5 # Adjust the number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
# Fit the model
kmeans.fit(normalized_embeddings)
# Get cluster labels
labels = kmeans.labels_
# Output the cluster labels for your documents
print(labels)
Upvotes: 0