Reputation: 4608
I am using the following code to cluster my word vectors with the k-means clustering algorithm.
from gensim.models import word2vec
from sklearn import cluster

model = word2vec.Word2Vec.load("word2vec_model")
X = model[model.wv.vocab]            # one row per word in the vocabulary
clusterer = cluster.KMeans(n_clusters=6)
preds = clusterer.fit_predict(X)     # cluster label for each word vector
centers = clusterer.cluster_centers_
Given a word in the word2vec vocabulary (e.g., word_vector = model['jeep']), I want to get its cluster ID and its cosine distance to that cluster's center.
I tried the following approach.
import numpy as np

for i in set(preds):
    positions = X[np.where(preds == i)]   # all vectors assigned to cluster i
    print(positions)
However, this returns all the vectors in each cluster, which is not exactly what I am looking for.
I am happy to provide more details if needed.
Upvotes: 4
Views: 1511
Reputation: 16966
Here is my attempt!
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# train a small Word2Vec model on gensim's bundled toy corpus
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)

# cluster every word vector in the vocabulary into two clusters
clustering_model = KMeans(n_clusters=2)
preds = clustering_model.fit_predict([model.wv.get_vector(w) for w in model.wv.vocab])
To get the cluster ID for a given word:
>>> clustering_model.predict([model.wv.get_vector('computer')])
# array([1], dtype=int32)
To get the cosine similarity between the given word and each cluster center:
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity(clustering_model.cluster_centers_, [model.wv.get_vector('computer')])
# array([[-0.07410881],
#        [ 0.34881588]])
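To answer the question directly (the distance to the word's own center only), the two steps above can be combined. The sketch below is just an illustration that reuses model and clustering_model from above, uses scipy.spatial.distance.cosine for the distance, and names the helper cluster_and_distance purely for readability.

from scipy.spatial.distance import cosine

def cluster_and_distance(word):
    # hypothetical helper: cluster ID and cosine distance to that cluster's center
    vec = model.wv.get_vector(word)
    cluster_id = int(clustering_model.predict([vec])[0])
    center = clustering_model.cluster_centers_[cluster_id]
    return cluster_id, cosine(vec, center)   # cosine distance = 1 - cosine similarity

print(cluster_and_distance('computer'))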
Upvotes: 1
Reputation: 31659
After clustering you get labels_ for all of your input data (in the same order as the input), i.e. clusterer.labels_[model.wv.vocab['jeep'].index] would give you the cluster to which jeep belongs.
You can calculate the cosine distance with scipy.spatial.distance.cosine:
cluster_index = clusterer.labels_[model.wv.vocab['jeep'].index]
print(distance.cosine(model['jeep'], centers[cluster_index]))
>> 0.6935321390628815
Full code
I don't know what your model looks like, but let's use GoogleNews-vectors-negative300.bin.
from gensim.models import KeyedVectors
from sklearn import cluster
from scipy.spatial import distance

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# let's use a subset to accelerate clustering
X = model[model.wv.vocab][:40000]

clusterer = cluster.KMeans(n_clusters=6)
preds = clusterer.fit_predict(X)
centers = clusterer.cluster_centers_

# note: this assumes 'jeep' is among the first 40000 vocabulary entries,
# since only those vectors were clustered
cluster_index = clusterer.labels_[model.wv.vocab['jeep'].index]
print(cluster_index, distance.cosine(model['jeep'], centers[cluster_index]))
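As a quick sanity check, here is a minimal sketch (reusing model, X, clusterer and centers from the full code above; the word list is only illustrative) that prints the cluster ID and distance for several words and skips any word that falls outside the clustered subset:

for word in ['jeep', 'car', 'computer']:      # illustrative words, not from the question
    idx = model.wv.vocab[word].index
    if idx >= len(X):                         # only the first 40000 vectors were clustered
        print(word, 'is outside the clustered subset')
        continue
    cluster_index = clusterer.labels_[idx]
    print(word, cluster_index, distance.cosine(model[word], centers[cluster_index]))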
Upvotes: 4