Carter

Reputation: 1623

How to find the meaningful word to represent each k-means cluster derived from word2vec vectors?

I used the gensim package in Python to load the pre-trained Google word2vec dataset. I then want to use k-means to find meaningful clusters among my word vectors, and to find a representative word for each cluster. I am thinking of using the word whose vector is closest to a cluster's centroid to represent that cluster, but I don't know whether this is a good idea, as my experiment did not give good results.

My example code is below:

import gensim
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

model = gensim.models.KeyedVectors.load_word2vec_format('/home/Desktop/GoogleNews-vectors-negative300.bin', binary=True)  

K=3

words = ["ship", "car", "truck", "bus", "vehicle", "bike", "tractor", "boat",
       "apple", "banana", "fruit", "pear", "orange", "pineapple", "watermelon",
       "dog", "pig", "animal", "cat", "monkey", "snake", "tiger", "rat", "duck", "rabbit", "fox"]
NumOfWords = len(words)

# construct the input array; each row is one word's vector
x = np.zeros((NumOfWords, model.vector_size))
for i in range(NumOfWords):
    x[i, :] = model[words[i]]

# train the k-means model
classifier = MiniBatchKMeans(n_clusters=K, random_state=1, max_iter=100)
classifier.fit(x)

# check whether the words are clustered correctly
print(classifier.predict(x))

# find the index and the distance of the closest points from x to each class centroid
close = pairwise_distances_argmin_min(classifier.cluster_centers_, x, metric='euclidean')
index_closest_points = close[0]
distance_closest_points = close[1]

for i in range(0, K):
    print("The closest word to the centroid of class {0} is {1}, the distance is {2}".format(i, words[index_closest_points[i]], distance_closest_points[i]))

The output is as below:

[2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
The closest word to the centroid of class 0 is rabbit, the distance is 1.578625818679259
The closest word to the centroid of class 1 is fruit, the distance is 1.8351978219013796
The closest word to the centroid of class 2 is car, the distance is 1.6586030662247868

In the code I have 3 categories of words: vehicle, fruit and animal. The output shows that k-means clustered the words into all 3 categories correctly, but the representative words derived with the centroid method are not very good: for class 0 I want to see "animal" but it gives "rabbit", and for class 2 I want to see "vehicle" but it returns "car".

Any help or suggestions for finding a good representative word for each cluster would be highly appreciated.

Upvotes: 5

Views: 1896

Answers (1)

gojomo

Reputation: 54163

It sounds like you're hoping to find a generic term for the words in each cluster – a hypernym of sorts – via an automated process, with the centroid serving as that term.

Unfortunately, I've not seen any claims that word2vec winds up arranging words that way. Words do tend to be close to other words that could fill in for them – but there really aren't any guarantees that all words of a shared type are closer to each other than to words of other types, or that hypernyms tend to be equidistant from their hyponyms, and so on. (Given word2vec's success at analogy-solving, it's certainly possible that hypernyms tend to be offset from their hyponyms in a vaguely similar direction across classes. That is, perhaps vaguely 'volkswagen' + ('animal' - 'dog') ~ 'car' – though I haven't checked.)
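
If you wanted to probe that guess, gensim's standard most_similar() vector arithmetic is enough; the sketch below just reuses the model loaded in the question (note that in the GoogleNews vectors the brand may be keyed as 'Volkswagen', so the exact key is an assumption):

# Probe the analogy guess above: 'volkswagen' + ('animal' - 'dog') ~ ?
# Purely illustrative -- the offsets are just the guess from the paragraph above.
result = model.most_similar(positive=['volkswagen', 'animal'], negative=['dog'], topn=5)
for word, similarity in result:
    print(word, similarity)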

There's an interesting observation sometimes made about word-vectors that could be relevant: vectors for words with more diffuse meaning – such as multiple senses – often tend to have lower magnitudes, in their raw form, than vectors for words with more singular meanings. The usual most-similar calculations ignore the magnitudes, just comparing the raw directions, but a search for more-generic terms might want to favor lower-magnitude vectors. But this is also just a guess I haven't checked.
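
As a rough sketch of that idea, reusing the words, x, K and classifier objects from your question, you could rank each cluster's members by raw vector magnitude and surface the lowest-magnitude word as a candidate generic term – again, only a guess, not a guarantee:

# For each cluster, pick the member word with the smallest raw vector norm
# as a candidate "generic" term. This only tests the guess above.
labels = classifier.predict(x)
for k in range(K):
    members = [w for w, label in zip(words, labels) if label == k]
    norms = {w: np.linalg.norm(model[w]) for w in members}
    candidate = min(norms, key=norms.get)
    print("Cluster {0}: lowest-magnitude word is {1} (norm {2:.2f})".format(k, candidate, norms[candidate]))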

You could look up work on automated hypernym/hyponym discovery, and it's possible word2vec vectors could be a contributing factor to such discovery processes – either trained in the normal way, or with some new wrinkles to try to force the desired arrangement. (But, such specializations aren't generally supported by gensim out-of-the-box.)

There are often papers that refine the word2vec training process to make the vectors better for particular purposes. One recent paper from Facebook Research that seems relevant is "Poincaré Embeddings for Learning Hierarchical Representations" – which reports better modeling of hierarchies and specifically tests on the noun hypernym graph of WordNet.
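
Recent gensim versions ship an implementation of that paper in gensim.models.poincare; a minimal sketch, assuming you already have (hyponym, hypernym) pairs (here a toy hand-made list – in practice you'd draw them from something like WordNet's noun hypernym graph):

from gensim.models.poincare import PoincareModel

# Toy (hyponym, hypernym) relations, for illustration only
relations = [('dog', 'animal'), ('cat', 'animal'),
             ('car', 'vehicle'), ('truck', 'vehicle')]

poincare_model = PoincareModel(relations, size=2, negative=2)
poincare_model.train(epochs=50)

# Nearest neighbours in the learned hyperbolic space
print(poincare_model.kv.most_similar('dog'))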

Upvotes: 5
