Carter

Reputation: 1623

How to find the meaningful word to represent each k-means cluster derived from word2vec vectors?

I used the gensim package in Python to load the pre-trained Google word2vec dataset. I then want to use k-means to find meaningful clusters among my word vectors, and to find a representative word for each cluster. I am thinking of using the word whose vector is closest to a cluster's centroid to represent that cluster, but I don't know whether this is a good idea, as my experiment did not give good results.

My example code is below:

import gensim
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

model = gensim.models.KeyedVectors.load_word2vec_format('/home/Desktop/GoogleNews-vectors-negative300.bin', binary=True)  

K=3

words = ["ship", "car", "truck", "bus", "vehicle", "bike", "tractor", "boat",
       "apple", "banana", "fruit", "pear", "orange", "pineapple", "watermelon",
       "dog", "pig", "animal", "cat", "monkey", "snake", "tiger", "rat", "duck", "rabbit", "fox"]
NumOfWords = len(words)

# construct the input array; each row is one word's vector
x = np.zeros((NumOfWords, model.vector_size))
for i in range(NumOfWords):
    x[i, :] = model[words[i]]

# train the k-means model
classifier = MiniBatchKMeans(n_clusters=K, random_state=1, max_iter=100)
classifier.fit(x)

# check whether the words are clustered correctly
print(classifier.predict(x))

# find the index and the distance of the closest points from x to each class centroid
close = pairwise_distances_argmin_min(classifier.cluster_centers_, x, metric='euclidean')
index_closest_points = close[0]
distance_closest_points = close[1]

for i in range(0, K):
    print("The closest word to the centroid of class {0} is {1}, the distance is {2}".format(i, words[index_closest_points[i]], distance_closest_points[i]))

The output is as below:

[2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
The closest word to the centroid of class 0 is rabbit, the distance is 1.578625818679259
The closest word to the centroid of class 1 is fruit, the distance is 1.8351978219013796
The closest word to the centroid of class 2 is car, the distance is 1.6586030662247868

In the code I have 3 categories of words: vehicle, fruit and animal. The output shows that k-means clustered the words into all 3 categories correctly, but the representative words derived with the centroid method are not very good: for class 0 I want to see "animal" but it gives "rabbit", and for class 2 I want to see "vehicle" but it returns "car".

Any help or suggestions for finding a good representative word for each cluster would be highly appreciated.

Upvotes: 5

Views: 1896

Answers (1)

gojomo

Reputation: 54163

It sounds like you're hoping to find a generic term for the words in each cluster – a hypernym of sorts – via an automated process, with the centroid serving as that term.

Unfortunately, I've not seen any claims that word2vec winds up arranging words that way. Words do tend to be close to other words that could fill in for them – but there really aren't any guarantees that all words of a shared type are closer to each other than to words of other types, or that hypernyms tend to be equidistant from their hyponyms, and so on. (Given word2vec's success at analogy-solving, it's certainly possible that hypernyms tend to be offset from their hyponyms in a vaguely similar direction across classes. That is, perhaps vaguely 'volkswagen' + ('animal' - 'dog') ~ 'car' – though I haven't checked.)
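
If you wanted to probe that guess, gensim's standard most_similar() vector arithmetic is enough; the sketch below just reuses the model loaded in the question (note that in the GoogleNews vectors the brand may be keyed as 'Volkswagen', so the exact key is an assumption):

# Probe the analogy guess above: 'volkswagen' + ('animal' - 'dog') ~ ?
# Purely illustrative -- the offsets are just the guess from the paragraph above.
result = model.most_similar(positive=['volkswagen', 'animal'], negative=['dog'], topn=5)
for word, similarity in result:
    print(word, similarity)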

There's an interesting observation sometimes made about word-vectors that could be relevant: vectors for words with more diffuse meaning – such as multiple senses – often tend to have lower magnitudes, in their raw form, than vectors for words with more singular meanings. The usual most-similar calculations ignore the magnitudes, just comparing the raw directions, but a search for more-generic terms might want to favor lower-magnitude vectors. But this is also just a guess I haven't checked.
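
As a rough sketch of that idea, reusing the words, x, K and classifier objects from your question, you could rank each cluster's members by raw vector magnitude and surface the lowest-magnitude word as a candidate generic term – again, only a guess, not a guarantee:

# For each cluster, pick the member word with the smallest raw vector norm
# as a candidate "generic" term. This only tests the guess above.
labels = classifier.predict(x)
for k in range(K):
    members = [w for w, label in zip(words, labels) if label == k]
    norms = {w: np.linalg.norm(model[w]) for w in members}
    candidate = min(norms, key=norms.get)
    print("Cluster {0}: lowest-magnitude word is {1} (norm {2:.2f})".format(k, candidate, norms[candidate]))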

You could look up work on automated hypernym/hyponym discovery, and it's possible word2vec vectors could be a contributing factor to such discovery processes – either trained in the normal way, or with some new wrinkles to try to force the desired arrangement. (But, such specializations aren't generally supported by gensim out-of-the-box.)

There are often papers that refine the word2vec training process to make the vectors better for particular purposes. One recent paper from Facebook Research that seems relevant is "Poincaré Embeddings for Learning Hierarchical Representations" – which reports better modeling of hierarchies and specifically tests on the noun hypernym graph of WordNet.
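
Recent gensim versions ship an implementation of that paper in gensim.models.poincare; a minimal sketch, assuming you already have (hyponym, hypernym) pairs (here a toy hand-made list – in practice you'd draw them from something like WordNet's noun hypernym graph):

from gensim.models.poincare import PoincareModel

# Toy (hyponym, hypernym) relations, for illustration only
relations = [('dog', 'animal'), ('cat', 'animal'),
             ('car', 'vehicle'), ('truck', 'vehicle')]

poincare_model = PoincareModel(relations, size=2, negative=2)
poincare_model.train(epochs=50)

# Nearest neighbours in the learned hyperbolic space
print(poincare_model.kv.most_similar('dog'))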

Upvotes: 5
