Reputation: 469
I am a newbie in text mining; here is my situation. Suppose I have a list of words ['car', 'dog', 'puppy', 'vehicle'] and I would like to cluster the words into k groups, so that the output is [['car', 'vehicle'], ['dog', 'puppy']]. I first calculate the similarity score of each pair of words to obtain a 4x4 matrix M (in this case), where Mij is the similarity score of words i and j. After transforming the words into numeric data, I use a clustering library (such as sklearn) or implement the clustering myself to get the word clusters.
I want to know whether this approach makes sense. Besides, how do I determine the value of k? More importantly, I know that there are different clustering techniques; should I use k-means or k-medoids for word clustering?
Upvotes: 10
Views: 18417
Reputation: 12515
Adding on to what's already been said regarding similarity scores, finding k in clustering applications is generally aided by scree plots (also known as "elbow curves"). In these plots, you'll usually have some measure of dispersion within clusters (for example the within-cluster sum of squares) on the y-axis, and the number of clusters on the x-axis. Finding the elbow of the curve, that is, the point where its second derivative is largest and adding more clusters stops reducing the dispersion sharply, gives you a more objective measure of cluster "uniqueness."
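As a rough illustration of reading the elbow off numerically, here is a minimal sketch that picks the k with the largest discrete second difference. The helper name `elbow_k` and the dispersion values are made up for the example; in practice the values would come from running your clustering once per candidate k.

```python
def elbow_k(ks, dispersions):
    """Return the k whose discrete second difference is largest.

    ks must be consecutive integers; dispersions[i] is the within-cluster
    dispersion (for example k-means inertia) obtained with ks[i] clusters.
    """
    if len(ks) != len(dispersions) or len(ks) < 3:
        raise ValueError("need at least three (k, dispersion) pairs")
    best_k, best_curvature = None, float("-inf")
    # Interior points only: the second difference needs both neighbours.
    for i in range(1, len(ks) - 1):
        curvature = dispersions[i - 1] - 2 * dispersions[i] + dispersions[i + 1]
        if curvature > best_curvature:
            best_k, best_curvature = ks[i], curvature
    return best_k


# Hypothetical dispersion values for k = 1..5, made up for illustration.
ks = [1, 2, 3, 4, 5]
dispersions = [100.0, 60.0, 20.0, 18.0, 17.0]
chosen_k = elbow_k(ks, dispersions)
print(chosen_k)
```

This is only a heuristic: on noisy curves the largest second difference can land on a spurious bump, so it is worth looking at the plot as well.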
Upvotes: 1
Reputation: 88148
Following up the answer by Brian O'Donnell, once you've computed the semantic similarity with word2vec (or FastText or GloVe, ...), you can then cluster the matrix using sklearn.cluster. I've found that for small matrices, spectral clustering gives the best results.
It's worth keeping in mind that the word vectors are often embedded on a high-dimensional sphere. K-means with a Euclidean distance matrix fails to capture this, and may lead to poor results for the similarity of words that aren't immediate neighbors.
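To make the spectral clustering suggestion concrete, here is a minimal sketch using scikit-learn's `SpectralClustering` with `affinity="precomputed"`, which takes the similarity matrix directly instead of feature vectors. The similarity values below are invented for the example and are not the output of any real embedding model; the sketch also assumes the matrix is symmetric and non-negative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

words = ["car", "dog", "puppy", "vehicle"]

# Hypothetical similarity scores, made up for illustration (not model output).
# M[i, j] is the similarity of words[i] and words[j]; it must be symmetric
# and non-negative to be used as a precomputed affinity.
M = np.array([
    [1.0, 0.1, 0.1, 0.9],
    [0.1, 1.0, 0.8, 0.1],
    [0.1, 0.8, 1.0, 0.1],
    [0.9, 0.1, 0.1, 1.0],
])

model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(M)

# Group the words by their assigned cluster label.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)
print(list(clusters.values()))
```

If your similarities are cosine values, they can be negative; one option is to clip or rescale them into a non-negative range before passing them as an affinity.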
Upvotes: 9
Reputation: 1876
If you want to cluster words by their "semantic similarity" (i.e. likeness of their meaning), take a look at Word2Vec and GloVe. Gensim has an implementation of Word2Vec. The web page "Word2Vec Tutorial" by Radim Rehurek gives a tutorial on using Word2Vec to determine similar words.
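Once you have one vector per word, the pairwise similarity matrix from the question can be built with cosine similarity. The sketch below uses tiny hand-written vectors as stand-ins (they are hypothetical, not real Word2Vec or GloVe output); with a trained model you would look the vectors up from the model instead.

```python
import numpy as np

words = ["car", "dog", "puppy", "vehicle"]

# Hypothetical 3-dimensional embeddings, made up for illustration.
vectors = np.array([
    [0.9, 0.1, 0.0],   # car
    [0.1, 0.9, 0.2],   # dog
    [0.0, 0.8, 0.3],   # puppy
    [0.8, 0.2, 0.1],   # vehicle
])

# Scale every row to unit length; the dot product of unit vectors is their cosine.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
M = unit @ unit.T

for word, row in zip(words, M):
    print(word, np.round(row, 2))
```

Working with the unit-length rows also helps if you do want a Euclidean method: for unit vectors the squared Euclidean distance equals 2 - 2 * cosine, so the nearest neighbours are the same under both measures.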
Upvotes: 5