Reputation: 1549
I have recently been introduced to word2vec and I'm having some trouble trying to figure out how exactly it is used for k-means clustering.
I do understand how k-means works with tf-idf vectors. For each text document you have a vector of tf-idf values and after choosing some documents as initial cluster centers, you can use the euclidian distance to minimise the the distances between the vectors of the documents. Here's an example.
However, when using word2vec, each word is represented as a vector. Does this mean that each document corresponds to a matrix? And if so, how do you compute the minimum distance w.r.t. other text documents?
Question: How do you compute the distance between text documents for k-means with word2vec?
Edit: To explain my confusion in a bit more detail, please consider the following code:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences_tfidf)
print(tfidf_matrix.toarray())
model = Word2Vec(sentences_word2vec, min_count=1)
word2vec_matrix = model[model.wv.vocab]
print(len(word2vec_matrix))
for i in range(0,len(word2vec_matrix)):
print(X[i])
It returns the following code:
[[ 0. 0.55459491 0. 0. 0.35399075 0. 0.
0. 0. 0. 0. 0. 0. 0.437249
0.35399075 0.35399075 0.35399075 0. ]
[ 0. 0. 0. 0.44302215 0.2827753 0. 0.
0. 0.34928375 0. 0. 0. 0.34928375
0. 0.2827753 0.5655506 0.2827753 0. ]
[ 0. 0. 0.35101741 0. 0. 0.27674616
0.35101741 0. 0. 0.35101741 0. 0.35101741
0.27674616 0.27674616 0.44809973 0. 0. 0.27674616]
[ 0.40531999 0. 0. 0. 0.2587105 0.31955894
0. 0.40531999 0.31955894 0. 0.40531999 0. 0.
0. 0. 0.2587105 0.2587105 0.31955894]]
20
[ 4.08335682e-03 -4.44161100e-03 3.92342824e-03 3.96498619e-03
6.99949533e-06 -2.14108804e-04 1.20419310e-03 -1.29191438e-03
1.64671184e-03 3.41688609e-03 -4.94929403e-03 2.90348311e-03
4.23802016e-03 -3.01274913e-03 -7.36164337e-04 3.47558968e-03
-7.02908786e-04 4.73567843e-03 -1.42914290e-03 3.17237526e-03
9.36070050e-04 -2.23833631e-04 -4.03443904e-04 4.97530040e-04
-4.82502300e-03 2.42140982e-03 -3.61089432e-03 3.37070058e-04
-2.09900597e-03 -1.82093668e-03 -4.74618562e-03 2.41499138e-03
-2.15628324e-03 3.43719614e-03 7.50159554e-04 -2.05973233e-03
1.92534993e-03 1.96503079e-03 -2.02400610e-03 3.99564439e-03
4.95056808e-03 1.47033704e-03 -2.80071306e-03 3.59585625e-04
-2.77896033e-04 -3.21732066e-03 4.36303904e-03 -2.16396619e-03
2.24438333e-03 -4.50925855e-03 -4.70488053e-03 6.30825118e-04
3.81869613e-03 3.75767215e-03 5.01064525e-04 1.70175335e-03
-1.26033701e-04 -7.43318116e-04 -6.74833194e-04 -4.76678275e-03
1.53754558e-03 2.32421421e-03 -3.23472451e-03 -8.32759659e-04
4.67014220e-03 5.15853462e-04 -1.15449808e-03 -1.63017167e-03
-2.73897988e-03 -3.95627553e-03 4.04657237e-03 -1.79282576e-03
-3.26930732e-03 2.85121426e-03 -2.33304151e-03 -2.01760884e-03
-3.33597139e-03 -1.19233003e-03 -2.12347694e-03 4.36858647e-03
2.00414215e-03 -4.23572073e-03 4.98410035e-03 1.79121632e-03
4.81655030e-03 3.33247939e-03 -3.95260006e-03 1.19335402e-03
4.61675343e-04 6.09758368e-04 -4.74696746e-03 4.91552567e-03
1.74517138e-03 2.36604619e-03 -3.06009664e-04 3.62954312e-03
3.56943789e-03 2.92139384e-03 -4.27138479e-03 -3.51175456e-03]
[ -4.14272398e-03 3.45513038e-03 -1.47538856e-04 -2.02292087e-03
-2.96578306e-04 1.88684417e-03 -2.63865804e-03 2.69249966e-03
4.57606697e-03 2.19206396e-03 2.01336667e-03 1.47434452e-03
1.88332598e-03 -1.14452699e-03 -1.35678309e-03 -2.02636060e-04
-3.26160830e-03 -3.95368552e-03 1.40415027e-03 2.30542314e-03
-3.18884710e-03 -4.46776347e-03 3.96415358e-03 -2.07852037e-03
4.98413946e-03 -6.43568579e-04 -2.53325375e-03 1.30117545e-03
1.26555841e-03 -8.84680718e-04 -8.34991166e-04 -4.15050285e-03
4.66807076e-04 1.71844949e-04 1.08140183e-03 4.37910948e-03
-3.28412466e-03 2.09890743e-04 2.29888223e-03 4.70223464e-03
-2.31004297e-03 -5.10134443e-04 2.57104915e-03 -2.55978899e-03
-7.55646848e-04 -1.98197929e-04 1.20443532e-04 4.63618943e-03
1.13036349e-05 8.16594984e-04 -1.65917678e-03 3.29331891e-03
-4.97825304e-03 -2.03667139e-03 3.60272871e-03 7.44500838e-04
-4.40325850e-04 6.38399797e-04 -4.23364760e-03 -4.56386572e-03
4.77551389e-03 4.74880403e-03 7.06148741e-04 -1.24937459e-03
-9.50689311e-04 -3.88551364e-03 -4.45985980e-03 -1.15060725e-03
3.27067473e-03 4.54987818e-03 2.62327422e-03 -2.40981602e-03
4.55576897e-04 3.19155119e-03 -3.84227419e-03 -1.17610034e-03
-1.45622855e-03 -4.32460709e-03 -4.12792247e-03 -1.74557802e-03
4.66075348e-04 3.39668151e-03 -4.00651991e-03 1.41077011e-03
-7.89384532e-04 -6.56061340e-04 1.14822399e-03 4.12205653e-03
3.60721885e-03 -3.11746349e-04 1.44255662e-03 3.11965472e-03
-4.93455213e-03 4.80490318e-03 2.79991422e-03 4.93505970e-03
3.69034940e-03 4.76422161e-03 -1.25827035e-03 -1.94680784e-03]
...
[ -3.92252317e-04 -3.66805331e-03 1.52376946e-03 -3.81564132e-05
-2.57118000e-03 -4.46725264e-03 2.36480637e-03 -4.70252614e-03
-4.18651942e-03 4.54758806e-03 4.38804098e-04 1.28351408e-03
3.40470579e-03 1.00038981e-03 -1.06557179e-03 4.67202952e-03
4.50591929e-03 -2.67829909e-03 2.57702312e-03 -3.65824508e-03
-4.54068230e-03 2.20785337e-03 -1.00554363e-03 5.14690124e-04
4.64830594e-03 1.91410910e-03 -4.83837258e-03 6.73376708e-05
-2.37796479e-03 -4.45193471e-03 -2.60163331e-03 1.51159777e-03
4.06868104e-03 2.55690538e-04 -2.54662265e-03 2.64597777e-03
-2.62586889e-03 -2.71554058e-03 5.49281889e-04 -1.38776843e-03
-2.94354092e-03 -1.13887887e-03 4.59292997e-03 -1.02300232e-03
2.27600057e-03 -4.88117011e-03 1.95790920e-03 4.64376673e-04
2.56658648e-03 8.90390365e-04 -1.40368659e-03 -6.40658545e-04
-3.53228673e-03 -1.30717538e-03 -1.80223631e-03 2.94505036e-03
-4.82233381e-03 -2.16079340e-03 2.58940039e-03 1.60595961e-03
-1.22245611e-03 -6.72614493e-04 4.47060820e-03 -4.95934719e-03
2.70283176e-03 2.93257344e-03 2.13279200e-04 2.59435410e-03
2.98801321e-03 -2.79974379e-03 -1.49789048e-04 -2.53924704e-03
-7.83207070e-04 1.18357304e-03 -1.27669750e-03 -4.16665291e-03
1.40916929e-03 1.63017987e-07 1.36708119e-03 -1.26687710e-05
1.24729215e-03 -2.50442210e-03 -3.20308795e-03 -1.41550787e-03
-1.05747324e-03 -3.97984264e-03 2.25877413e-03 -1.28316227e-03
3.60359484e-03 -1.97929185e-04 3.21712159e-03 -4.96298913e-03
-1.83640339e-03 -9.90608009e-04 -2.03964626e-03 -4.87274351e-03
7.24950165e-04 3.85614252e-03 -4.18979349e-03 2.73840013e-03]
Using tfidf, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(tfidf_matrix)
Using word2vec, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(word2vec_matrix)
(Here's an example of k-means with word2vec). So in the first case, k-means gets a matrix with the tf-idf values of each word per document, while in the second case k-means gets a vector for each word. How can k-means cluster the documents in the second case if it just has the word2vec representations?
Upvotes: 3
Views: 626
Reputation: 2123
Since you are interested in clustering documents, probably the best you can do is to use the Doc2Vec package, which can prepare a vector for each one of your documents. Then you can apply any clustering algorithm to the set of your document vectors for further processing. If, for any reason, you want to use word vectors instead, there are a few things you can do. Here is a very simple method:
Don't try to average all the words in a document, it won't work.
Upvotes: 1