Reputation: 1
I have trained a word2vec model with gensim on a huge corpus, and I need to cluster the vocabulary using k-means clustering. For that I need:
For the feature matrix, I tried x = model.wv and got an object of type gensim.models.keyedvectors.KeyedVectors, which is much smaller than I expected a feature matrix to be.
Is there a way to use this object directly for k-means clustering?
Upvotes: 0
Views: 1472
Reputation: 54163
In gensim's Word2Vec model, the raw number_of_words x number_of_features numpy array of word vectors is in model.wv.vectors. (In older Gensim versions, the .vectors property was named .syn0, matching the original Google word2vec.c naming.)
You can use the model.wv.key_to_index dict (previously .vocab) to learn the string-token-to-array-slot assignment, or the model.wv.index_to_key list (previously .index2word) to learn the array-slot-to-word assignment.
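As a rough sketch of how the array and the index list fit together, the following feeds the vectors to scikit-learn's KMeans, which takes the feature matrix directly rather than pairwise distances. The helper name cluster_vocabulary and its parameters are hypothetical, the choice of scikit-learn is an assumption (the question does not name a k-means implementation), and the attribute names assume gensim 4.x.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def cluster_vocabulary(kv, n_clusters=10, random_state=0):
    """Group words into k-means clusters (hypothetical helper).

    `kv` is assumed to expose `.vectors` (2-D array, one row per word) and
    `.index_to_key` (list of words in row order), as gensim 4.x `model.wv` does.
    """
    X = np.asarray(kv.vectors)  # shape: (n_words, n_features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)  # one cluster id per row of X

    clusters = defaultdict(list)
    # Row i of X belongs to the word index_to_key[i].
    for word, label in zip(kv.index_to_key, labels):
        clusters[int(label)].append(word)
    return dict(clusters)
```

With a trained model this would be called as cluster_vocabulary(model.wv, n_clusters=50), where 50 is an arbitrary illustrative value.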
The pairwise distances aren't pre-calculated, so you'd have to compute them yourself. With typical vocabulary sizes, the result may be impractically large. (For example, with a 100,000-word vocabulary, storing all pairwise distances in the most efficient way possible would require roughly 100,000^2 * 4 bytes/float / 2 = 20GB of addressable space.)
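The arithmetic behind that estimate can be checked with a few lines of Python. The function names here are hypothetical, and the figures assume 4-byte floats and storage of each unordered pair only once.

```python
def approx_pairwise_bytes(n_words, bytes_per_float=4):
    """Approximation used above: n^2 entries, halved because d(a, b) == d(b, a)."""
    return n_words * n_words * bytes_per_float // 2


def exact_pairwise_bytes(n_words, bytes_per_float=4):
    """Exact count of unordered pairs, n * (n - 1) / 2, excluding self-distances."""
    return n_words * (n_words - 1) // 2 * bytes_per_float


if __name__ == "__main__":
    n = 100_000
    print(approx_pairwise_bytes(n) / 1e9, "GB (approximate)")
    print(exact_pairwise_bytes(n) / 1e9, "GB (exact)")
```

The two figures differ by only a fraction of a percent at this vocabulary size, so the rounded 20GB estimate holds.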
Upvotes: 2