BuzzyBee

Reputation: 1

Getting a distance matrix and a feature matrix from a word2vec model

I have generated a word2vec model using gensim for a huge corpus, and I need to cluster the vocabulary using k-means clustering. For that I need:

  1. a cosine distance matrix (word to word, so the size of the matrix is number_of_words x number_of_words)
  2. a feature matrix (word to features, so the size of the matrix is number_of_words x number_of_features (200))

For the feature matrix I tried x = model.wv and got an object of type gensim.models.keyedvectors.KeyedVectors, and it's much smaller than I expected a feature matrix to be.

Is there a way to use this object directly for k-means clustering?

Upvotes: 0

Views: 1472

Answers (1)

gojomo

Reputation: 54163

In gensim's Word2Vec model, the raw number_of_words x number_of_features numpy array of word vectors is in model.wv.vectors. (In older Gensim versions, the .vectors property was named .syn0, matching the original Google word2vec.c naming.)

You can use the model.wv.key_to_index dict (previously .vocab) to learn the string-token-to-array-slot assignment, or the model.wv.index_to_key list (previously .index2word) to learn the array-slot-to-word assignment.
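As a rough sketch of how those two lookups fit together: the helper names vector_for and word_for below are made up for illustration, and toy_kv is only a stand-in object so the snippet runs without a trained model. With a real model you would pass model.wv instead, assuming a Gensim version that exposes .vectors, .key_to_index and .index_to_key.

```python
from types import SimpleNamespace


def vector_for(kv, word):
    """Return the row of kv.vectors that belongs to `word` (hypothetical helper)."""
    return kv.vectors[kv.key_to_index[word]]


def word_for(kv, row):
    """Return the word stored at array slot `row` (hypothetical helper)."""
    return kv.index_to_key[row]


# Stand-in for model.wv: 3 words x 2 features, made-up values.
# Assumption: the real object has the same three attributes.
toy_kv = SimpleNamespace(
    vectors=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index_to_key=["cat", "dog", "fish"],
    key_to_index={"cat": 0, "dog": 1, "fish": 2},
)

print(vector_for(toy_kv, "dog"))  # [0.3, 0.4]
print(word_for(toy_kv, 2))        # fish
```

Row i of the array always corresponds to index_to_key[i], so any per-row result computed on the array (a cluster label, for instance) can be mapped back to its word the same way.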

The pairwise distances aren't pre-calculated, so you'd have to compute them yourself. And with typical vocabulary sizes, the result may be impractically large. (For example, with a 100,000-word vocabulary, storing all pairwise distances in the most efficient way possible would require roughly 100,000^2 * 4 bytes/float / 2 = 20GB of addressable space.)
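To make that arithmetic concrete, here is a small sketch in plain Python. The function names pairwise_distance_bytes and cosine_distance are illustrative, not part of Gensim, and the estimate uses the same rough "half of the square matrix" count as above. The second function shows one way to compute a single cosine distance on demand rather than storing the whole matrix.

```python
import math


def pairwise_distance_bytes(vocab_size, bytes_per_float=4):
    """Rough storage for one distance per unordered word pair (about n^2 / 2 entries)."""
    entries = (vocab_size * vocab_size) // 2
    return entries * bytes_per_float


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), for two non-zero vectors given as sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(pairwise_distance_bytes(100_000) / 1e9)    # 20.0 (GB)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 (orthogonal vectors)
```

Computing distances only for the pairs you actually need keeps memory proportional to the feature matrix itself, at the cost of recomputing them each time.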

Upvotes: 2
