Kiran Baktha

Reputation: 667

Word2vec model query

I trained a word2vec model on my dataset using the word2vec gensim package. My dataset has 131,681 unique words, but the model outputs a vector matrix of shape (47629, 100). So only 47,629 words have vectors associated with them. What about the rest? Why am I not able to get a 100-dimensional vector for every unique word?

Upvotes: 0

Views: 329

Answers (1)

gojomo

Reputation: 54153

The gensim Word2Vec class uses a default min_count of 5, meaning any words appearing fewer than 5 times in your corpus will be ignored. If you enable INFO level logging, you should see logged messages about this and other steps taken by the training.
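A minimal sketch of how to see this in action, assuming gensim 4.x (where the dimensionality parameter is `vector_size`; 3.x called it `size`) and a placeholder corpus of tokenized sentences:

```python
import logging
from gensim.models import Word2Vec

# INFO-level logging makes gensim report vocabulary-building steps,
# including how many words were kept or dropped by min_count.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Placeholder corpus: any iterable of tokenized sentences (lists of strings).
sentences = [
    ["this", "is", "a", "tokenized", "sentence"],
    ["another", "tokenized", "sentence"],
]

# min_count=5 is the default, so words seen fewer than 5 times get no vector.
model = Word2Vec(sentences, vector_size=100, min_count=5)

print(model.wv.vectors.shape)  # (number_of_retained_words, 100)

# Lowering min_count retains rarer words, at the cost of noisier vectors:
# model = Word2Vec(sentences, vector_size=100, min_count=1)
```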

Note that it's hard to learn meaningful vectors with few (or non-varied) usage examples. So while you could lower the min_count to 1, you shouldn't expect those vectors to be very good – and even trying to train them may worsen your other vectors. (Low-frequency words can be essentially noise that interferes with the training of other word-vectors, where those more-frequent words do have sufficiently numerous and varied examples to be modeled well.)

Upvotes: 1
