etang

Reputation: 538

Minimum Number of Words for Each Sentence for Training Gensim Word2vec Model

Suppose I have a corpus of short sentences whose lengths range from 1 to around 500 words, with an average of around 9 words. If I train a Gensim Word2vec model using window=5 (which is the default), should I use all of the sentences, or should I remove sentences with a low word count? If so, is there a rule of thumb for the minimum number of words?

Upvotes: 0

Views: 271

Answers (1)

gojomo

Reputation: 54213

Texts with only 1 word are essentially 'empty' to the word2vec algorithm: there are no neighboring words, which are necessary for all training modes. You could drop them, but there's little harm in leaving them in, either. They're essentially just no-ops.

Any text with 2 or more words can contribute to the training.
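
To illustrate why a 1-word text is a no-op, here is a minimal pure-Python sketch that counts the (center word, context word) pairs each text can supply. It is not Gensim's actual code: `context_pairs` and the toy `corpus` are hypothetical names for illustration, and the sketch assumes a fixed window and ignores other effects such as frequent-word downsampling and `min_count` trimming.

```python
def context_pairs(tokens, window=5):
    """Return every (center, context) pair within `window` positions.

    Simplified illustration only; a fixed window is assumed here.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy corpus: texts of 1, 2, and 4 words.
corpus = [
    ["hello"],
    ["good", "morning"],
    ["the", "quick", "brown", "fox"],
]

pair_counts = [len(context_pairs(text)) for text in corpus]
print(pair_counts)  # [0, 2, 12]
```

The 1-word text yields no pairs at all, so it gives the model nothing to learn from, while even a 2-word text yields two.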

Upvotes: 1
