Reputation: 1813
I'm trying to train a gensim Word2Vec model with bigrams. To get the bigrams, I run the following code, where sentences is a long list of sentences split with nltk.sent_tokenize, lemmatized with spaCy, and then lowercased:
from gensim.models import Word2Vec, Phrases
bigrams = Phrases(sentences, min_count=20, threshold=10)
As I understand it, this should only include bigrams which occur at least 20 times. But when I run bigrams.vocab, I get:
defaultdict(int,
            {b'inflated': 237,
             b'the_inflated': 34,
             b'inflated_bag': 1,
             b'let': 6841,
             b'bag_let': 1,
             b'let_-pron-': 3723,
             ...})
From what I understand, inflated_bag and let_-pron- should not be present. Is there something I'm doing wrong? Or am I misinterpreting the output?
Upvotes: 0
Views: 382
Reputation: 54223
In the gensim Phrases source code, min_count is an adjustable input to the scoring formula that decides which bigrams should be combined. It isn't a strict cutoff (like the parameter of the same name in Word2Vec and related classes) below which any unigrams/bigrams are ignored or outright discarded from the survey counts; the vocab you're seeing is that raw survey, with every observed unigram and bigram counted. (The doc-comments in gensim's phrases.py are somewhat misleading in this regard.)
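For illustration, here is a minimal sketch of how that formula works, modeled on gensim's default original_scorer. The count for 'bag' and the vocabulary size are hypothetical stand-ins, since they don't appear in your dump:

def score_bigram(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    # min_count is subtracted inside the formula, not applied as a pre-filter,
    # so rare pairs score low (even negative) instead of vanishing from vocab
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

score = score_bigram(
    worda_count=237,    # b'inflated', from your vocab dump
    wordb_count=50,     # b'bag' -- hypothetical count
    bigram_count=1,     # b'inflated_bag', from your vocab dump
    len_vocab=100000,   # hypothetical vocabulary size
    min_count=20,
)
print(score)  # roughly -160, far below threshold=10, so never combined

So inflated_bag remains in bigrams.vocab as a raw count, but when you actually transform text with bigrams[sentence], only pairs whose score clears threshold get joined into a single token.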
Upvotes: 1