snapcrack

Reputation: 1813

gensim Phrases not observing min_count parameter

I'm trying to train a gensim Word2Vec model with bigrams. To get the bigrams, I run the following code, where sentences is a long list of sentences split with nltk.sent_tokenize, lemmatized with spaCy, and then lowercased:

from gensim.models import Word2Vec, Phrases

bigrams = Phrases(sentences, min_count=20, threshold=10)

This should include only bigrams that occur at least 20 times. But when I inspect bigrams.vocab, I get:

defaultdict(int,
            {b'inflated': 237,
             b'the_inflated': 34,
             b'inflated_bag': 1,
             b'let': 6841,
             b'bag_let': 1,
             b'let_-pron-': 3723,
             ...})

From what I understand, inflated_bag and let_-pron- should not be present. Is there something I'm doing wrong? Or am I misinterpreting the output?

Upvotes: 0

Views: 382

Answers (1)

gojomo

Reputation: 54223

In the gensim Phrases source code, min_count is an adjustable input to the formula for deciding which bigrams should be combined.

It isn't a strict cutoff (like the parameter of the same name in Word2Vec & related classes), below which any unigrams/bigrams are ignored or outright discarded from the survey counts.

(The doc-comments in gensim's phrases.py are somewhat misleading in this regard.)
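To make that concrete, here is a minimal sketch of the default (Mikolov-style) score as I understand it from the gensim documentation: (bigram_count - min_count) / worda_count / wordb_count * len_vocab. The function name original_score, the vocabulary size, and the count for 'bag' are assumptions made up for illustration; only the 237 and the 1 come from the question. I believe the real function is gensim.models.phrases.original_scorer, which takes an extra corpus_word_count argument.

```python
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Mikolov-style bigram score: min_count is subtracted from the
    bigram count, it is not used to filter entries out of the vocab."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


MIN_COUNT = 20
THRESHOLD = 10
LEN_VOCAB = 50_000   # hypothetical vocabulary size
BAG_COUNT = 150      # hypothetical count for the unigram 'bag'

# 'inflated' occurs 237 times and 'inflated_bag' once (from the question).
rare = original_score(237, BAG_COUNT, 1, LEN_VOCAB, MIN_COUNT)

# The same pair, if the bigram had instead been seen 60 times (hypothetical).
frequent = original_score(237, BAG_COUNT, 60, LEN_VOCAB, MIN_COUNT)

print(rare < 0)              # a bigram seen fewer than min_count times scores negative
print(rare > THRESHOLD)      # so it is never promoted to a phrase
print(frequent > THRESHOLD)  # a sufficiently frequent bigram can pass
```

So inflated_bag staying in bigrams.vocab is expected: vocab holds the raw survey counts, and min_count only matters when the score is compared against threshold. To see which bigrams are actually combined, apply the model to a tokenized sentence (bigrams[sentence]) rather than reading vocab.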

Upvotes: 1
