Reputation: 1813
I'm trying to train a gensim Word2Vec model with bigrams. To get the bigrams, I run the following code, where sentences is a long list of sentences split with nltk.sent_tokenize, lemmatized with spaCy, and then lowercased:
from gensim.models import Word2Vec, Phrases
bigrams = Phrases(sentences, min_count=20, threshold=10)
As I understand it, this should only include bigrams which occur at least 20 times. But when I run bigrams.vocab, I get:
defaultdict(int,
            {b'inflated': 237,
             b'the_inflated': 34,
             b'inflated_bag': 1,
             b'let': 6841,
             b'bag_let': 1,
             b'let_-pron-': 3723,
             ...})
From what I understand, inflated_bag and let_-pron- should not be present. Is there something I'm doing wrong? Or am I misinterpreting the output?
Upvotes: 0
Views: 382
Reputation: 54223
In the gensim Phrases source code, min_count is an adjustable input to the scoring formula that decides which bigrams should be combined. It isn't a strict cutoff (like the parameter of the same name in Word2Vec and related classes) below which any unigrams/bigrams are ignored or outright discarded from the survey counts; the vocab you're seeing is that raw survey, with every observed unigram and bigram counted. (The doc-comments in gensim's phrases.py are somewhat misleading in this regard.)
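For illustration, here is a minimal sketch of how that formula works, modeled on gensim's default original_scorer. The count for 'bag' and the vocabulary size are hypothetical stand-ins, since they don't appear in your dump:

def score_bigram(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    # min_count is subtracted inside the formula, not applied as a pre-filter,
    # so rare pairs score low (even negative) instead of vanishing from vocab
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

score = score_bigram(
    worda_count=237,    # b'inflated', from your vocab dump
    wordb_count=50,     # b'bag' -- hypothetical count
    bigram_count=1,     # b'inflated_bag', from your vocab dump
    len_vocab=100000,   # hypothetical vocabulary size
    min_count=20,
)
print(score)  # roughly -160, far below threshold=10, so never combined

So inflated_bag remains in bigrams.vocab as a raw count, but when you actually transform text with bigrams[sentence], only pairs whose score clears threshold get joined into a single token.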
Upvotes: 1