Reputation: 3
import pandas as pd
from gensim.models import Word2Vec

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

def prep_corpus():
    # Collect one tokenized title per row, from both dataframes
    sentences = []
    for x in test['title']:
        sentences.append(x.strip().split())
    for x in train['title']:
        sentences.append(x.strip().split())
    return sentences

corpus = prep_corpus()
The corpus is a list of sentences, where each sentence is a list of words:
word_model = Word2Vec(corpus, workers=2, sg=1, iter=5)
I have a word vector that seems to work:

word_model['maybelline', 'clear'].shape
However, when I try to do word_model['intensity'], I get the error message: "word 'intensity' not in vocabulary".
This is despite the fact that the word 'intensity' is in the corpus list; it appears once, in the test set.
I checked the corpus list by iterating through it and found the index of the sentence containing 'intensity'. I also checked the dataframe and confirmed the word is there.
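For reference, this is the kind of check I did (a minimal sketch, using the corpus variable built by prep_corpus() above):

for i, sentence in enumerate(corpus):
    # Print the index of any sentence that contains the missing word
    if 'intensity' in sentence:
        print(i, sentence)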
There are also some words that are in the corpus list, but not in the word2vec vocab.
I tried both CBOW and skip-gram, and different epoch counts of 1, 5, and 15.
In all scenarios I still encounter this error. How do I solve this problem?
Upvotes: 0
Views: 952
Reputation: 54173
It's likely you're using the gensim Word2Vec implementation.
That implementation, like the original word2vec.c code, enforces a default min_count of 5: words with fewer than 5 usage examples are ignored. In general, this greatly improves the quality of the remaining word-vectors.
(Words with just one, or a few, usage examples don't get strong word-vectors themselves: there is insufficient variety to reflect their real meanings in the larger language, and their few examples influence the model far less than words with more examples. But since there tend to be many, many such rare words, in total they wind up diluting and interfering with the learning the model could do on the words with plentiful examples.)
You can set min_count=1 to retain such words, but compared to discarding those rare words, keeping them usually makes the surviving word-vectors worse.
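For example, here is a minimal sketch, assuming gensim 3.x (matching the iter parameter used in the question; in gensim 4+, iter was renamed to epochs, and the vocabulary is exposed as model.wv.key_to_index instead of model.wv.vocab):

from gensim.models import Word2Vec

# min_count=1 keeps every word, including those seen only once,
# usually at some cost to the quality of the other word-vectors.
word_model = Word2Vec(corpus, workers=2, sg=1, iter=5, min_count=1)

# Verify that the rare word survived the vocabulary build (gensim 3.x API)
print('intensity' in word_model.wv.vocab)  # True
print(word_model.wv['intensity'].shape)    # (100,) with the default vector size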
Upvotes: 2