seanong

Reputation: 3

Word2Vec: Word not in vocabulary even though its in corpus

import pandas as pd

# Load both datasets; the 'title' column holds the text
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

def prep_corpus():
    # Tokenize every title on whitespace; each title becomes one sentence
    sentences = []
    for x in test['title']:
        sentences.append(x.strip().split())

    for x in train['title']:
        sentences.append(x.strip().split())

    return sentences

corpus = prep_corpus()

The corpus is a list of sentences, where each sentence is a list of words.

from gensim.models import Word2Vec

word_model = Word2Vec(corpus, workers=2, sg=1, iter=5)

word_model['maybelline', 'clear'].shape

This works and returns an array of the expected shape, so those words have vectors.

However, when I try word_model['intensity'], I get the error message: "word 'intensity' not in vocabulary".

This is despite the fact that the word 'intensity' is in the corpus list. It appears once, in the test set.

I checked the corpus list by iterating through it and found the index of the sentence containing 'intensity'.
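A minimal version of that check, assuming corpus is the list of token lists built above:

# Find every sentence that contains the token 'intensity'
hits = [i for i, sentence in enumerate(corpus) if 'intensity' in sentence]
print(hits)       # indices of the matching sentences
print(len(hits))  # how many sentences contain the token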

I also checked the dataframe and confirmed the word is there.

There are also other words that appear in the corpus list but are missing from the Word2Vec vocabulary.

I tried both CBOW and skip-gram, and epochs of 1, 5, and 15.

In all scenarios, I still encounter this error. How do I solve this problem?

Upvotes: 0

Views: 952

Answers (1)

gojomo

Reputation: 54173

It's likely you're using the gensim Word2Vec implementation.

That implementation, like the original word2vec.c code, enforces a default min_count of 5: words appearing fewer than 5 times are ignored. In general, this greatly improves the quality of the remaining word-vectors.
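You can confirm this is what's happening by counting token frequencies in your corpus. A quick sketch, assuming corpus is the list of token lists from the question:

from collections import Counter

# Count how often each token appears across all sentences
freq = Counter(word for sentence in corpus for word in sentence)

print(freq['intensity'])  # 1, which is below the default min_count of 5
rare = [w for w, c in freq.items() if c < 5]
print(len(rare))          # how many words the model will discard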

(Words with just one, or a few, usage examples don't get strong word-vectors themselves: there isn't enough variety to reflect their real meanings in the larger language, and their few examples influence the model far less than words with plentiful examples do. But since there tend to be many, many such rare words, in total they wind up diluting and interfering with the learning the model could do on the other, well-represented words.)

You can set min_count=1 to retain such words (see the sketch after this list), but compared to discarding those rare words:

  • the rare words' vectors will be poor
  • the rare words' presence will make the model much larger and noticeably slower to train
  • the vectors for other more-common words will be marginally worse
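If you do want every word retained, a minimal sketch, keeping the question's gensim-3.x style iter parameter (renamed epochs in gensim 4.x, where wv.vocab also becomes wv.key_to_index):

from gensim.models import Word2Vec

# min_count=1 keeps every token, even words seen only once
word_model = Word2Vec(corpus, workers=2, sg=1, iter=5, min_count=1)

print('intensity' in word_model.wv.vocab)  # now True
print(word_model.wv['intensity'].shape)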

Upvotes: 2
