Reputation: 91
I am totally new to Word2Vec. I want to find the cosine similarity between word pairs in my data. My code is as follows:
import pandas as pd
from gensim.models import Word2Vec
model = Word2Vec(corpus_file="corpus.txt", sg=0, window=7, size=100, min_count=10, iter=4)
vocabulary = list(model.wv.vocab)
data = pd.read_csv("experiment.csv")
cos_similarity = model.wv.similarity(data['word 1'], data['word 2'])
The problem is that some words in the "word 1" and "word 2" columns of my "experiment.csv" file are not present in the corpus file ("corpus.txt"), so this error is returned:
"word 'setosa' not in vocabulary"
What should I do to handle words that are not present in my input corpus? I want to assign the zero vector to words in my experiment that are not present in the input corpus, but I am stuck on how to do it.
Any ideas?
Upvotes: 0
Views: 1034
Reputation: 54173
It's really easy to give unknown words the origin (all 'zero') vector:
import numpy as np

word = data['word 1'].iloc[0]  # look up one word at a time, not the whole column
if word in model.wv:
    vec = model.wv[word]
else:
    vec = np.zeros(model.wv.vector_size)  # matches size=100 from training
But this is unlikely to be what you want. The zero vector can't be cosine-similarity compared to other vectors.
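To see why, here's a quick toy illustration (a random stand-in vector, not your model's actual vectors): the cosine formula divides by the zero vector's norm, which is 0.

import numpy as np

v = np.random.rand(100)   # stand-in for a real word-vector
zero = np.zeros(100)      # the proposed "unknown word" vector

# cosine similarity = dot(a, b) / (|a| * |b|); the zero vector's norm is 0,
# so the division is 0/0 and the result is nan (with a RuntimeWarning)
cos = np.dot(v, zero) / (np.linalg.norm(v) * np.linalg.norm(zero))
print(cos)  # nan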
It's often better to simply ignore unknown words. If they were so rare that your training data didn't have enough of them to create a vector, they can't contribute much to other analyses.
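For example, here's a minimal sketch of that approach against your own dataframe, assuming the model and column names from your question; pairs containing an unknown word just get a placeholder instead of a similarity:

import pandas as pd

data = pd.read_csv("experiment.csv")

similarities = []
for w1, w2 in zip(data['word 1'], data['word 2']):
    if w1 in model.wv and w2 in model.wv:
        similarities.append(model.wv.similarity(w1, w2))
    else:
        similarities.append(float('nan'))  # flag pairs with an unknown word

data['similarity'] = similarities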
If they're still important, the best approach is to get more data, with realistic usage contexts, so they get meaningful vectors.
Another alternative is to use an algorithm, such as the word2vec variant FastText, which can always synthesize a guess-vector for any word that was out-of-vocabulary (OOV) based on the training data. It does this by learning word-vectors for word-fragments (character n-grams), then assembling a vector for a new unknown word from those fragments. It's often better than random, because unknown words are often typos or variants of known words with which they share a lot of fragments. But it's still not great, and for really odd strings, it essentially returns a random vector.
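For instance, a minimal sketch using gensim's FastText class, reusing the same (gensim-3.x-style) parameters as your Word2Vec call; in gensim 4.x, size and iter are renamed vector_size and epochs:

from gensim.models import FastText

ft_model = FastText(corpus_file="corpus.txt", sg=0, window=7, size=100,
                    min_count=10, iter=4)

# Works even for words that never appeared (or were too rare) in corpus.txt,
# because the vector is assembled from character n-grams:
vec = ft_model.wv['setosa']
sim = ft_model.wv.similarity('setosa', 'versicolor')  # example word pair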
Another tactic I've seen used, but wouldn't personally recommend, is to replace a lot of the words that would otherwise be ignored – such as those with fewer than min_count occurrences – with some single plug token, like say '<OOV>'. Then that synthetic token becomes a quite-common word, but gets an almost entirely meaningless vector: a random, low-magnitude one. (The prevalence of this fake word & noise-vector in training will tend to make other surrounding words' vectors worse or slower-to-train, compared to simply eliding the low-frequency words.) But then, when dealing with later unknown words, you can use this same '<OOV>' pseudoword's vector as a not-too-harmful stand-in.
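If you did want to try that tactic anyway, a rough sketch (assuming a simple whitespace-tokenized corpus.txt and the same min_count of 10) might look like:

from collections import Counter
from gensim.models import Word2Vec

MIN_COUNT = 10

with open("corpus.txt") as f:
    sentences = [line.split() for line in f]

counts = Counter(tok for sent in sentences for tok in sent)

# Collapse every rare token into the single '<OOV>' plug token before training
oov_sentences = [[tok if counts[tok] >= MIN_COUNT else '<OOV>' for tok in sent]
                 for sent in sentences]

# min_count=1 because rare words have already been collapsed away
oov_model = Word2Vec(oov_sentences, sg=0, window=7, size=100, min_count=1, iter=4)

# Later, unknown query words fall back on the pseudoword's vector
def vec_for(word):
    return oov_model.wv[word] if word in oov_model.wv else oov_model.wv['<OOV>']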
But again: it's almost always better to do some combination of – (a) getting more data; (b) ignoring rare words; (c) using an algorithm like FastText which can synthesize better-than-nothing vectors – than to collapse all unknown words to a single nonsense vector.
Upvotes: 2