Reputation: 3783
I'm training a word embedding with Gensim (word2vec) and using the trained model in a neural network in Keras. A problem arises when an unknown (out-of-vocabulary) word appears: the network fails because it can't find weights for that word. I think one way to fix this is to add a new token (`<unk>`) to the pre-trained embedding with zero weights (or maybe random weights? which is better?). Is this approach fine? Note that the embedding weights are not trainable in this network.
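To make the idea concrete, here is a minimal sketch of what I mean, assuming Gensim 4.x and tf.keras; the model file name and variable names are placeholders, not from my actual code:

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

w2v = Word2Vec.load("word2vec.model")   # hypothetical pre-trained model
dim = w2v.wv.vector_size

# Index every known word; reserve the last row for <unk>.
word_index = {word: i for i, word in enumerate(w2v.wv.index_to_key)}
unk_index = len(word_index)

weights = np.zeros((unk_index + 1, dim), dtype=np.float32)
for word, i in word_index.items():
    weights[i] = w2v.wv[word]
# The <unk> row stays all-zero; for random init instead, use e.g.:
# weights[unk_index] = np.random.normal(0.0, 0.01, dim)

embedding = Embedding(
    input_dim=unk_index + 1,
    output_dim=dim,
    embeddings_initializer=Constant(weights),
    trainable=False,   # the embedding stays frozen, as described above
)
```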
Upvotes: 1
Views: 1335
Reputation: 54173
The most typical approach is to ignore unknown words. (Replacing them with either a plug word or the origin vector is more distorting.) A sketch of this option follows.
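A minimal sketch of the "ignore" option, assuming a `word_index` mapping from each in-vocabulary word to its row in the embedding matrix (the names here are illustrative):

```python
def tokens_to_indices(tokens, word_index):
    """Convert tokens to embedding indices, silently dropping OOV tokens."""
    return [word_index[t] for t in tokens if t in word_index]

# e.g. tokens_to_indices(["the", "frabjous", "fox"], word_index)
# returns indices for "the" and "fox" only.
```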
You could also consider training a FastText model instead, which will always synthesize a guess-vector for an out-of-vocabulary word from the character n-gram vectors created during training. (These synthetic vectors are often better than nothing, especially when a word shares word roots with related known words, but getting more training data with examples of all relevant word usages is better, and simply ignoring rare unknown words isn't that bad.)
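A minimal sketch of the FastText route, assuming Gensim 4.x; the toy corpus and hyperparameters are purely illustrative:

```python
from gensim.models import FastText

sentences = [["the", "quick", "brown", "fox"], ["lazy", "dogs", "sleep"]]
ft = FastText(sentences, vector_size=100, window=3, min_count=1, epochs=10)

# "foxes" never appeared in training, but FastText still returns a vector
# assembled from its character n-grams.
vec = ft.wv["foxes"]
print("foxes" in ft.wv.key_to_index)   # False: the vector is synthesized
```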
Upvotes: 1