Eghbal

Reputation: 3783

Unknown words in a trained word embedding (Gensim) for using in Keras

I'm training a word embedding with Gensim (word2vec) and using the trained model in a Keras neural network. A problem arises when an unknown (out-of-vocabulary) word appears: the network fails because it can't find weights for that word. I think one way to fix this is to add a new token (`<unk>`) to the pre-trained embedding with zero weights (or maybe random weights? which is better?). Is this approach fine? Note that the embedding weights are not trainable in this network.
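For context, here is a minimal sketch of the setup the question describes: a frozen Keras `Embedding` layer built from a Gensim word2vec model, with one extra all-zero row reserved for `<unk>`. The model path `"word2vec.model"` and the `word_to_index` helper are placeholders, and the Gensim 4.x attribute names (`wv.key_to_index`, `wv.vectors`) are assumed.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

w2v = Word2Vec.load("word2vec.model")      # hypothetical path
vocab_size = len(w2v.wv.key_to_index) + 1  # +1 row for <unk>
embedding_dim = w2v.vector_size

# Copy the trained vectors; the last row stays all-zero for <unk>.
weights = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
weights[:-1] = w2v.wv.vectors
unk_index = vocab_size - 1

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=Constant(weights),
    trainable=False,  # frozen, as in the question
)

def word_to_index(word):
    """Map a word to its embedding row, falling back to <unk>."""
    return w2v.wv.key_to_index.get(word, unk_index)
```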

Upvotes: 1

Views: 1335

Answers (1)

gojomo

Reputation: 54173

The most typical approach is to ignore unknown words. (Replacing them with either a plug word or the origin vector is more distorting.)
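A minimal sketch of the "ignore unknown words" approach: tokens missing from the trained vocabulary are simply dropped before indexing. The `encode` helper is illustrative, and `key_to_index` assumes a Gensim 4.x model.

```python
def encode(tokens, key_to_index):
    """Map tokens to embedding indices, silently skipping OOV tokens."""
    return [key_to_index[t] for t in tokens if t in key_to_index]

# e.g. encode(["the", "frobnicate", "cat"], w2v.wv.key_to_index)
# drops "frobnicate" if it was never seen during training.
```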

You could also consider training a FastText model instead, which will always synthesize some guess-vector for an out-of-vocabulary word from the character-n-gram vectors created during training. (These synthetic vectors are often better than nothing, especially when a word shares word roots with related words – but getting more training data with examples of all relevant word usages is better, and simply ignoring rare unknown words isn't that bad.)
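A minimal sketch of the FastText alternative in Gensim; `corpus` is a placeholder for your own iterable of tokenized sentences, and the hyperparameters are arbitrary examples.

```python
from gensim.models import FastText

model = FastText(sentences=corpus, vector_size=100, window=5, min_count=5)

# Unlike word2vec, FastText can synthesize a vector for a word it never
# saw, built from the character n-grams learned during training.
vec = model.wv["unseenword"]  # works even if "unseenword" is OOV
```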

Upvotes: 1
