Reputation: 1
I am using a Word2vec model to extract similar words, but I want to know whether it can return similar words when the input word was never seen during training.
For example, I have a model trained on a corpus [melon, vehicle, giraffe, apple, frog, banana]. "orange" is an unseen word in this corpus, but when I give it as input, I want [melon, apple, banana] as the result.
Is this possible?
Upvotes: 0
Views: 645
Reputation: 54173
The original word2vec algorithm can offer nothing for words that weren't in its training data.
Facebook's 'FastText' descendant of the word2vec algorithm can offer better-than-random vectors for unseen words – but it builds such vectors from word fragments (character n-gram vectors), so it does best where shared word roots exist, or where the out-of-vocabulary word is just a typo of a trained word.
That is, it won't help in your example if no other words morphologically similar to 'orange' (like 'orangey', 'orangeade', 'orangish', etc) were present in training.
The only way to learn or guess a vector for 'orange' is to have some training examples with it or related words. (If all else failed, you could scrape some examples from other large corpora or the web to mix with your other training data.)
Upvotes: 2