Reputation: 7
I have a word2vec model and I use it to embed every word in my train and test sets. But some of the proper words are not contained in the word2vec model. Can I use a random vector as the embedding for all such proper words? If so, please give me some tips and some paper references. Thank you.
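For example, something like this sketch is what I have in mind (assuming gensim; the toy corpus, sizes, and the word "Nguyen" are just placeholders):

```python
import numpy as np
from gensim.models import Word2Vec

# toy corpus standing in for my real training data
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = Word2Vec(sentences, vector_size=100, min_count=1)

rng = np.random.default_rng(0)
oov_vectors = {}  # cache so each unknown word keeps one fixed random vector

def embed(word):
    # use the trained vector if the word is in the vocabulary...
    if word in model.wv:
        return model.wv[word]
    # ...otherwise fall back to a (cached) random vector
    if word not in oov_vectors:
        oov_vectors[word] = rng.normal(scale=0.1, size=model.vector_size)
    return oov_vectors[word]

print(embed("cat").shape)     # trained vector
print(embed("Nguyen").shape)  # random fallback for an unseen word
```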
Upvotes: 0
Views: 1757
Reputation: 54173
It's not clear what you're asking; in particular what do you mean by "proper words"?
But if, after training, words that you expect to be in the model aren't there, that is usually caused by either:
(1) Problems with how you preprocessed/tokenized your corpus, so that the words you thought were provided were not. So double check what data you're passing to training.
(2) A mismatch of parameters and expectations. For example, if performing training with a `min_count` of 5 (the default in some word2vec libraries), any words occurring fewer than 5 times will be ignored, and thus not receive word-vectors. (This is usually a good thing for overall word-vector quality, as low-frequency words can't get good word-vectors for themselves, yet by being interleaved with other words can still mildly interfere with those other words' training.)
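As a minimal sketch of that effect (assuming gensim, whose default `min_count` is 5; the toy corpus is a stand-in):

```python
from gensim.models import Word2Vec

# "rare" appears only once; "common" and "words" appear 10 times
sentences = [["common", "words", "rare"]] + [["common", "words"]] * 9

model = Word2Vec(sentences, vector_size=50, min_count=5)

print("common" in model.wv)  # True  -- survives min_count
print("rare" in model.wv)    # False -- fewer than 5 occurrences, so no vector
```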
Usually double-checking inputs, enabling logging and watching for any suspicious indicators of problems, and carefully examining the post-training model for what it does contain can help deduce what went wrong.
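In gensim, for example, that checking might look something like this (a sketch; the corpus is again a placeholder):

```python
import logging
from gensim.models import Word2Vec

# INFO logging surfaces vocabulary-building and training progress
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

sentences = [["common", "words"]] * 10  # placeholder for the real corpus
model = Word2Vec(sentences, vector_size=100, min_count=5)

# examine what the trained model actually contains
print(len(model.wv))                            # vocabulary size
print(list(model.wv.key_to_index))              # which words survived
print(model.wv.get_vecattr("common", "count"))  # raw corpus frequency
```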
Upvotes: 1