Reputation: 763
I am trying to use the CoNLL-2003 NER (English) dataset and want to utilize pretrained embeddings for it; specifically, I am using the SENNA pretrained embeddings. I have around 20k words in my vocabulary, and of these I have embeddings available for only about 9.5k words.
My current approach is to initialize a 20k x embedding_size array with zeros, set the rows for the 9.5k words whose embeddings are known to me, and make all the embeddings learnable.
My question is: what is the best way to do this? Any references to relevant research would be very helpful.
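For concreteness, here is a minimal sketch of what I currently do (the toy vocab and the randomly generated senna dict stand in for the real CoNLL-2003 vocabulary and the loaded SENNA vectors; the last line assumes TensorFlow, but any framework with trainable embeddings would do):

```python
import numpy as np
import tensorflow as tf

EMB_DIM = 50  # dimensionality of the SENNA vectors

# Toy stand-ins: in practice vocab has ~20k words and senna covers ~9.5k of them.
vocab = ["eu", "rejects", "german", "call", "somerareword"]
rng = np.random.default_rng(0)
senna = {w: rng.normal(size=EMB_DIM).astype(np.float32) for w in ["eu", "german", "call"]}

# Zero-initialize the full vocab_size x embedding_size matrix,
# then copy in the rows whose embeddings are known.
embedding_matrix = np.zeros((len(vocab), EMB_DIM), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in senna:
        embedding_matrix[i] = senna[word]

# Make every row learnable so the zero-initialized rows can be trained too.
embeddings = tf.Variable(embedding_matrix, trainable=True, name="embeddings")
```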
Upvotes: 2
Views: 4932
Reputation: 7667
Good suggestions, which will probably do for most applications. If you want to get fancy and use something state-of-the-art, you can train a model to predict embeddings for unknown words. Take a look at this recent EMNLP 2017 paper: https://arxiv.org/pdf/1707.06961.pdf
TL;DR: given a set of known word embeddings, the idea is to train a character-level BiLSTM which attempts to predict the embeddings given solely the characters of the word. Then this net can generalize to predict embeddings for unknown words. Ideally the net captures some morphological information, e.g. the predicted embedding for apples will be close to apple, and the evaluations in the paper seem to support this hypothesis.
There's a GitHub repository with pretrained models here: https://github.com/yuvalpinter/mimick
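To make the idea concrete, here is a rough Keras sketch of that setup; this is not the exact Mimick architecture or training regime, just an illustration of regressing from a word's characters to its pretrained vector (the toy known dictionary, the layer sizes, and the word applesauce are made up):

```python
import numpy as np
import tensorflow as tf

EMB_DIM = 50
chars = "abcdefghijklmnopqrstuvwxyz"
char2id = {c: i + 1 for i, c in enumerate(chars)}  # 0 is reserved for padding
MAX_LEN = 20

def encode(word):
    """Map a word to a fixed-length sequence of character ids."""
    ids = [char2id.get(c, 0) for c in word.lower()[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))

# Character-level BiLSTM that regresses to the word embedding.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(char2id) + 1, 32, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(EMB_DIM),
])
model.compile(optimizer="adam", loss="mse")

# Train on the words whose embeddings are known (random stand-ins here).
rng = np.random.default_rng(0)
known = {"apple": rng.normal(size=EMB_DIM), "apples": rng.normal(size=EMB_DIM)}
x = np.array([encode(w) for w in known])
y = np.array(list(known.values()), dtype=np.float32)
model.fit(x, y, epochs=1, verbose=0)

# Predict an embedding for an OOV word from its characters alone.
oov_vec = model.predict(np.array([encode("applesauce")]), verbose=0)[0]
```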
Upvotes: 3
Reputation: 1582
I would suggest three ways to tackle this problem, each with different strengths:

1. Surface-form similarity: for an OOV word such as apple, choose the closest (according to Levenshtein distance) word that you do have an embedding for, e.g., apples. In my experience, this can work remarkably well (see the sketch after this list).
2. Semantic similarity: use synonyms obtained from resources like WordNet.
3. Co-occurrence: average the embeddings of the words that the OOV word frequently co-occurs with.
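A minimal sketch of the surface-form fallback from item 1, assuming a plain word-to-vector dict; the function names and toy embeddings are illustrative, and a library such as python-Levenshtein could replace the hand-rolled distance:

```python
import numpy as np

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def oov_embedding(word, embeddings):
    """Borrow the vector of the closest in-vocabulary surface form."""
    nearest = min(embeddings, key=lambda w: levenshtein(word, w))
    return embeddings[nearest]

# Toy example: "apple" is OOV, so it gets the vector of "apples".
rng = np.random.default_rng(0)
emb = {"apples": rng.normal(size=50), "banana": rng.normal(size=50)}
vec = oov_embedding("apple", emb)
```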
Upvotes: 4
Reputation: 53758
Your approach sounds good, provided you can train meaningful embeddings for these out-of-vocabulary words, which might be tricky because they are rare. If you can't, their embeddings won't be much better than random ones.
In practice, all out-of-vocabulary words are often converted to <UNK> and all get a plain zero embedding. In this case you don't need to store all these zeros in the embedding matrix; instead, do a smart lookup that selects the embedding vector if the index is in the vocabulary, or zeros otherwise. If you're using TensorFlow, that's exactly what tf.nn.embedding_lookup does. The embedding matrix would then be smaller (10k x embedding_size) and the training would be faster.
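A minimal sketch of that lookup, assuming TensorFlow 2.x eager mode (the toy words and the word2id / lookup names are just illustrative):

```python
import numpy as np
import tensorflow as tf

EMB_DIM = 50
in_vocab = ["the", "german", "call"]      # words that have pretrained vectors
word2id = {w: i for i, w in enumerate(in_vocab)}
UNK_ID = len(in_vocab)                    # one shared index for every OOV word

# Pretrained rows (random stand-ins here) plus a single all-zero row for <UNK>.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(len(in_vocab), EMB_DIM)).astype(np.float32)
matrix = np.vstack([pretrained, np.zeros((1, EMB_DIM), dtype=np.float32)])
embeddings = tf.Variable(matrix, name="embeddings")

def lookup(words):
    """Map words to ids (OOV -> UNK_ID) and gather their embedding rows."""
    ids = [word2id.get(w, UNK_ID) for w in words]
    return tf.nn.embedding_lookup(embeddings, ids)

vecs = lookup(["the", "rejects", "call"])  # "rejects" is OOV and gets the zero row
```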
I'm not sure there's a lot of research on OOV words, but for reference I can mention Google's Neural Machine Translation system:
Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. For this embedding layer to work, a vocabulary is first chosen for each language. Usually, a vocabulary size V is selected, and only the most frequent V words are treated as unique. All other words are converted to an "unknown" token and all get the same embedding. The embedding weights, one set per language, are usually learned during training.
Upvotes: 1