lordzuko

Reputation: 763

How to initialize word embeddings for out-of-vocabulary words?

I am working with the CoNLL-2003 NER (English) dataset and want to use pretrained embeddings for it, specifically the SENNA embeddings. My vocabulary contains around 20k words, but pretrained embeddings are available for only about 9.5k of them.
My current approach is to initialize a 20k x embedding_size matrix with zeros, fill in the rows for the 9.5k words whose embeddings are known, and make all the embeddings learnable.
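
Concretely, my current setup looks roughly like the following sketch (the vocabulary and SENNA vectors here are tiny placeholders for the real data):

    import numpy as np

    embedding_size = 50
    word_to_id = {"<unk>": 0, "the": 1, "apple": 2, "zyzzyva": 3}      # ~20k entries in practice
    senna_vectors = {"the": np.random.randn(embedding_size),           # ~9.5k entries in practice
                     "apple": np.random.randn(embedding_size)}

    embedding_matrix = np.zeros((len(word_to_id), embedding_size), dtype="float32")
    for word, idx in word_to_id.items():
        if word in senna_vectors:                  # known word: copy its pretrained vector
            embedding_matrix[idx] = senna_vectors[word]
    # Words without a SENNA vector keep the zero row; the whole matrix is then
    # used to initialize a trainable embedding layer.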

My question is: what is the best way to do this? Any reference to relevant research would be very helpful.

Upvotes: 2

Views: 4932

Answers (3)

jayelm

Reputation: 7667

Good suggestions, which will probably do for most applications. If you want to get fancy and state-of-the-art, then you can train a model to predict unknown word embeddings. Take a look at this recent EMNLP 2017 paper: https://arxiv.org/pdf/1707.06961.pdf

TL;DR given a set of known word embeddings, the idea is to train a character-level BiLSTM which attempts to predict the embeddings given solely the characters of the word. Then this net can generalize to predict embeddings for unknown words. Ideally the net captures some morphological information, e.g. the predicted embedding for apples will be close to apple, and the evaluations in the paper seem to support this hypothesis.

There's a GitHub repository with pretrained models here: https://github.com/yuvalpinter/mimick
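
If you want to roll your own rather than use the pretrained models, the core idea fits in a few lines of Keras. This is only a rough sketch of the approach, not the paper's exact architecture; the character vocabulary size and embedding dimension below are placeholders:

    import tensorflow as tf

    char_vocab_size = 64   # assumption: characters mapped to ids 1..63, with 0 used for padding
    embedding_dim = 50     # dimensionality of the pretrained word embeddings

    mimick = tf.keras.Sequential([
        tf.keras.layers.Embedding(char_vocab_size, 32, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(embedding_dim),    # predicted word embedding
    ])
    mimick.compile(optimizer="adam", loss="mse")

    # Train on (padded char-id sequences of the known words, their pretrained vectors),
    # then call mimick.predict on an OOV word's characters to obtain an embedding:
    # mimick.fit(X_chars, Y_vectors, epochs=...)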

Upvotes: 3

geompalik

Reputation: 1582

I would suggest three ways to tackle this problem, each with different strengths:

  • Instead of using the SENNA embeddings, try FastText embeddings. The advantage is that FastText can infer embeddings for OOV words from character n-grams; for the exact methodology, check the associated paper. Gensim implements all the functionality needed (see the first sketch after this list). This greatly reduces the problem, and you can further fine-tune the induced embeddings as you describe. The inconvenience is that you have to switch from SENNA to FastText.
  • Try using morphological or semantic similarity to initialize the OOV words. By morphological similarity I mean using an edit distance such as Levenshtein to select an embedding: for an OOV word like apple, choose the closest word (according to Levenshtein distance) that you have an embedding for, e.g., apples. In my experience, this can work remarkably well. Semantic similarity, on the other hand, would mean using synonyms obtained from resources like WordNet, or averaging the embeddings of the words the OOV word frequently co-occurs with. A minimal sketch of the morphological variant follows this list.
  • After reducing the sparsity with the approaches described above, proceed with the zero or random initialization discussed in the other responses.
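
The FastText option is straightforward with Gensim. A minimal sketch, assuming Gensim 4.x parameter names and a toy training corpus (in practice you would train on a large corpus, or load Facebook's pretrained vectors with gensim.models.fasttext.load_facebook_vectors):

    from gensim.models import FastText

    # Toy corpus standing in for real training data.
    sentences = [["the", "apple", "fell"], ["apples", "are", "red"]]
    model = FastText(sentences, vector_size=50, min_n=3, max_n=5, min_count=1, epochs=10)

    # OOV lookup: the vector is composed from the word's character n-grams.
    vec = model.wv["applesauce"]

The morphological fallback can also be sketched in a few lines; the word list, vectors, and the max_dist threshold below are illustrative placeholders:

    import numpy as np

    def levenshtein(a, b):
        """Plain dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    # Toy pretrained vectors standing in for the real embedding table.
    pretrained = {"apples": np.array([0.1, 0.2]), "oranges": np.array([0.3, 0.4])}

    def init_oov(word, max_dist=2):
        """Copy the embedding of the closest known word, if one is close enough."""
        closest = min(pretrained, key=lambda w: levenshtein(word, w))
        if levenshtein(word, closest) <= max_dist:
            return pretrained[closest].copy()
        return np.zeros(2)  # otherwise fall back to zeros (or random)

    print(init_oov("apple"))   # -> the vector of "apples"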

Upvotes: 4

Maxim

Reputation: 53758

Your approach sounds good, as long as you can actually learn meaningful embeddings for these out-of-vocabulary words, which might be tricky because they are rare. If you can't, their embeddings won't be much better than random ones.

In practice, all out-of-vocabulary words are often mapped to a single <UNK> token that gets a plain zero embedding. In this case you don't need to store rows for all of them in the embedding matrix; instead, do a smart lookup that selects the embedding vector if the index is in vocabulary and zeros otherwise. If you're using TensorFlow, tf.nn.embedding_lookup handles the lookup part. The embedding matrix is then smaller (roughly 10k x embedding_size) and training is faster.
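
One common way to set this up is to reserve row 0 of the embedding matrix for <UNK> and map every OOV word to id 0 before the lookup. A sketch (shapes and data below are toy placeholders, not the real SENNA matrix):

    import numpy as np
    import tensorflow as tf

    d = 50
    known = np.random.randn(9500, d).astype("float32")      # stand-in for the SENNA rows

    # Row 0 is the shared <UNK> vector (all zeros); known words occupy ids 1..9500.
    table = tf.Variable(np.vstack([np.zeros((1, d), np.float32), known]))

    word_ids = tf.constant([3, 0, 42])                       # 0 = any out-of-vocabulary token
    vectors = tf.nn.embedding_lookup(table, word_ids)        # (3, d) batch of embeddings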

I'm not sure there's a lot of research specifically on OOV words, but for reference I can mention Google's Neural Machine Translation system:

Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. For this embedding layer to work, a vocabulary is first chosen for each language. Usually, a vocabulary size V is selected, and only the most frequent V words are treated as unique. All other words are converted to an "unknown" token and all get the same embedding. The embedding weights, one set per language, are usually learned during training.

Upvotes: 1
