Nico Sid

Reputation: 43

Pretrained (Word2Vec) embedding in Neural Networks

If I have to use pretrained word vectors as an embedding layer in a neural network (e.g. a CNN), how do I deal with index 0?

Detail:

We usually start by creating a zero-filled 2D NumPy array and later fill in its rows with the vectors of the words in our vocabulary. The problem is that 0 is already the index of a real word in the vocabulary (say 'i' is at index 0), so padding sentences with 0s effectively fills them with 'i' instead of with empty words. So how do we pad all the sentences to equal length?
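
To make it concrete, this is roughly the setup I mean (a toy vocabulary and random vectors stand in for the real pretrained embeddings):

import numpy as np

vocab = {"i": 0, "like": 1, "cnns": 2}             # note: 'i' happens to sit at index 0
dim = 4                                            # toy vector size
embedding_matrix = np.zeros((len(vocab), dim))     # start from an all-zero matrix
for word, idx in vocab.items():
    embedding_matrix[idx] = np.random.randn(dim)   # stand-in for the pretrained vector

# Padding a short sentence with 0s now makes it look like trailing 'i' tokens:
sentence = [vocab["i"], vocab["like"], vocab["cnns"]]
padded = sentence + [0] * (7 - len(sentence))      # -> [0, 1, 2, 0, 0, 0, 0]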

One easy fix that comes to mind is to pad with another index, numberOfWordsInVocab+1. But wouldn't that take more space? [Help me!]

Upvotes: 2

Views: 402

Answers (2)

mr_mo

Reputation: 1528

If I have to use pretrained word vectors as an embedding layer in a neural network (e.g. a CNN), how do I deal with index 0?

Answer

In general, empty entries can be handled via a weighted cost over the model outputs and the targets. However, when dealing with words and sequential data, things can be a little tricky, and there are several things to consider. Let's make some assumptions and work from there.

Assumptions

  1. We begin with a pre-trained word2vec model.
  2. We have sequences of varying length, with at most max_length words.

Details

  • Word2Vec is a model that learns a mapping (embedding) from discrete variables (word token = word unique id) to a continuous vector space.
  • The representation in the vector space is the one that minimizes the cost function on the corpus (CBOW predicts a word from its context, Skip-gram predicts the context words from a word).
  • Reading basic tutorials (like Google's word2vec tutorial in the TensorFlow tutorials) reveals some details of the algorithm, including negative sampling.
  • The implementation is a lookup table. It is faster than the alternative one-hot encoding technique, since a one-hot encoded matrix is huge (say 10,000 columns for 10,000 words, n rows for n sequential words). The lookup (hash) table simply selects rows from the embedding matrix (one row vector per word); see the sketch below.
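
To make the lookup point concrete, here is a small NumPy sketch (a random matrix stands in for the pretrained embeddings, and the sizes are made up): selecting rows gives the same result as multiplying by a one-hot matrix, without ever building the huge one-hot matrix.

import numpy as np

vocab_size, dim = 10000, 300
embedding = np.random.randn(vocab_size, dim)        # stand-in for the pretrained matrix

tokens = np.array([12, 7, 985])                     # word ids of a short sequence

# Lookup: just select rows of the embedding matrix.
looked_up = embedding[tokens]

# Equivalent one-hot route: build a (3 x 10000) matrix and multiply.
one_hot = np.zeros((len(tokens), vocab_size))
one_hot[np.arange(len(tokens)), tokens] = 1.0
via_matmul = one_hot @ embedding

assert np.allclose(looked_up, via_matmul)           # same result, lookup is far cheaper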

Task

  • Handle the missing entries (positions with no word) and use the embeddings in the model.

Suggestions

  • If there is some use for the cost of the missing entries, such as making a prediction from such an entry when there is a label for it, you can add a new index as you suggested. It can be index 0, but then every existing index must shift by one (i = i + 1) and the embedding matrix needs a new row at position 0; see the sketch after this list.
  • Following the first suggestion, you need to train the added row. You can use negative sampling for the NaN class vs. all. I do not suggest it for handling missing values, but it is a good trick for handling an "unknown word" class.
  • You can weight the cost of those entries by a constant 0 for each sample that is shorter than max_length. That is, if we have a sequence of word tokens [0,5,6,2,178,24,0,NaN,NaN], the corresponding weight vector is [1,1,1,1,1,1,1,0,0].
  • As for re-indexing the words and its cost: in memory there is almost no difference (1 vs. N words, where N is large), and in complexity it is something that can later be folded into the initial tokenization function. The model's predictions and complexity are the larger issue and the more important requirement of the system.
  • There are numerous ways to tackle varying lengths (LSTMs, RNNs, and now CNNs with cost tricks). Read the state-of-the-art literature on that issue; there is plenty of work. For example, see the paper A Convolutional Neural Network for Modelling Sentences.
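
A minimal NumPy sketch of the first and third suggestions combined (a toy vocabulary and random vectors stand in for a real word2vec model, and max_length = 9 is assumed):

import numpy as np

max_length = 9
vocab = ["i", "like", "deep", "learning"]             # toy vocabulary; 'i' used to be index 0
pretrained = np.random.randn(len(vocab), 300)         # stand-in for the word2vec vectors

# Reserve index 0 for padding: prepend a zero row and shift every word id by +1.
embedding_matrix = np.vstack([np.zeros((1, pretrained.shape[1])), pretrained])
word_index = {w: i + 1 for i, w in enumerate(vocab)}

def encode(sentence):
    ids = [word_index[w] for w in sentence.split()]
    padded = ids + [0] * (max_length - len(ids))               # pad with the reserved 0
    weights = [1] * len(ids) + [0] * (max_length - len(ids))   # zero cost on padded slots
    return np.array(padded), np.array(weights)

tokens, weights = encode("i like deep learning")
# tokens  -> [1 2 3 4 0 0 0 0 0]
# weights -> [1 1 1 1 0 0 0 0 0]
vectors = embedding_matrix[tokens]                     # row 0 (all zeros) stands for padding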

Upvotes: 0

abhinav dwivedi

Reputation: 198

One easy fix that comes to mind is to pad with another index, numberOfWordsInVocab+1. But wouldn't that take more space?

Nope! It's the same size. NumPy allocates memory by shape and dtype, not by the values stored; here both arrays use 8-byte elements (int64 for np.full with an integer fill, float64 for np.zeros), so each takes 5000 × 5000 × 8 = 200,000,000 bytes.

import numpy as np

a = np.full((5000, 5000), 7)    # every element is 7 (int64)
a.nbytes
200000000

b = np.zeros((5000, 5000))      # every element is 0.0 (float64)
b.nbytes
200000000


Upvotes: 1
