Reputation: 1146
I am taking the Stanford NLP course and I have an issue understanding a concept in skip-gram from the picture below.
From left to right, the first column vector is the one-hot encoding, the second is the word embedding matrix of a 1-layer neural network, and the third is the word representation vector. The fourth one, however, is a matrix with 'v by d' dimensions. I'm not sure if I heard it correctly, but the speaker said this is a representation of the context words and that these three matrices are identical?
My questions are: 1. Why are these three matrices identical while the three multiplication results are different? 2. How do we get this matrix (v by d dimensions)?
The link to the lecture is:
https://www.youtube.com/watch?v=ERibwqs9p38&t=1481s
Upvotes: 3
Views: 1790
Reputation: 4318
Before answering your questions I have to add a bit of background from the previous slides. First, the optimization is on the probability of one word co-occurring with another word: the center word and a context word. The vector representations could be shared between these two, but in practice we keep two collections of word vectors (two matrices, each a list of word vectors): 1. center word vectors (the first red matrix on the left) 2. context word vectors (the three red matrices in the middle).
The picture in this question shows how we estimate the probabilities with the multiplication of two kinds of vectors and the softmax normalization. Now the questions:
- How do we get this matrix (v by d dimension)?
As mentioned before, this can be the same matrix as the word vectors, but transposed. Or you can imagine that we learn two vectors for each word: 1. center 2. context.
The context word vectors are used in their transposed form in the calculations:
W (center word vectors, v): (d, V)
W' (outside word vectors, uT): (V, d)
with V being the size of the vocabulary and d the dimension of the vectors (these are the parameters we want to learn from the data).
Notice how dimensions change in each matrix multiplication:
W: (d,V)
x: (V,1)
v = W.x: (d,1)
W': (V,d)
W'.v: (V,1)
x is the one-hot encoding of the center word and W is the list of all word vectors. The W.x multiplication basically selects the right word vector out of this list. The final result is a list of all possible dot products between the context word vectors and the center word vector. The one-hot vector of the true observed context word selects the intended result. Then, based on the loss, updates are backpropagated through the computation flow, updating W and W'.
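To make this flow concrete, here is a minimal NumPy sketch of one forward pass, under assumed toy sizes (V = 6, d = 4) and random parameters; none of this comes from the lecture, it only mirrors the shapes and steps described above:
import numpy as np

V, d = 6, 4                                      # assumed toy sizes
W  = np.random.rand(d, V)                        # center word vectors, one column per word
Wp = np.random.rand(V, d)                        # context word vectors, W' above

x = np.zeros(V); x[2] = 1.0                      # one-hot encoding of the center word
y = np.zeros(V); y[4] = 1.0                      # one-hot of the true observed context word

v = W @ x                                        # W.x selects column 2 of W: the center word vector
scores = Wp @ v                                  # one dot product per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()    # softmax normalization
loss = -np.log(probs @ y)                        # y selects the probability of the true context word
# backpropagating this loss is what updates W and W'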
- Why are these three matrices identical but the three multiplication results different?
The square and the two rhombi in the middle all represent the same matrix. The three multiplications happen in three different observations. Although they use the same matrix, the parameters (W and W') change between observations through backpropagation. That is why the results differ across the three multiplications.
UPDATE FROM CHAT: However, your expectation is valid; the presentation could show exactly the same results in these multiplications, because the objective function is the sum of all co-occurrence probabilities within one window.
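As a rough illustration of that remark, here is a small sketch of a per-window objective in its usual negative-log-likelihood form; the helper name window_loss and the toy sizes are my own, not from the lecture:
import numpy as np

def window_loss(W, Wp, center_idx, context_indices):
    # W: (d, V) center word vectors, Wp: (V, d) context word vectors
    v = W[:, center_idx]                             # center word vector
    scores = Wp @ v                                  # one score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax
    # one term per observed context word in the window, all using the same W and W'
    return -sum(np.log(probs[o]) for o in context_indices)

V, d = 6, 4                                          # assumed toy sizes
loss = window_loss(np.random.rand(d, V), np.random.rand(V, d),
                   center_idx=2, context_indices=[1, 3])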
Upvotes: 2
Reputation: 4291
You don't multiply three times with the same matrix; you multiply only once and get an output vector the same size as the vocabulary. I will try to explain with an example.
Suppose your model has V (vocab_size) = 6, d = 4, C (number of context words) = 2, Wi (word embedding matrix) of size 6 x 4, and Wo (output word representation) of size 4 x 6.
A training example is x = [0,1,0,0,0,0] and y = [[0,0,0,1,0,0], [1,0,0,0,0,0]] (two one-hot encoded vectors, one for each context word).
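In NumPy, that setup looks roughly like the sketch below; Wi and Wo are random placeholders here, so only the shapes, not the numbers further down, are meant to match:
import numpy as np

V, d = 6, 4                               # vocab size and embedding size from the example
Wi = np.random.rand(V, d)                 # word embedding matrix, 6 x 4 (random placeholder)
Wo = np.random.rand(d, V)                 # output word representations, 4 x 6 (random placeholder)

x = np.array([0, 1, 0, 0, 0, 0])          # one-hot input word
y = np.array([[0, 0, 0, 1, 0, 0],         # one-hot context word 1
              [1, 0, 0, 0, 0, 0]])        # one-hot context word 2

h = x @ Wi                                # hidden layer: selects row 1 of Wi
z = h @ Wo                                # one raw score per vocabulary word
print(h.shape, z.shape)                   # (4,) (6,)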
Now, suppose that after feeding and processing the input (h = x*Wi; z = h*Wo), the output z you get is:
z = [0.01520237, 0.84253418, 0.4773877, 0.96858308, 0.09331018, 0.54090063]
# take the softmax and you get
sft_max_z = [0.0976363, 0.22331452, 0.15500148, 0.25331406, 0.1055682, 0.16516544]
# sft_max_z represents the probability of each word occurring as one of the input's context words.
# Now, subtract sft_max_z from each one-hot encoded vector in y to get the errors:
# errors = [[-0.0976363, -0.22331452, -0.15500148,  0.74668594, -0.1055682, -0.16516544],
#           [ 0.9023637, -0.22331452, -0.15500148, -0.25331406, -0.1055682, -0.16516544]]
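Starting from that z, here is a short runnable sketch that reproduces the softmax and the error step (same numbers as above):
import numpy as np

z = np.array([0.01520237, 0.84253418, 0.4773877, 0.96858308, 0.09331018, 0.54090063])
y = np.array([[0, 0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0]])

sft_max_z = np.exp(z) / np.exp(z).sum()   # softmax over the 6 vocabulary words
errors = y - sft_max_z                    # broadcasts: one error row per context word
print(sft_max_z)                          # ~[0.0976, 0.2233, 0.1550, 0.2533, 0.1056, 0.1652]
print(errors)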
Now, you can reduce the error and backpropagate for training. If you are predicting, select the two context words with the highest probabilities (indices 1 and 3 in this case).
Think of it as a classification problem with more than one class (multinomial classification) where the same object can belong to multiple classes at the same time (multilabel classification).
Upvotes: 1