Reputation: 1146
I am taking the Stanford NLP course and I have an issue understanding a concept in skip-gram from the picture below.
From left to right, the first column vector is the one-hot encoding, the second is the word embedding matrix of a 1-layer neural network, and the third is the word representation vector. The fourth one, however, is a matrix with 'v by d' dimensions. I'm not sure if I heard it correctly, but the speaker said this is a representation of the context words and that these three matrices are identical?
My questions are: 1. Why are these three matrices identical while the three multiplication results are different? 2. How do we get this matrix (v by d dimensions)?
The link to the lecture is:
https://www.youtube.com/watch?v=ERibwqs9p38&t=1481s
Upvotes: 3
Views: 1790
Reputation: 4318
Before answering your questions I have to add a bit of background from the previous slides. First, the optimization is on the probability of one word co-occurring with another word: the center word and a context word. The vector representations could be shared between these two, but in practice we keep two collections of word vectors (two matrices, each a list of word vectors): 1. center word vectors (the first red matrix on the left) 2. context word vectors (the three red matrices in the middle).
The picture in this question shows how we estimate the probabilities with the multiplication of two kinds of vectors and the softmax normalization. Now the questions:
- How do we get this matrix (v by d dimension)?
As mentioned before, this can be the same matrix as the word vectors, but transposed. Or you can imagine that we learn two vectors for each word: 1. center 2. context.
The context word vectors are used in their transposed form in the calculations:
W (center word vectors, v): (d, V)
W' (outside word vectors, uT): (V, d)
with V being the size of the vocabulary and d the dimension of the vectors (these are the parameters we want to learn from the data).
Notice how dimensions change in each matrix multiplication:
W: (d,V)
x: (V,1)
v = W.x: (d,1)
W': (V,d)
W'.v: (V,1)
x is the one-hot encoding of the center word and W is the list of all word vectors. The W.x multiplication basically selects the right word vector out of this list. The final result is a list of all possible dot products between the context word vectors and the center word vector. The one-hot vector of the true observed context word selects the intended result. Then, based on the loss, updates are backpropagated through the computation flow, updating W and W'.
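To make this flow concrete, here is a minimal NumPy sketch of one forward pass, under assumed toy sizes (V = 6, d = 4) and random parameters; none of this comes from the lecture, it only mirrors the shapes and steps described above:
import numpy as np

V, d = 6, 4                                      # assumed toy sizes
W  = np.random.rand(d, V)                        # center word vectors, one column per word
Wp = np.random.rand(V, d)                        # context word vectors, W' above

x = np.zeros(V); x[2] = 1.0                      # one-hot encoding of the center word
y = np.zeros(V); y[4] = 1.0                      # one-hot of the true observed context word

v = W @ x                                        # W.x selects column 2 of W: the center word vector
scores = Wp @ v                                  # one dot product per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()    # softmax normalization
loss = -np.log(probs @ y)                        # y selects the probability of the true context word
# backpropagating this loss is what updates W and W'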
- Why are these three matrices identical but the three multiplication results different?
The square and the two rhombi in the middle all represent the same matrix. The three multiplications happen in three different observations. Although they use the same matrix, the parameters (W and W') change between observations through backpropagation. That is why the results differ across the three multiplications.
UPDATE FROM CHAT: However, your expectation is valid; the presentation could show exactly the same results in these multiplications, because the objective function is the sum of all co-occurrence probabilities within one window.
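As a rough illustration of that remark, here is a small sketch of a per-window objective in its usual negative-log-likelihood form; the helper name window_loss and the toy sizes are my own, not from the lecture:
import numpy as np

def window_loss(W, Wp, center_idx, context_indices):
    # W: (d, V) center word vectors, Wp: (V, d) context word vectors
    v = W[:, center_idx]                             # center word vector
    scores = Wp @ v                                  # one score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax
    # one term per observed context word in the window, all using the same W and W'
    return -sum(np.log(probs[o]) for o in context_indices)

V, d = 6, 4                                          # assumed toy sizes
loss = window_loss(np.random.rand(d, V), np.random.rand(V, d),
                   center_idx=2, context_indices=[1, 3])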
Upvotes: 2
Reputation: 4291
You don't multiply three times with the same matrix; you multiply only once and get an output vector the same size as the vocabulary. I will try to explain with an example.
Suppose your model has V (vocab_size) = 6, d = 4, C (number of context words) = 2, Wi (word embedding matrix) of size 6 x 4, and Wo (output word representation) of size 4 x 6.
A training example is x = [0,1,0,0,0,0] and y = [[0,0,0,1,0,0], [1,0,0,0,0,0]] (two one-hot encoded vectors, one for each context word).
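In NumPy, that setup looks roughly like the sketch below; Wi and Wo are random placeholders here, so only the shapes, not the numbers further down, are meant to match:
import numpy as np

V, d = 6, 4                               # vocab size and embedding size from the example
Wi = np.random.rand(V, d)                 # word embedding matrix, 6 x 4 (random placeholder)
Wo = np.random.rand(d, V)                 # output word representations, 4 x 6 (random placeholder)

x = np.array([0, 1, 0, 0, 0, 0])          # one-hot input word
y = np.array([[0, 0, 0, 1, 0, 0],         # one-hot context word 1
              [1, 0, 0, 0, 0, 0]])        # one-hot context word 2

h = x @ Wi                                # hidden layer: selects row 1 of Wi
z = h @ Wo                                # one raw score per vocabulary word
print(h.shape, z.shape)                   # (4,) (6,)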
Now, suppose that after feeding and processing the input (h = x*Wi; z = h*Wo), the output z you get is:
z = [0.01520237, 0.84253418, 0.4773877, 0.96858308, 0.09331018, 0.54090063]
# take the softmax and you get
sft_max_z = [0.0976363, 0.22331452, 0.15500148, 0.25331406, 0.1055682, 0.16516544]
# sft_max_z represents the probability of each word occurring as one of the input's context words.
# Now, subtract sft_max_z from each one-hot encoded vector in y to get the errors:
# errors = [[-0.0976363, -0.22331452, -0.15500148,  0.74668594, -0.1055682, -0.16516544],
#           [ 0.9023637, -0.22331452, -0.15500148, -0.25331406, -0.1055682, -0.16516544]]
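Starting from that z, here is a short runnable sketch that reproduces the softmax and the error step (same numbers as above):
import numpy as np

z = np.array([0.01520237, 0.84253418, 0.4773877, 0.96858308, 0.09331018, 0.54090063])
y = np.array([[0, 0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0]])

sft_max_z = np.exp(z) / np.exp(z).sum()   # softmax over the 6 vocabulary words
errors = y - sft_max_z                    # broadcasts: one error row per context word
print(sft_max_z)                          # ~[0.0976, 0.2233, 0.1550, 0.2533, 0.1056, 0.1652]
print(errors)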
Now, you can reduce the error and backpropagate for training. If you are predicting, select the two context words with the highest probabilities (indices 1 and 3 in this case).
Think of it as a classification problem with more than one class (multinomial classification) where the same object can belong to multiple classes at the same time (multilabel classification).
Upvotes: 1