Reputation: 1395
I am trying to understand why word2vec's skip-gram model has two representations for each word: the input representation (the hidden layer, i.e. the word embedding) and the output representation (also called the context word embedding). Is this just for generality, so that the context can be anything (not just words), or is there a more fundamental reason?
Upvotes: 19
Views: 7537
Reputation: 731
There are two vector representations for every word.
Upvotes: -1
Reputation: 671
Check the footnote on page 2 of this: http://arxiv.org/pdf/1402.3722v1.pdf
This gives a quite clear intuition for the problem.
But you can also use only one vector to represent a word. Check this (Stanford CS 224n) https://youtu.be/ERibwqs9p38?t=2064
I am not sure how that would be implemented (the video does not explain it either).
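One possibility (purely my guess, not necessarily what the lecture does) is to tie the two matrices, i.e. use the same embedding matrix for the center word and for scoring context words. A minimal sketch, with made-up names (`E`, `vocab_size`, `dim`):

```python
import numpy as np

# Illustrative sketch of "one vector per word": the same matrix E is used both
# for the center word and for scoring context words. Names and sizes are mine.
vocab_size, dim = 10000, 100
rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(vocab_size, dim))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def context_probs(center_id):
    scores = E @ E[center_id]   # dot product of every word vector with the center vector
    return softmax(scores)      # p(context word | center word)

p = context_probs(42)
print(p[42])  # the word's probability of being its own context is driven by ||E[42]||^2
```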
Upvotes: 5
Reputation: 241
I recommend reading this article about Word2Vec: http://arxiv.org/pdf/1402.3722v1.pdf
They give an intuition for the two representations in a footnote: a word is unlikely to appear in its own context, so you would want the probability p(w|w) to be small. But if you use the same vector for w as a context word as for w as a center word, you cannot make p(w|w) (computed via the dot product) small while keeping the word embeddings at a fixed norm (on the unit circle).
But it is just an intuition; I don't know if there is any rigorous justification for it.
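To make that argument concrete, here is the standard skip-gram softmax written out (u are the output/context vectors, v are the input/center vectors, as in the linked paper):

```latex
p(o \mid c) \;=\; \frac{\exp\!\left(u_o^{\top} v_c\right)}{\sum_{w \in V} \exp\!\left(u_w^{\top} v_c\right)}
```

With two sets of vectors, u_w^T v_w can be made small independently of the rest. If you tie them (u_w = v_w for all w), then p(w|w) is proportional to exp(v_w^T v_w) = exp(||v_w||^2), which is fixed by the embedding norm (and v_w^T v_w is the largest dot product v_w can have with any vector of the same norm), so the "word in its own context" probability cannot be pushed down.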
IMHO, the real reason you use different representations is that you manipulate entities of a different nature. "dog" as a context is not to be treated the same as "dog" as a center word, because they are not the same thing. You basically manipulate big matrices of (word, context) co-occurrences, trying to maximize the probability of the pairs that actually occur. In theory you could even use bigrams as contexts, trying to maximize, for instance, the probability of (word="for", context="to maximize"), and you would assign a vector representation to "to maximize" (see the toy sketch below). We don't do this because there would be far too many representations to compute and the co-occurrence matrix would be extremely sparse, but I think the idea is there: using single words ("1-grams") as contexts is just a particular case of all the kinds of context we could use.
That's how I see it, and if it's wrong, please correct me!
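As a toy illustration of the "any kind of context" point (my own sketch, not from the paper), here is how you could extract (word, context) pairs where the context is the bigram that follows the word:

```python
from collections import Counter

# Toy sketch: count (word, context) pairs where the context is the bigram that
# follows the word, to show that contexts need not be single words.
sentence = "we try to maximize the probability of these pairs".split()

pairs = Counter()
for i, word in enumerate(sentence):
    bigram = tuple(sentence[i + 1:i + 3])
    if len(bigram) == 2:
        pairs[(word, bigram)] += 1

# Each distinct bigram context would need its own output vector, which is why
# the context vocabulary (and the output matrix) explodes in size.
for (word, ctx), count in pairs.items():
    print(f"({word!r}, {' '.join(ctx)!r}) appeared {count} time(s)")
```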
Upvotes: 15
Reputation: 768
The word2vec skip-gram model can be thought of as a simplified neural network with one hidden layer and no non-linear activation. Given a word, the model tries to predict the context words in which it appears.
Since it's a neural network, it needs an input, an output and an objective function. The input and output are just one-hot encodings of the words, and the objective function is the cross-entropy loss with a softmax activation at the output.
The input-to-hidden weight matrix multiplies the one-hot input, which simply selects a unique column for every word. Similarly, the hidden-to-output matrix can be interpreted as having one row per context word (the one-hot encoding of the output plays the corresponding role here).
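A rough sketch of that forward pass (W_in / W_out and the sizes are illustrative names of mine, not from any library; negative sampling, windowing and the training loop are left out):

```python
import numpy as np

# Rough sketch of the skip-gram forward pass described above.
vocab_size, hidden_dim = 10000, 300
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(hidden_dim, vocab_size))   # input -> hidden
W_out = rng.normal(scale=0.01, size=(vocab_size, hidden_dim))  # hidden -> output

def forward(center_id, context_id):
    h = W_in[:, center_id]              # multiplying by a one-hot input just selects this column
    scores = W_out @ h                  # one row of W_out per candidate context word
    scores -= scores.max()              # numerical stability for the softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[context_id])   # cross-entropy against the one-hot context target
    return loss, probs

loss, _ = forward(center_id=12, context_id=345)
print(loss)
```

In this picture, the columns of W_in play the role of the word embeddings people usually keep, and the rows of W_out play the role of the context ("output") vectors the question asks about.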
Upvotes: -2