vkmv

Reputation: 1395

Why does word2vec use 2 representations for each word?

I am trying to understand why word2vec's skip-gram model has two representations for each word: the hidden representation (the word embedding) and the output representation (also called the context word embedding). Is this just for generality, so that the context can be anything (not just words), or is there a more fundamental reason?

Upvotes: 19

Views: 7537

Answers (4)

Abhinav Ravi

Reputation: 731

There are two vector representations for every word.

  1. The first vector representation arises when a particular word "W" is treated as the centre word (c) and the neighbourhood window words act as the context words (O).
  2. The second vector representation arises when the same word "W" is treated as a neighbourhood-window context word for other adjacent words, which in turn act as centre words (see the sketch below).

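As a rough illustration, here is a minimal NumPy sketch of these two lookup tables, assuming a toy vocabulary and randomly initialised weights (the names W_center and W_context are just placeholders, not actual word2vec internals):

```python
# Minimal sketch: two separate embedding tables per word (toy example).
import numpy as np

vocab = ["dog", "barks", "loudly"]
dim = 4
rng = np.random.default_rng(0)

W_center = rng.normal(size=(len(vocab), dim))   # row i = word i as centre word (c)
W_context = rng.normal(size=(len(vocab), dim))  # row i = word i as context word (O)

# The score for the pair (centre="dog", context="barks") is the dot product
# between a row of W_center and a row of W_context.
c = W_center[vocab.index("dog")]
o = W_context[vocab.index("barks")]
print(c @ o)
```

After training, implementations typically keep only the centre-word table as the final word embeddings (or average the two tables).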

Upvotes: -1

dust

Reputation: 671

Check the footnote on page 2 of this: http://arxiv.org/pdf/1402.3722v1.pdf

This gives quite a clear intuition for the problem.

But you can also use only one vector to represent each word. Check this (Stanford CS 224n) lecture: https://youtu.be/ERibwqs9p38?t=2064

I am not sure how that would be implemented (and the video does not explain it either).

Upvotes: 5

HediBY

Reputation: 241

I recommend you read this article about Word2Vec: http://arxiv.org/pdf/1402.3722v1.pdf

They give an intuition about why there are two representations in a footnote: it is not likely that a word appears in its own context, so you would want to minimize the probability p(w|w). But if you use the same vector for w as a context word as for w as a center word, you cannot minimize p(w|w) (computed via the dot product) if you are to keep the word embeddings on the unit circle.

But it is just an intuition; I don't know if there is any clear justification for this...
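
Here is a minimal numerical illustration of that footnote, assuming unit-norm toy vectors rather than trained embeddings: with a single shared table, the score a word gives to itself is its squared norm, which is at least as large as its score against any other unit-norm vector (Cauchy-Schwarz), so p(w|w) can never be pushed below the other probabilities.

```python
# Toy illustration: with one shared table of unit-norm vectors,
# p(w|w) is always the largest probability in the softmax.
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))                    # 5 words, 8 dimensions
V /= np.linalg.norm(V, axis=1, keepdims=True)  # keep embeddings on the unit sphere

w = 0                                          # index of the centre word
scores = V @ V[w]                              # shared vectors: score(w', w) = v_w' . v_w
p = np.exp(scores) / np.exp(scores).sum()      # softmax over the vocabulary

# scores[w] = ||v_w||^2 = 1, the largest possible dot product with a unit
# vector, so p[w] is never smaller than any other probability.
print(scores[w], p[w] == p.max())
```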

IMHO, the real reason you use different representations is that you are manipulating entities of a different nature. "dog" as a context is not to be treated the same as "dog" as a center word, because they are not the same thing. You are basically manipulating big matrices of (word, context) occurrences, trying to maximize the probability of the pairs that actually occur. Theoretically you could use bigrams as contexts, trying to maximize, for instance, the probability of (word="for", context="to maximize"), and you would then assign a vector representation to "to maximize". We don't do this because there would be too many representations to compute and the matrix would be extremely sparse, but I think the idea is there: the fact that we use 1-grams as contexts is just a particular case of all the kinds of context we could use.
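
As a small sketch of this pair-counting view, assuming a toy sentence and a window of 2 (the commented-out line shows how bigram contexts would simply be another kind of key in the same table):

```python
# Collect (word, context) co-occurrence counts from a toy sentence.
from collections import Counter

sentence = "we try to maximize the probability of these pairs".split()
window = 2
pairs = Counter()

for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs[(word, sentence[j])] += 1                  # 1-gram context
            # pairs[(word, " ".join(sentence[j:j+2]))] += 1  # bigram context variant

print(pairs.most_common(5))
```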

That's how I see it, and if it's wrong please correct me!

Upvotes: 15

Rudra Murthy

Reputation: 768

The word2vec model can be thought of as a simplified neural network with one hidden layer and no non-linear activation. Given a word, the model tries to predict the context words in which it appears.

Since it's a neural network, it needs an input, an output and an objective function. The input and output are just one-hot encodings of the words, and the objective function is the cross-entropy loss with a softmax activation at the output.

Multiplying the input-to-hidden weight matrix by the one-hot encoded input selects a unique column for every word. Similarly, the hidden-to-output matrix can be interpreted as having one row per context word (the same one-hot encoding, on the output side, plays that role here).
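
Here is a minimal sketch of that forward pass, with toy sizes and random weights (the names W_in and W_out are placeholders, not library API):

```python
# Skip-gram forward pass as a one-hidden-layer network with no non-linearity.
import numpy as np

V, H = 6, 3                      # vocabulary size, hidden (embedding) size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(H, V))   # column i = embedding of word i as centre word
W_out = rng.normal(size=(V, H))  # row j    = embedding of word j as context word

x = np.zeros(V)
x[2] = 1.0                       # one-hot encoding of the centre word

h = W_in @ x                     # selects column 2 of W_in (no activation function)
scores = W_out @ h               # one score per possible context word
p = np.exp(scores - scores.max())
p /= p.sum()                     # softmax over context words
print(p)
```

The columns of W_in are the word embeddings usually kept after training; the rows of W_out are the context-word representations the question asks about.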

Upvotes: -2
