Reputation: 63
I have a two-fold question about the Skip-Gram model in Word2Vec:
The first part is about structure: as far as I understand it, the Skip-Gram model is based on one neural network with one input weight matrix W, one hidden layer of size N, and C output weight matrices W' each used to produce one of the C output vectors. Is this correct?
The second part is about the output vectors: as far as I understand it, each output vector is of size V and is a result of a Softmax function. Each output vector node corresponds to the index of a word in the vocabulary, and the value of each node is the probability that the corresponding word occurs at that context location (for a given input word). The target output vectors are not, however, one-hot encoded, even if the training instances are. Is this correct?
The way I imagine it is something along the following lines (made-up example):
Assuming the vocabulary ['quick', 'fox', 'jumped', 'lazy', 'dog'] and a context of C=1, and assuming that for the input word 'jumped' I see the two output vectors looking like this:
[0.2 0.6 0.01 0.1 0.09]
[0.2 0.2 0.01 0.16 0.43]
I would interpret this as 'fox' being the most likely word to show up before 'jumped' (p=0.6), and 'dog' being the most likely to show up after it (p=0.43).
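To make the way I picture it concrete, here is a small sketch (NumPy, with made-up scores; the variable names and numbers are my own assumptions, not from any implementation) of how a softmax over the vocabulary would give one distribution per context position:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']

# hypothetical raw scores for the position before and the position after
# the input word 'jumped' (one score per vocabulary word)
scores_before = np.array([0.5, 1.6, -1.4, -0.2, -0.3])
scores_after  = np.array([0.1, 0.1, -2.9, -0.1, 0.9])

p_before = softmax(scores_before)  # probability distribution over the vocabulary
p_after  = softmax(scores_after)

print(vocab[np.argmax(p_before)])  # most probable word before 'jumped' -> 'fox'
print(vocab[np.argmax(p_after)])   # most probable word after 'jumped'  -> 'dog'
```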
Do I have this right? Or am I completely off?
Upvotes: 6
Views: 4397
Reputation: 66
Your understanding of both parts seems to be correct, according to this paper:
http://arxiv.org/abs/1411.2738
The paper explains word2vec in detail while keeping things simple; it's worth a read for a thorough understanding of the neural-net architecture used in word2vec.
Referring to the example you mentioned, with C=1 and a vocabulary of ['quick', 'fox', 'jumped', 'lazy', 'dog']: if the output from the skip-gram is [0.2 0.6 0.01 0.1 0.09] and the correct target word is 'fox', then the error is calculated as:
[0 1 0 0 0] - [0.2 0.6 0.01 0.1 0.09] = [-0.2 0.4 -0.01 -0.1 -0.09]
and the weight matrices are updated to minimize this error.
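A minimal sketch of that error computation in Python (NumPy); the variable names are my own, but the numbers are the ones from the example above:

```python
import numpy as np

vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']

y_pred = np.array([0.2, 0.6, 0.01, 0.1, 0.09])  # softmax output for one context position
target = np.zeros(len(vocab))
target[vocab.index('fox')] = 1.0                # one-hot vector for the true context word

error = target - y_pred
print(error)  # [-0.2  0.4  -0.01 -0.1  -0.09]
```

This error vector is what gets backpropagated to update the output and input weight matrices.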
Upvotes: 5
Reputation: 61
No. You can set the length of the vectors freely.
Then what is the vector?
It is a distributed representation of the meaning of the word.
I don't understand exactly how it is trained, but the trained vectors carry meaning, as shown below.
If one vector representation is [0.2 0.6 0.2], it is closer to [0.2 0.7 0.2] than to [0.7 0.2 0.5].
Here is another example.
CRY [0.5 0.7 0.2]
HAPPY [-0.4 0.3 0.1]
SAD [0.4 0.6 0.2]
'CRY' is closer to 'SAD' than to 'HAPPY', because the methods (CBOW, skip-gram, etc.) place vectors closer together when the meaning (or syntactic position) of the words is similar.
In practice, the accuracy depends on many things: the choice of method matters, and so does having a large amount of good data (corpora).
If you want to check the similarity of some words, build their word vectors first and then compute the cosine similarity between them, as in the sketch below.
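For example, a small Python sketch of that cosine-similarity check, using the made-up vectors above:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two word vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cry   = np.array([0.5, 0.7, 0.2])
happy = np.array([-0.4, 0.3, 0.1])
sad   = np.array([0.4, 0.6, 0.2])

print(cosine_similarity(cry, sad))    # close to 1: similar meaning
print(cosine_similarity(cry, happy))  # much smaller: less similar
```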
This paper explains some methods and lists their accuracies.
If you can read C code, the original word2vec program is useful; it implements both CBOW (continuous bag-of-words) and skip-gram.
Upvotes: 0