BlackBeard

Reputation: 63

Understanding Word2Vec's Skip-Gram Structure and Output

I have a two-fold question about the Skip-Gram model in Word2Vec:

The way I imagine it is something along the following lines (made-up example):

Assuming the vocabulary ['quick', 'fox', 'jumped', 'lazy', 'dog'] and a context of C=1, and assuming that for the input word 'jumped' I see the two output vectors looking like this:

[0.2 0.6 0.01 0.1 0.09]

[0.2 0.2 0.01 0.16 0.43]

I would interpret this as 'fox' being the most likely word to show up before 'jumped' (p=0.6), and 'dog' being the most likely to show up after it (p=0.43).
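
In code, this is how I would read off those two distributions (purely illustrative; the numbers are made up and I'm assuming each position's vector is a softmax over the vocabulary):

    # Reading the most likely context word from each (made-up) output distribution.
    vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']
    before = [0.2, 0.6, 0.01, 0.1, 0.09]   # position before 'jumped'
    after = [0.2, 0.2, 0.01, 0.16, 0.43]   # position after 'jumped'

    print(vocab[before.index(max(before))])  # 'fox'
    print(vocab[after.index(max(after))])    # 'dog'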

Do I have this right? Or am I completely off?

Upvotes: 6

Views: 4397

Answers (2)

KeshavKolluru

Reputation: 66

Your understanding of both parts seems to be correct, according to this paper:

http://arxiv.org/abs/1411.2738

The paper explains word2vec in detail while keeping it very simple; it's worth a read for a thorough understanding of the neural-net architecture used in word2vec.

  • The Skip-Gram structure does use a single neural net, with the one-hot encoded target word as input and the one-hot encoded context words as the expected outputs. After the net is trained on the text corpus, the input weight matrix W serves as the input-vector representations of the words in the corpus, and the output weight matrix W', which is shared across all C output positions (the "output vectors" in the question's terminology, avoided here to prevent confusion with the output-vector representations introduced next), serves as the output-vector representations of the words. Usually the output-vector representations are ignored, and the input-vector representations W are used as the word embeddings. As for the dimensionality of the matrices: if we assume a vocabulary of size V and a hidden layer of size N, then W is a (V, N) matrix whose rows are the input vectors of the corresponding vocabulary words, and W' is an (N, V) matrix whose columns are the output vectors of the corresponding words. In this way we get N-dimensional vectors for the words (see the sketch below this list).
  • As you mentioned, each of the outputs (avoiding the term "output vector") has size V and is the result of a softmax, with each node giving the probability of that word occurring as a context word for the given target word, so the outputs are not one-hot encoded. The expected outputs, however, are one-hot encoded: in the training phase, the error is computed by subtracting the neural-net output from the one-hot encoded vector of the word actually occurring at that context position, and the weights are then updated using gradient descent.
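
A minimal NumPy sketch of the forward pass described in the bullets above (the weights here are random placeholders and the dimensions are made up for illustration; it assumes the shared-W' softmax formulation):

    import numpy as np

    V, N = 5, 3                        # vocabulary size, hidden-layer size
    rng = np.random.default_rng(0)
    W = rng.normal(size=(V, N))        # input-vector representations, one row per word
    W_prime = rng.normal(size=(N, V))  # output-vector representations, one column per word

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    target_index = 2           # e.g. 'jumped' in the question's vocabulary
    h = W[target_index]        # hidden layer = input vector of the target word
    scores = h @ W_prime       # shared across all C context positions
    probs = softmax(scores)    # length-V probability distribution over context words
    print(probs)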

Referring to the example you mentioned, with C=1 and a vocabulary of ['quick', 'fox', 'jumped', 'lazy', 'dog']:

If the output from the skip-gram for one context position is [0.2 0.6 0.01 0.1 0.09] and the correct context word at that position is 'fox', then the error is calculated as:

[0 1 0 0 0] - [0.2 0.6 0.01 0.1 0.09] = [-0.2 0.4 -0.01 -0.1 -0.09]

and the weight matrices are updated to minimize this error.
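
In NumPy terms, that error computation is simply (a small sketch with the numbers from the example):

    import numpy as np

    output = np.array([0.2, 0.6, 0.01, 0.1, 0.09])  # softmax output for one context position
    target = np.array([0, 1, 0, 0, 0])              # one-hot vector for the true context word 'fox'
    error = target - output
    print(error)  # approximately [-0.2  0.4  -0.01 -0.1  -0.09]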

Upvotes: 5

Chang-Uk Shin

Reputation: 61

No. You can set the length (dimensionality) of the vector freely.

Then, what is the vector?

It is a distributed representation of the meaning of the word.

I can't explain exactly how it is trained, but the trained vectors end up behaving like the example below.

If one vector representation looks like this,

[0.2 0.6 0.2]

then it is closer to [0.2 0.7 0.2] than to [0.7 0.2 0.5].
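
For instance, a quick check of that claim (using Euclidean distance purely for illustration; the vectors are made up):

    import numpy as np

    v = np.array([0.2, 0.6, 0.2])
    a = np.array([0.2, 0.7, 0.2])
    b = np.array([0.7, 0.2, 0.5])

    print(np.linalg.norm(v - a))  # small distance: v is close to a
    print(np.linalg.norm(v - b))  # larger distance: v is farther from b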

Here is another example.

CRY [0.5 0.7 0.2]
HAPPY [-0.4 0.3 0.1]
SAD [0.4 0.6 0.2]

'CRY' is closer to 'SAD' than to 'HAPPY' because these methods (CBOW, Skip-Gram, etc.) pull the vectors closer together when the meanings (or syntactic positions) of the words are similar.

In practice, the accuracy depends on many things. The choice of method is important, and so is a large amount of good data (corpora).

If you want to check the similarity of some words, you build the word vectors first and then compute the cosine similarity between them.
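
For example, here is a small NumPy sketch using the made-up CRY/HAPPY/SAD vectors above (the helper cosine_similarity is just for illustration):

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    cry = np.array([0.5, 0.7, 0.2])
    happy = np.array([-0.4, 0.3, 0.1])
    sad = np.array([0.4, 0.6, 0.2])

    print(cosine_similarity(cry, sad))    # close to 1: very similar
    print(cosine_similarity(cry, happy))  # much smaller: less similar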

This paper explains some methods and lists their accuracies.

If you can read C code, the word2vec program is useful. It implements CBOW (Continuous Bag-of-Words) and Skip-Gram.

Upvotes: 0
