Reputation: 63
I have a two-fold question about the Skip-Gram model in Word2Vec:
The first part is about structure: as far as I understand it, the Skip-Gram model is based on one neural network with one input weight matrix W, one hidden layer of size N, and C output weight matrices W' each used to produce one of the C output vectors. Is this correct?
The second part is about the output vectors: as far as I understand it, each output vector is of size V and is a result of a Softmax function. Each output vector node corresponds to the index of a word in the vocabulary, and the value of each node is the probability that the corresponding word occurs at that context location (for a given input word). The target output vectors are not, however, one-hot encoded, even if the training instances are. Is this correct?
The way I imagine it is something along the following lines (made-up example):
Assuming the vocabulary ['quick', 'fox', 'jumped', 'lazy', 'dog'] and a context of C=1, and assuming that for the input word 'jumped' I see the two output vectors looking like this:
[0.2 0.6 0.01 0.1 0.09]
[0.2 0.2 0.01 0.16 0.43]
I would interpret this as 'fox' being the most likely word to show up before 'jumped' (p=0.6), and 'dog' being the most likely to show up after it (p=0.43).
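To make the way I picture it concrete, here is a small sketch (NumPy, with made-up scores; the variable names and numbers are my own assumptions, not from any implementation) of how a softmax over the vocabulary would give one distribution per context position:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']

# hypothetical raw scores for the position before and the position after
# the input word 'jumped' (one score per vocabulary word)
scores_before = np.array([0.5, 1.6, -1.4, -0.2, -0.3])
scores_after  = np.array([0.1, 0.1, -2.9, -0.1, 0.9])

p_before = softmax(scores_before)  # probability distribution over the vocabulary
p_after  = softmax(scores_after)

print(vocab[np.argmax(p_before)])  # most probable word before 'jumped' -> 'fox'
print(vocab[np.argmax(p_after)])   # most probable word after 'jumped'  -> 'dog'
```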
Do I have this right? Or am I completely off?
Upvotes: 6
Views: 4397
Reputation: 66
Your understanding of both parts seems to be correct, according to this paper:
http://arxiv.org/abs/1411.2738
The paper explains word2vec in detail while keeping things simple; it's worth a read for a thorough understanding of the neural-net architecture used in word2vec.
Referring to the example you mentioned, with C=1 and a vocabulary of ['quick', 'fox', 'jumped', 'lazy', 'dog']: if the output from the skip-gram is [0.2 0.6 0.01 0.1 0.09] and the correct target word is 'fox', then the error is calculated as:
[0 1 0 0 0] - [0.2 0.6 0.01 0.1 0.09] = [-0.2 0.4 -0.01 -0.1 -0.09]
and the weight matrices are updated to minimize this error.
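A minimal sketch of that error computation in Python (NumPy); the variable names are my own, but the numbers are the ones from the example above:

```python
import numpy as np

vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']

y_pred = np.array([0.2, 0.6, 0.01, 0.1, 0.09])  # softmax output for one context position
target = np.zeros(len(vocab))
target[vocab.index('fox')] = 1.0                # one-hot vector for the true context word

error = target - y_pred
print(error)  # [-0.2  0.4  -0.01 -0.1  -0.09]
```

This error vector is what gets backpropagated to update the output and input weight matrices.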
Upvotes: 5
Reputation: 61
No. You can set the length of the vectors freely.
Then what is the vector?
It is a distributed representation of the meaning of the word.
I don't understand exactly how it is trained, but the trained vectors carry meaning, as shown below.
If one vector representation is [0.2 0.6 0.2], it is closer to [0.2 0.7 0.2] than to [0.7 0.2 0.5].
Here is another example.
CRY [0.5 0.7 0.2]
HAPPY [-0.4 0.3 0.1]
SAD [0.4 0.6 0.2]
'CRY' is closer to 'SAD' than to 'HAPPY', because the methods (CBOW, skip-gram, etc.) place vectors closer together when the meaning (or syntactic position) of the words is similar.
In practice, the accuracy depends on many things: the choice of method matters, and so does having a large amount of good data (corpora).
If you want to check the similarity of some words, build their word vectors first and then compute the cosine similarity between them, as in the sketch below.
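For example, a small Python sketch of that cosine-similarity check, using the made-up vectors above:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two word vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cry   = np.array([0.5, 0.7, 0.2])
happy = np.array([-0.4, 0.3, 0.1])
sad   = np.array([0.4, 0.6, 0.2])

print(cosine_similarity(cry, sad))    # close to 1: similar meaning
print(cosine_similarity(cry, happy))  # much smaller: less similar
```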
This paper explains some methods and lists their accuracies.
If you can read C code, the original word2vec program is useful; it implements both CBOW (continuous bag-of-words) and skip-gram.
Upvotes: 0