Reputation: 2515
I have been reading a lot of papers on NLP and have come across many models. I understand the SVD model and how to represent it in 2-D, but I still don't understand how we build a word vector by feeding a corpus to the word2vec/skip-gram model. Is it also a co-occurrence matrix representation for each word? Can you explain it using this example corpus:
Hello, my name is John.
John works in Google.
Google has the best search engine.
Basically, how does skip-gram convert John to a vector?
Upvotes: 17
Views: 7771
Reputation: 2698
First of all, you normally don't use SVD with the Skip-Gram model, because Skip-Gram is based on a neural network. You use SVD when you want to reduce the dimension of your word vectors (e.g. for visualization in 2D or 3D space), but with a neural net you construct your embedding matrices with the dimension of your choice. You use SVD if you constructed your embedding matrix from a co-occurrence matrix.
Vector representation with co-occurrence matrix
I wrote an article about this here.
Consider the following two sentences: "all that glitters is not gold" and "all is well that ends well".
The co-occurrence matrix is then:
With a co-occurrence matrix, each row is the word vector for the corresponding word. However, as you can see in the matrix constructed above, each row has 10 columns. This means that the word vectors are 10-dimensional and can't be visualized in 2D or 3D space. So we run SVD to reduce them to 2 dimensions:
Now that the word vectors are 2-dimensional, they can be visualized in a 2D space:
However, reducing the word vectors to 2 dimensions loses a significant amount of information, which is why you shouldn't reduce the dimensionality too aggressively.
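To make this concrete, here is a minimal NumPy sketch of the idea (my own illustration, not necessarily the exact matrix from the figures above): build a co-occurrence matrix with a window of 1 and reduce it to 2 dimensions with SVD.

    import numpy as np

    sentences = [["all", "that", "glitters", "is", "not", "gold"],
                 ["all", "is", "well", "that", "ends", "well"]]

    # Vocabulary and an index for each word.
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}

    # Count co-occurrences within a symmetric window of 1 word.
    window = 1
    co = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    co[idx[w], idx[s[j]]] += 1

    # Each row of `co` is a word vector; reduce it to 2 dimensions with SVD.
    U, S, Vt = np.linalg.svd(co)
    word_vectors_2d = U[:, :2] * S[:2]   # keep the two strongest singular directions
    for w in vocab:
        print(w, word_vectors_2d[idx[w]])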
Let's take another example: achieve and success. Let's say they have 10-dimensional word vectors:
Since achieve and success convey similar meanings, their vector representations are similar. Notice their similar values and color-band pattern. However, since these are 10-dimensional vectors, they can't be visualized directly. So we run SVD to reduce the dimension to 3D and visualize them:
Each value in the word vector represents the word's position within the vector space. Similar words have similar vectors and, as a result, are placed close to each other in the vector space.
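As a rough sketch of that idea (the vectors below are made-up numbers, not the values from the figures), you can check that similar vectors have a high cosine similarity and stay close together after an SVD reduction to 3D:

    import numpy as np

    # Hypothetical 10-dimensional word vectors (made-up values for illustration).
    achieve = np.array([0.2, 0.8, 0.1, 0.9, 0.3, 0.7, 0.2, 0.6, 0.4, 0.5])
    success = np.array([0.3, 0.7, 0.2, 0.8, 0.3, 0.6, 0.1, 0.7, 0.5, 0.5])
    banana  = np.array([0.9, 0.1, 0.8, 0.1, 0.9, 0.2, 0.9, 0.1, 0.8, 0.2])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(achieve, success))   # high: similar meanings, similar vectors
    print(cosine(achieve, banana))    # much lower: dissimilar vectors

    # Stack the vectors and reduce to 3 dimensions with SVD for visualization.
    M = np.vstack([achieve, success, banana])
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    print(U[:, :3] * S[:3])   # 3-D coordinates; "achieve" and "success" land close together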
Vector representation with Skip-Gram
I wrote an article about it here.
Skip-Gram uses a neural network, and therefore does not use SVD, because you can specify the word vectors' dimension as a hyper-parameter when you first construct the network (if you really need to visualize, we use a special technique called t-SNE, not SVD).
Skip-Gram has the following structure:
With Skip-Gram, N-dimensional word vectors are randomly initialized. There are two embedding matrices: the input weight matrix W_input and the output weight matrix W_output.
Let's take W_input as an example. Assume that the words of interest are passes and should. Since the randomly initialized weight matrix is 3-dimensional, the word vectors can be visualized:
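As a minimal sketch of that setup (the toy vocabulary, the dimension N = 3 and the initialization range are my own assumptions; the exact scheme varies by implementation):

    import numpy as np

    vocab = ["passes", "should", "he", "the", "exam"]   # toy vocabulary
    V, N = len(vocab), 3                                # N-dimensional word vectors
    idx = {w: i for i, w in enumerate(vocab)}

    # Randomly initialize the two embedding matrices, one row per word.
    rng = np.random.default_rng(0)
    W_input  = rng.uniform(-0.5 / N, 0.5 / N, size=(V, N))
    W_output = rng.uniform(-0.5 / N, 0.5 / N, size=(V, N))

    # The (still random) 3-D word vectors for "passes" and "should":
    print(W_input[idx["passes"]])
    print(W_input[idx["should"]])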
These weight matrices (W_input and W_output) are optimized by predicting a center word's neighboring words and updating the weights in a way that minimizes the prediction error. The predictions are computed for each context word of a center word, and their prediction errors are summed up to calculate the weight gradients.
The update equations for the weight matrices are:
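The equations were originally shown as an image; as a hedged reconstruction in the notation of Xin Rong's "word2vec Parameter Learning Explained" (vanilla Skip-Gram with a full softmax, which may not match the exact figure that appeared here), they look like this:

    % h: the row of W_input for the center word;  v'_{w_j}: the vector of word j in W_output;
    % y_{c,j}: predicted probability of word j at context position c;  t_{c,j}: its one-hot target;
    % C: number of context words;  V: vocabulary size;  \eta: learning rate.
    EI_j = \sum_{c=1}^{C} \left( y_{c,j} - t_{c,j} \right)                                  % summed prediction error
    v'_{w_j} \leftarrow v'_{w_j} - \eta \, EI_j \, h \quad \text{for } j = 1, \dots, V      % W_output update
    v_{w_I} \leftarrow v_{w_I} - \eta \sum_{j=1}^{V} EI_j \, v'_{w_j}                       % W_input update (center word's row only)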
These updates are applied for each training sample within the corpus (since Word2Vec uses stochastic gradient descent).
Vanilla Skip-Gram vs Negative Sampling
The above Skip-Gram illustration assumes that we use vanilla Skip-Gram. In real life, we don't use vanilla Skip-Gram because of its high computational cost. Instead, we use an adapted form of Skip-Gram called negative sampling.
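In practice you rarely implement this by hand. For example, with the gensim library you can train a Skip-Gram model with negative sampling on the question's toy corpus roughly like this (parameter names are from gensim 4.x, where size was renamed to vector_size; the hyper-parameter values here are arbitrary):

    from gensim.models import Word2Vec

    # Tokenized toy corpus (the one from the question).
    corpus = [["hello", "my", "name", "is", "john"],
              ["john", "works", "in", "google"],
              ["google", "has", "the", "best", "search", "engine"]]

    model = Word2Vec(
        sentences=corpus,
        vector_size=50,   # dimension of the word vectors (a free choice)
        window=2,         # context window size
        min_count=1,      # keep every word, even ones that appear only once
        sg=1,             # 1 = Skip-Gram (0 = CBOW)
        negative=5,       # number of negative samples per positive pair
    )

    print(model.wv["john"])                 # the learned vector for "john"
    print(model.wv.most_similar("google"))  # nearest neighbours in the embedding space

With a corpus this small the resulting vectors are essentially noise, but it shows the mechanics: John comes out as a 50-dimensional vector that you can inspect or compare with other words.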
Upvotes: 2
Reputation: 721
I think you will need to read a paper about the training process. Basically, the values of the vectors are the learned weights of the trained neural network.
I tried to read the original paper but I think the paper "word2vec Parameter Learning Explained" by Xin Rong has a more detailed explanation.
Upvotes: 11
Reputation: 8237
For current readers who might also be wondering "what does a word vector exactly mean", as the OP was at that time: as described at http://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf, a word vector is of dimension n, and n "is an arbitrary size which defines the size of our embedding space." That is to say, the word vector doesn't mean anything concrete. It's just an abstract representation of certain qualities that the word might have, which we can use to distinguish words.
In fact, to directly answer the original question of "how is a word converted to a vector representation": the values of a word's embedding vector are usually just randomized at initialization and improved iteration by iteration.
This is common in deep learning/neural networks, where the people who created the network usually don't have much idea of what the values exactly stand for. The network itself is supposed to figure the values out gradually, through learning. They just abstractly represent something and distinguish things. One example would be AlphaGo, where it would be impossible for the DeepMind team to explain what each value in a vector stands for. It just works.
Upvotes: 2
Reputation: 1696
The main concept can be easily understood with an example of autoencoding with neural networks. You train the neural network to pass information from the input layer to the output layer through a middle layer that is smaller.
In a traditional autoencoder, you have an input vector of size N, a middle layer of size M < N, and the output layer, again of size N. You want only one unit at a time turned on in your input layer, and you train the network to replicate in the output layer the same unit that is turned on in the input layer.
After the training has completed successfully, you will see that the neural network, in order to transport the information from the input layer to the output layer, adapted itself so that each input unit has a corresponding vector representation in the middle layer.
Simplifying a bit, in the context of word2vec your input and output vectors work more or less the same way, except that in the sample you submit to the network the unit turned on in the input layer is different from the unit turned on in the output layer.
In fact you train the network picking pairs of nearby (not necessarily adjacent) words from your corpus and submitting them to the network.
The size of the input and output vector is equal to the size of the vocabulary you are feeding to the network.
Your input vector has only one unit turned on (the one corresponding to the first word of the chosen pair), and the output vector has only one unit turned on (the one corresponding to the second word of the chosen pair).
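A minimal sketch of that sampling scheme (the window size and corpus are toy choices of mine): for each word in the corpus, emit one (input, output) pair of one-hot vectors per nearby word.

    import numpy as np

    corpus = ["john", "works", "in", "google"]
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(len(vocab))
        v[idx[word]] = 1.0
        return v

    window = 2   # how many words on each side count as "nearby"
    training_pairs = []
    for i, center in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                # input: the unit for the center word is on; output: the unit for the nearby word is on
                training_pairs.append((one_hot(center), one_hot(corpus[j])))

    print(len(training_pairs), "training pairs")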
Upvotes: 10