Reputation: 1275
I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of the tokens in the sentence.
Upvotes: 97
Views: 86723
Reputation: 341
let suppose this is current sentence
import gensim
from gensim.models import Word2Vec
from gensim import models
model = gensim.models.KeyedVectors.load_word2vec_format('path of your trainig
dataset', binary=True)
strr = 'i am'
strr2 = strr.split()
print(strr2)
model[strr2] //this the the sentance embeddings.
Upvotes: 1
Reputation: 8150
Google's Universal Sentence Encoder embeddings are an updated solution to this problem. It doesn't use Word2vec but results in a competing solution.
Here is a walk-through with TFHub and Keras.
Upvotes: 5
Reputation: 1691
It depends on the usage:
1) If you only want to get sentence vector for some known data. Check out paragraph vector in these papers:
Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. Eprint Arxiv,4:1188–1196.
A. M. Dai, C. Olah, and Q. V. Le. 2015. DocumentEmbedding with Paragraph Vectors. ArXiv e-prints,July.
2) If you want a model to estimate sentence vector for unknown(test) sentences with unsupervised approach:
You could check out this paper:
3)Researcher are also looking for the output of certain layer in RNN or LSTM network, recent example is:
http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195
4)For the gensim doc2vec, many researchers could not get good results, to overcome this problem, following paper using doc2vec based on pre-trained word vectors.
Facebook has SentEval project for evaluating the quality of sentence vectors.
https://github.com/facebookresearch/SentEval
6) There are more information in the following paper:
Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering
And for now you can use 'BERT':
Google release the source code as well as pretrained models.
https://github.com/google-research/bert
And here is an example to run bert as a service:
https://github.com/hanxiao/bert-as-service
Upvotes: 25
Reputation: 75
Deep averaging network (DAN) can provide sentence embeddings in which word bi-grams are averaged and passed through feedforward deep neural network(DNN).
It is found that transfer learning using sentence embeddings tends to outperform word level transfer as it preserves the semantic relationship.
You don't need to start the training from scratch, the pretrained DAN models are available for perusal ( Check Universal Sentence Encoder module in google hub).
Upvotes: 1
Reputation: 3910
There are several ways to get a vector for a sentence. Each approach has advantages and shortcomings. Choosing one depends on the task you want to perform with your vectors.
First, you can simply average the vectors from word2vec. According to Le and Mikolov, this approach performs poorly for sentiment analysis tasks, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm". On the other hand, according to Kenter et al. 2016, "simply averaging word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks. A variant would be to weight word vectors with their TF-IDF to decrease the influence of the most common words.
A more sophisticated approach developed by Socher et al. is to combine word vectors in an order given by a parse tree of a sentence, using matrix-vector operations. This method works for sentences sentiment analysis, because it depends on parsing.
Upvotes: 40
Reputation: 30311
I've had good results from:
Upvotes: 5
Reputation: 4898
You can get vector representations of sentences during training phase (join the test and train sentences in a single file and run word2vec code obtained from following link).
Code for sentence2vec has been shared by Tomas Mikolov here. It assumes first word of a line to be sentence-id. Compile the code using
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops
and run it using
./word2vec -train alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
EDIT
Gensim (development version) seems to have a method to infer vectors of new sentences. Check out model.infer_vector(NewDocument)
method in https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py
Upvotes: 10
Reputation: 9061
There are differet methods to get the sentence vectors :
Upvotes: 107
Reputation: 356
It is possible, but not from word2vec. The composition of word vectors in order to obtain higher-level representations for sentences (and further for paragraphs and documents) is a really active research topic. There is not one best solution to do this, it really depends on to what task you want to apply these vectors. You can try concatenation, simple summation, pointwise multiplication, convolution etc. There are several publications on this that you can learn from, but ultimately you just need to experiment and see what fits you best.
Upvotes: 29