user8959427

Reputation: 2067

Can I obtain Word2Vec and Doc2Vec matrices to calculate cosine similarity?

I am working with text data, and at the moment I have put my data into a term-document matrix and calculated TF (term frequency) and TF-IDF (term frequency-inverse document frequency) scores. From here my matrix looks like:

columns = document names

rownames = words

filled with their TF and TF-IDF scores.

I have been using the tm package in R for much of my current analysis, but to take it further I have started playing around with the gensim library in Python.

It's not clear to me whether I already have word embeddings in the TF and TF-IDF scores. I am hoping to use Word2Vec/Doc2Vec to obtain a matrix similar to what I currently have, and then calculate the cosine similarity between documents. Is this one of the outputs of the models?

I basically have about 6000 documents. I want to calculate the cosine similarity between them and then rank these cosine similarity scores.

Upvotes: 0

Views: 2296

Answers (2)

gojomo

Reputation: 54173

Yes, you could train a Word2Vec or Doc2Vec model on your texts. (Though, your data is a bit small for these algorithms.)

Afterwards, with a Word2Vec model (or some modes of Doc2Vec), you would have word-vectors for all the words in your texts. One simple way to then create a vector for a longer text is to average together all the vectors for the text's individual words. Then, with a vector for each text, you can compare texts by calculating the cosine-similarity of their vectors.
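For instance, here is a minimal sketch of that averaging approach with gensim and scikit-learn. It assumes `tokenized_docs` is a placeholder for a list of token lists prepared from your documents, and uses gensim 4.x parameter names:

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Train word-vectors on your own texts
# (vector_size was called size in gensim < 4.0)
model = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=2)

def average_vector(tokens, model):
    """Average the vectors of the tokens the model knows; zeros if none match."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# One vector per document, then an n_docs x n_docs similarity matrix
doc_vectors = np.array([average_vector(doc, model) for doc in tokenized_docs])
similarities = cosine_similarity(doc_vectors)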

Alternatively, with a Doc2Vec model, you can either (a) look up the learned doc-vectors for texts that were in the training set; or (b) use infer_vector() to feed in new text, which should be tokenized the same way as the training data, and get a model-compatible vector for that new text.
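A small sketch of both options, again assuming the placeholder `tokenized_docs` holds your tokenized training texts:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each training text with its index
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_docs)]
model = Doc2Vec(tagged, vector_size=100, epochs=20, min_count=2)

# (a) the doc-vector learned for training document 0
# (model.dv is gensim >= 4.0 naming; earlier versions use model.docvecs)
vec_0 = model.dv[0]

# (b) a model-compatible vector for a new, identically tokenized text
new_vec = model.infer_vector(['a', 'new', 'tokenized', 'document'])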

Upvotes: 1

Kurtis Streutker

Reputation: 1317

The documentation says infer_vector() returns the inferred paragraph vector for the new document. Note that subsequent calls to this function may infer different representations for the same document (you can make it deterministic by hard-coding a seed, e.g. model.random.seed(0)).
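A hedged sketch of that seeding trick, assuming model is an already-trained Doc2Vec model and tokens is a token list:

# Re-seeding gensim's internal RNG before each call, as the note above
# suggests, makes infer_vector() repeatable for the same tokens
model.random.seed(0)
vec_a = model.infer_vector(tokens)
model.random.seed(0)
vec_b = model.infer_vector(tokens)  # same vector as vec_a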

It's more common to use sklearn for TF-IDF and cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document',
    'This is the second second document',
    'And the third one',
]

# Build a documents x terms TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# The vocabulary, in column order
# (use get_feature_names() on scikit-learn < 1.0; it was removed in 1.2)
words = vectorizer.get_feature_names_out()

# Pairwise cosine similarity between all documents (n_docs x n_docs)
similarity_matrix = cosine_similarity(tfidf)
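Since you want to rank roughly 6000 documents by similarity, one way (a small sketch reusing similarity_matrix from above) is to sort each row:

import numpy as np

# Rank all documents against document 0, most similar first
# (argsort is ascending, so reverse; [1:] drops the document itself)
ranking = np.argsort(similarity_matrix[0])[::-1][1:]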

Doc2Vec's most_similar() uses cosine similarity under the hood, so I believe you can use these vectors for that purpose.

import gensim

# Load a previously trained model (the path is a placeholder)
model = gensim.models.Doc2Vec.load('saved_doc2vec_model')

# Tokenize the new text the same way as the training data
new_sentence = "This is a sample document".split(" ")

# Infer a vector for the new text and find the most similar training documents
# (model.docvecs is gensim 3.x naming; use model.dv in gensim >= 4.0)
model.docvecs.most_similar([model.infer_vector(new_sentence)])

This will return a list of (label, cosine_similarity_score) tuples for the most similar documents.

Hope this helps.

Upvotes: 2
