Anders

Reputation: 115

Finding the cosine similarity of two sentences using LSA

I am trying to use Latent Semantic Indexing (LSI) to compute the cosine similarity between two sentences, based on the topics learned from a large corpus, but I'm struggling to find a tutorial that does exactly this. The closest I've found is "Semantic Similarity between Phrases Using GenSim", but I'm not looking for the sentence most similar to a query. I specifically want to use an LSI model to reduce the dimensionality of two sentences and then measure the cosine similarity between them. Can someone please help?

Based on that article, I think I need something like the code below, followed by the cosine similarity calculation, but I'm stuck on that last step.

from gensim import corpora, models

# texts = list of list of words from a database
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=400)
doc_1 = "Mary and Samantha arrived at the bus station early but waited until noon for the bus"
doc_2 = "when the seagulls follow the trawler, it is because they think sardines will be dropped in the sea"
vec_bow_1 = dictionary.doc2bow(doc_1.lower().split())
vec_bow_2 = dictionary.doc2bow(doc_2.lower().split())
vec_lsi_1 = lsi[vec_bow_1]
vec_lsi_2 = lsi[vec_bow_2]

Upvotes: 0

Views: 694

Answers (1)

gojomo

Reputation: 54153

If you've succeeded in making vec_lsi_1 a vector for your doc_1, and vec_lsi_2 a vector for your doc_2, have you tried simply calculating the cosine similarity between those two vectors? Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean (L2) norms. For example:

import numpy as np

# Assumes vec_lsi_1 and vec_lsi_2 are dense 1-D arrays of equal length
cossim = np.dot(vec_lsi_1, vec_lsi_2) / (
    np.linalg.norm(vec_lsi_1) * np.linalg.norm(vec_lsi_2)
)

Update for completeness: if vec_lsi_1 and vec_lsi_2 are sparse vectors, that is, lists of (index, value) pairs where unmentioned indexes are assumed to be 0.0, then np.dot() may not work directly; see https://radimrehurek.com/gensim/matutils.html#gensim.matutils.sparse2full for a helper function that turns Gensim's sparse format into a dense numpy vector.
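If you'd rather not densify at all, the same calculation can be done directly on the sparse (index, value) lists. The following is only a minimal sketch using the standard library: the function name sparse_cosine and the example vectors are made up for illustration and are not part of Gensim. It also assumes that a sentence with no in-vocabulary words yields an empty list, and returns 0.0 in that case instead of dividing by zero.

```python
import math


def sparse_cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as lists of (index, value).

    Hypothetical helper for illustration; indexes missing from a list are
    treated as 0.0.
    """
    a = dict(vec_a)
    b = dict(vec_b)
    # Only indexes present in both vectors contribute to the dot product.
    dot = sum(value * b[index] for index, value in a.items() if index in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty or all-zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sparse vectors standing in for vec_lsi_1 and vec_lsi_2.
example_1 = [(0, 1.0), (2, 2.0)]
example_2 = [(0, 2.0), (1, 3.0), (2, 4.0)]
print(sparse_cosine(example_1, example_2))
```

With the question's variables this would be called as sparse_cosine(vec_lsi_1, vec_lsi_2). Gensim's matutils module also appears to include a cossim helper for this sparse format, so it is worth checking the documentation for your version before rolling your own.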

Upvotes: 0
