PiccolMan

Reputation: 5366

Text similarity with gensim and cosine similarity

from gensim import corpora, models, similarities

documents = ["This is a book about cars, dinosaurs, and fences"]

# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())

vec_lsi = lsi[vec_bow]  # convert the query bag-of-words into LSI topic space
index = similarities.MatrixSimilarity(lsi[corpus])  # index the training corpus in LSI space

sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)

In the above code I am comparing how similar "This is a book about cars, dinosaurs, and fences" is to "I like cars and birds" using cosine similarity.

The two sentences effectively have only one word in common, "cars", yet when I run the code it reports that they are 100% similar. This does not make sense to me.

Can someone suggest how to improve my code so that I get a reasonable number?

Upvotes: 1

Views: 4383

Answers (1)

gojomo

Reputation: 54243

These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized corpora of just one or a few texts don't work well, and even when they seem to, it's often just good luck or contrived suitability.

In particular:

  • a model trained on only one document can't sensibly derive multiple topics, as there is no contrast between documents to model

  • a model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars', the only word it has seen before (see the snippet after this list)
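You can check the second point directly against the code in your question: doc2bow silently drops every query word that isn't in the dictionary, and with a single training document only 'cars' survives (the exact token id below is illustrative, it depends on dictionary ordering):

# continuing from the question's code: unknown words are dropped by doc2bow,
# so the whole query reduces to a single (token_id, count) pair for 'cars'
print(dictionary.doc2bow("I like cars and birds".lower().split()))
# e.g. [(3, 1)]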

In this case, both your single training document, and the test document, get modeled by LSI as having 0 contribution from the 0th topic, and positive contribution (of different magnitudes) from the 1st topic. Since cosine-similarity merely compares angle, and not magnitude, both docs are along-the-same-line-from-the-origin, and so have no angle-of-difference, and thus similarity 1.0.
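As a quick illustration of that angle-only behavior (plain NumPy, separate from your gensim code): two vectors lying on the same line from the origin have cosine similarity 1.0 no matter how different their lengths are:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# both "documents" load only the 2nd topic, with different magnitudes
doc_a = np.array([0.0, 0.8])
doc_b = np.array([0.0, 2.5])
print(cosine(doc_a, doc_b))  # 1.0, same direction, zero angular difference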

But if you had better training data, and more than a single-known-word test doc, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help... but hundreds or thousands or tens-of-thousands training-docs would be even better.
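For example, here's a minimal sketch of the same pipeline with a handful of made-up, purely illustrative training documents. It is still far too small for real use, but it gives the model some contrast between documents, so the query no longer collapses onto a single line in topic space:

from gensim import corpora, models, similarities

# a tiny, made-up corpus: still toy-sized, but with some contrast between docs
train_docs = [
    "this is a book about cars dinosaurs and fences",
    "cars have engines and wheels",
    "dinosaurs are extinct reptiles",
    "birds can fly and sing",
    "fences keep dogs in the yard",
]
stoplist = set('for a of the and to in is are can have this about'.split())
texts = [[w for w in d.lower().split() if w not in stoplist] for d in train_docs]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

query = "I like cars and birds"
vec_lsi = lsi[dictionary.doc2bow(query.lower().split())]
print(list(index[vec_lsi]))  # five scores that should no longer all be 1.0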

Upvotes: 1
