Reputation: 5366
from gensim import corpora, models, similarities
documents = ["This is a book about cars, dinosaurs, and fences"]
# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)
In the above code I am comparing how much "This is a book about cars, dinosaurs, and fences" is similar to "I like cars and birds" using the cosine similarity technique.
The two sentences have effectively 1 words in common, which is "cars", however when I run the code I get that they are 100% similar. This does not make sense to me.
Can someone suggest how to improve my code so that I get a reasonable number?
Upvotes: 1
Views: 4383
Reputation: 54243
These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few text examples don't work well – and even if they do, it's often just good luck or contrived suitability.
In particular:
a model with only one example can't sensibly create multiple topics, as there's no contrast-between-documents to model
a model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars' – the only word it's seen before
In this case, both your single training document, and the test document, get modeled by LSI as having 0
contribution from the 0th topic, and positive contribution (of different magnitudes) from the 1st topic. Since cosine-similarity merely compares angle, and not magnitude, both docs are along-the-same-line-from-the-origin, and so have no angle-of-difference, and thus similarity 1.0.
But if you had better training data, and more than a single-known-word test doc, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help... but hundreds or thousands or tens-of-thousands training-docs would be even better.
Upvotes: 1