R.Wedisa

Reputation: 107

How to compare sentences while taking the positions of keywords into account?

I want to compare two sentences. As an example: sentence1="football is good,cricket is bad" sentence2="cricket is good,football is bad"

Generally, these sentences are unrelated; that is, they have different meanings. But when I compare them with Python NLTK tools, I get 100% similarity. How can I fix this issue? I need help.

Upvotes: 0

Views: 696

Answers (2)

CLpragmatics

Reputation: 645

Semantic similarity is a bit tricky in this case: even if you use context counts (which would mean n-grams > 5), you cannot cope with antonyms (e.g. black and white) well enough. Before moving to other methods, you could try a shallow parser or dependency parser to extract subject-verb or subject-verb-object relations (e.g. ), which you can then use as dimensions. If this does not give you the expected similarity (or values adequate for your application), use word embeddings trained on very large data.
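To make the "relations as dimensions" idea concrete, here is a minimal sketch. It does not use a real parser: as a stand-in, it assumes every clause has the simple form "X is Y" and pulls out (subject, attribute) pairs with a regular expression. The helper names extract_relations and relation_overlap are made up for this illustration; with a dependency parser you would extract the pairs from the parse tree instead.

```python
import re

def extract_relations(sentence):
    """Return the set of (subject, attribute) pairs found in 'X is Y' clauses.

    Assumption: clauses are separated by , . or ; and each has the form 'X is Y'.
    A real shallow or dependency parser would replace this regex.
    """
    relations = set()
    for clause in re.split(r"[,.;]", sentence.lower()):
        match = re.fullmatch(r"\s*(\w+)\s+is\s+(\w+)\s*", clause)
        if match:
            relations.add((match.group(1), match.group(2)))
    return relations

def relation_overlap(a, b):
    """Jaccard overlap between the relation sets of two sentences."""
    ra, rb = extract_relations(a), extract_relations(b)
    if not ra and not rb:
        return 0.0
    return len(ra & rb) / len(ra | rb)

s1 = "football is good,cricket is bad"
s2 = "cricket is good,football is bad"

print(sorted(extract_relations(s1)))  # [('cricket', 'bad'), ('football', 'good')]
print(relation_overlap(s1, s2))       # 0.0, no relation is shared
print(relation_overlap(s1, s1))       # 1.0, identical relations
```

Because each dimension is a whole relation rather than a single word, swapping "good" and "bad" between the two subjects changes every dimension, which is the behaviour the question asks for.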

Upvotes: 1

ashutosh singh

Reputation: 533

Yes, wup_similarity internally uses synsets for single tokens to calculate similarity.

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
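As a rough illustration of that definition, the sketch below computes the commonly cited Wu-Palmer formula, 2 * depth(LCS) / (depth(a) + depth(b)), on a tiny hand-made taxonomy. The taxonomy and the function names are hypothetical and only for illustration; this is not WordNet, and NLTK's implementation differs in details such as how it handles senses with several hypernym paths.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}

def path_to_root(node):
    """List of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth in the taxonomy, counting the root as depth 1."""
    return len(path_to_root(node))

def least_common_subsumer(a, b):
    """Most specific ancestor shared by a and b (a node subsumes itself)."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walking upward, the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")

def wup(a, b):
    """Wu-Palmer score: 2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("dog", "dog"))              # 1.0, a sense compared with itself
print(round(wup("dog", "cat"), 3))    # 0.667, siblings under "animal"
print(round(wup("dog", "plant"), 3))  # 0.4, only the root is shared
```

Note that the score depends only on positions in the taxonomy, so it carries no information about word order or about which attribute belongs to which subject in a sentence.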

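Since cricket and football share a close common ancestor in the taxonomy, wup_similarity scores them as very similar. On top of that, both sentences contain exactly the same words, so a word-by-word comparison finds a perfect match for every token and the overall score comes out as 1.

If you want to fix this issue, wup_similarity is not a good choice. The simplest token-based way is to fit a vectorizer and then calculate the similarity between the resulting vectors, e.g.: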

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["football is good,cricket is bad", "cricket is good,football is bad"]

# Count unigrams, bigrams and trigrams so that word order affects the vectors
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(corpus)

x1 = vectorizer.transform([corpus[0]])
x2 = vectorizer.transform([corpus[1]])

# Below 1.0, because some bigrams and all trigrams differ between the sentences
print(cosine_similarity(x1, x2))
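To see why the n-gram range matters, here is a dependency-free sketch of the same idea: count n-grams by hand and take the cosine of the count vectors. The helpers ngram_counts and cosine are written for this illustration and are not part of scikit-learn; tokenization is a plain \w+ split, which is an assumption and may differ from CountVectorizer's defaults on other inputs.

```python
import math
import re
from collections import Counter

def ngram_counts(text, max_n):
    """Count all word n-grams of length 1..max_n in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = "football is good,cricket is bad"
s2 = "cricket is good,football is bad"

# Unigrams only: both sentences contain the same words, so order is invisible.
unigram_score = cosine(ngram_counts(s1, 1), ngram_counts(s2, 1))
# Up to trigrams: word order now changes some of the counted features.
trigram_score = cosine(ngram_counts(s1, 3), ngram_counts(s2, 3))

print(round(unigram_score, 3))  # 1.0
print(round(trigram_score, 3))  # 0.706
```

With unigrams alone the two sentences are indistinguishable. Adding bigrams and trigrams lowers the score to 12/17, because features such as "good cricket" and "football is good" appear in only one of the sentences.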

There are more intelligent methods to measure semantic similarity, though. One that is easy to try is Google's Universal Sentence Encoder (USE). See this link

Upvotes: 1
