udit_kumar

Reputation: 66

How to get the probability that two strings mean the same thing

I have already visited almost every post related to this, but most of them calculate the probability based on words shared between the strings. Is there any way of getting the probability that two statements have the same meaning even when they use different words? E.g. "Python is the right option for ML" and "Best language for Machine Learning is Python". This example should return True (the probability should be at least 0.5) since both sentences mean the same thing.

In the code below, similarity is calculated only from the words present in both strings.

# Program to measure the similarity between
# two sentences using cosine similarity.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires the NLTK 'punkt' and 'stopwords' data packages;
# run nltk.download('punkt') and nltk.download('stopwords') once if missing

# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X = "I love horror movies"
Y = "Lights out is a horror movie"

# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []

# remove stop words from the strings
X_set = {w for w in X_list if w not in sw}
Y_set = {w for w in Y_list if w not in sw}

# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    l1.append(1 if w in X_set else 0)  # presence/absence vector for X
    l2.append(1 if w in Y_set else 0)  # presence/absence vector for Y

# cosine formula: dot product over the product of the vector norms
c = 0
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity:", cosine)

Thanks in advance.

Upvotes: 0

Views: 835

Answers (1)

Akshay Sehgal

Reputation: 19322

If you want to work with text semantics, you will have to use algorithms built for it, such as word2vec, GloVe, or fastText.

You can use pre-trained word embeddings that have been trained on a lot of data and encode the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.

These word embeddings are n-dimensional vector representations of a large vocabulary of words. The vectors can be summed to create a representation of a sentence's embedding. Sentences whose words have similar semantics will have similar vectors, and thus their sentence embeddings will also be similar.

You can check the similarity between these sentence embeddings using cosine similarity:

from scipy import spatial
import numpy as np
import gensim.downloader as api

# choose from multiple models: https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    # lowercase and split into tokens
    return [i.lower() for i in s.split()]

def get_vector(s):
    # sum the word vectors of all tokens into one sentence vector
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)

# semantic similarity between sentence pairs (1 - cosine distance)
print('s0 vs s1 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))

Output:

s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
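
Note that summing word vectors makes longer sentences produce larger-magnitude embeddings. Cosine similarity is scale-invariant, so this mostly does not matter, but averaging is a common variant (a sketch, not something I benchmarked here):

def get_vector_mean(s):
    # variant: mean instead of sum, so sentence length
    # does not change the magnitude of the embedding
    return np.mean(np.array([model[i] for i in preprocess(s)]), axis=0)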

You can choose a better model than the one I used from here - https://github.com/RaRe-Technologies/gensim-data
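
If you are not sure which model to pick, gensim's downloader can also list everything available programmatically (a small sketch; the exact names come from the gensim-data repository above):

import gensim.downloader as api

# list the names of all pre-trained models available for download
print(list(api.info()['models'].keys()))
# e.g. 'word2vec-google-news-300', 'glove-wiki-gigaword-300', ...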

Upvotes: 1
