Sourav Das

Reputation: 51

Compare two sentences on the basis of grammar using NLP

I have two sentences to compare on the basis of their grammar using NLP. I am completely new to NLP and want to know if there is an algorithm to do this. I already know how to compare sentences using word similarity and sentiment.

Upvotes: 2

Views: 2556

Answers (1)

sgDysregulation

Reputation: 4417

You can use NLTK WordNet's synsets to measure the similarity between two sentences.
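For orientation, a minimal illustration of the building blocks (assuming the WordNet data has been downloaded via nltk.download('wordnet')): wn.synsets returns one Synset per sense of a word, and path_similarity scores two synsets by the shortest path between them in the hypernym graph.

from nltk.corpus import wordnet as wn

# each Synset is one sense of the word for the given part of speech
print(wn.synsets('dog', 'n')[:2])   # [Synset('dog.n.01'), Synset('frump.n.01')]

# similarity based on the shortest hypernym path; 1.0 means the same synset
dog = wn.synsets('dog', 'n')[0]
cat = wn.synsets('cat', 'n')[0]
print(dog.path_similarity(cat))     # 0.2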

Here is how to generate all possible synsets without having to specify the grammar; you can later choose which synsets to use based on some criterion.

import pandas as pd
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn
import itertools

# use a stemmer so inflected forms ("dogs") still match a WordNet entry
stm = PorterStemmer()
sent1 = "I like hot dogs"
sent2 = "My father's favourite food is hot dog"

# convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets
tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}

def synsets_by_word(sentence):
    """Map each word to its WordNet synsets, dropping words with none."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    result = {}
    for word, tag in tagged:
        if tag[0] in tag_dict:
            synsets = wn.synsets(stm.stem(word), tag_dict[tag[0]])
            if synsets:
                result[word] = synsets
    return result

s1 = synsets_by_word(sent1)
s2 = synsets_by_word(sent2)

Here is a sample of the values in the dictionary s1:

dogs    [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n...
hot     [Synset('hot.a.01'), Synset('hot.s.02'), Synset('hot.a.03'), Synset('hot.s.0...
like    [Synset('wish.v.02'), Synset('like.v.02'), Synset('like.v.03'), Synset('like...

Here is one way: measure the similarity between all possible synset pairs of the two words, then take the max.

res = {}
for w2, gr2 in s2.items():
    for w1, gr1 in s1.items():
        # score every synset pair; path_similarity returns None when
        # two synsets share no path, so dropna() discards those
        tmp = pd.Series([g2.path_similarity(g1)
                         for g1, g2 in itertools.product(gr1, gr2)]).dropna()
        if len(tmp) > 0:
            res[(w1, w2)] = tmp.max()
print(res)

Output

{('dogs', 'dog'): 1.0,
 ('dogs', 'father'): 0.16666666666666666,
 ('dogs', 'food'): 0.25,
 ('dogs', 'is'): 0.10000000000000001,
 ('hot', 'hot'): 1.0,
 ('hot', 'is'): 0.33333333333333331,
 ('like', 'is'): 0.33333333333333331}

Now we find the max similarity each word in a sentence achieves, then take the mean.

# group by the first-sentence word (level 0 of the (w1, w2) keys),
# take each word's best match, then average those maxima
similarity = pd.Series(res).groupby(level=0).max().mean()
print(similarity)

The output is 0.778: the per-word maxima are 1.0 for dogs, 1.0 for hot, and 0.333 for like, and their mean is 0.778.
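Equivalently, a pandas-free sketch of the same aggregation, to make the groupby explicit:

best = {}
for (w1, w2), sim in res.items():
    # keep the best match found for each word of the first sentence
    best[w1] = max(best.get(w1, 0.0), sim)
print(sum(best.values()) / len(best))   # 0.777...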

The above is a common approach for measuring document similarity. If you're looking to compare grammar, you may want to run a part-of-speech tagger like pos_tag (or use a tagged corpus like nltk.corpus.brown.tagged_words()) on both sentences, then find the Jaccard distance between the tags, as in the sketch below.
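A minimal sketch of that idea, using nltk.metrics.distance.jaccard_distance over the two tag sets (comparing unordered sets rather than tag sequences is an assumption here; an order-sensitive comparison would need a different measure, such as edit distance over the tag sequences):

import nltk
from nltk.metrics.distance import jaccard_distance

sent1 = "I like hot dogs"
sent2 = "My father's favourite food is hot dog"

# collect the set of POS tags appearing in each sentence
tags1 = {tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sent1))}
tags2 = {tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sent2))}

# Jaccard distance: 0.0 means identical tag sets, 1.0 means disjoint
print(jaccard_distance(tags1, tags2))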

Upvotes: 2
