Nishant

Reputation: 1121

How to calculate a pairwise Jaccard similarity score for every row in a DataFrame using Python

I have a DataFrame as below:

import pandas as pd

df=pd.DataFrame.from_dict({"q1":['What is the step by step guide to invest in share market in india?',
                                'What is the story of Kohinoor (Koh-i-Noor) Diamond?',
                                'How can I increase the speed of my internet connection while using a VPN?',
                                'Why am I mentally very lonely? How can I solve it?',
                                'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?'],
                          "q2":['What is the step by step guide to invest in share market?',
                                'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',
                                'How can Internet speed be increased by hacking through DNS?',
                                'Find the remainder when [math]23^{24}[/math] is divided by 24,23?',
                                'Which fish would survive in salt water?']})

df

I am trying to find the Jaccard similarity score between each pair of sentences from the q1 and q2 columns iteratively (with map or apply and a list comprehension), and store the result in a new column jac_q1_q2.

For a single row, it can be done as:

import nltk

# Jaccard distance between the sets of characters in the first q1/q2 pair
jd_sent_1_2 = nltk.jaccard_distance(set(df['q1'][0]), set(df['q2'][0]))

jd_sent_1_2
>0.0
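
Note that set() on a string yields the set of its characters, so the call above compares character sets rather than words, which is why two different sentences can still score 0.0. A word-level variant, assuming simple whitespace tokenization is acceptable, would be:

import nltk

# Compare sets of whitespace-separated tokens instead of characters
jd_words_1_2 = nltk.jaccard_distance(set(df['q1'][0].split()),
                                     set(df['q2'][0].split()))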

Thanks

Upvotes: 0

Views: 403

Answers (2)

Nishant

Reputation: 1121

One can use a list comprehension over the paired columns:

df['jac_sim'] = [nltk.jaccard_distance(set(text1), set(text2)) for text1, text2 in zip(df['q1'], df['q2'])]
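
Note that nltk.jaccard_distance returns a distance, i.e. 1 minus the similarity. If an actual similarity score is wanted, as the title asks, subtract from 1; a minimal sketch using the question's column name jac_q1_q2:

# Jaccard similarity = 1 - Jaccard distance, per row
df['jac_q1_q2'] = [1 - nltk.jaccard_distance(set(q1), set(q2))
                   for q1, q2 in zip(df['q1'], df['q2'])]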

Upvotes: 0

JEFFRIN JACOB

Reputation: 267

It can be done using apply() with a lambda function:

scores = df.apply(lambda row: nltk.jaccard_distance(set(row['q1']), set(row['q2'])), axis=1)
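
Since apply() with axis=1 returns a Series aligned with df's index, the result can then be stored in the new column the question asks for:

# Assign the per-row scores to the requested column
df['jac_q1_q2'] = scores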

Upvotes: 1
