Reputation: 7255
Here's my first dataframe, df1:
Id Text
1 Asoy Geboy Ngebut
2 Asoy kita Geboy
3 Bersatu kita Teguh
Here's my second dataframe, df2:
Id Text
1 Bersatu Kita
2 Asoy Geboy Jalanan
Similarity matrix (columns are Id from df1, rows are Id from df2):
     1     2     3
1    0     0.33  1
2    0.66  0.66  0
Note (cells are given as (column, row)):
0 in (1,1) and (3,2) because no words are shared
1 in (3,1) because both `Bersatu` and `Kita` (Id 1 in df2) are available in Id 3 of df1
0.33 is counted when 1 of 3 words is shared
0.66 is counted when 2 of 3 words are shared
Upvotes: 1
Views: 320
Reputation: 260640
IIUC, you need to compute a set intersection:
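import pandas as pd

# one set of lowercased words per row of each dataframe
l1 = [set(x.split()) for x in df1['Text'].str.lower()]
l2 = [set(x.split()) for x in df2['Text'].str.lower()]
# shared words divided by the word count of the df1 text
pd.DataFrame([[len(s1&s2)/len(s1) for s1 in l1] for s2 in l2],
             columns=df1['Id'], index=df2['Id'])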
output:
Id 1 2 3
Id
1 0.000000 0.333333 0.666667
2 0.666667 0.666667 0.000000
NB. The condition on the denominator is not fully clear: for {teguh, kita, bersatu} vs {kita, bersatu}, I count 2/3 = 0.666.
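To make the denominator question concrete, here is a minimal self-contained sketch that rebuilds the two sample dataframes and lets you switch the denominator. The helper name `word_overlap` and its `denom` parameter are hypothetical, introduced only for illustration; which rule is actually wanted is an assumption you would need to confirm.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"Id": [1, 2, 3],
                    "Text": ["Asoy Geboy Ngebut", "Asoy kita Geboy", "Bersatu kita Teguh"]})
df2 = pd.DataFrame({"Id": [1, 2],
                    "Text": ["Bersatu Kita", "Asoy Geboy Jalanan"]})

def word_overlap(df_cols, df_rows, denom="cols"):
    """Hypothetical helper: share of common words between every pair of texts.

    denom="cols" divides by the word count of the column text (df_cols),
    denom="rows" divides by the word count of the row text (df_rows).
    """
    col_sets = [set(t.split()) for t in df_cols["Text"].str.lower()]
    row_sets = [set(t.split()) for t in df_rows["Text"].str.lower()]

    def ratio(c, r):
        size = len(c) if denom == "cols" else len(r)
        # guard against empty texts to avoid dividing by zero
        return len(c & r) / size if size else 0.0

    data = [[ratio(c, r) for c in col_sets] for r in row_sets]
    return pd.DataFrame(data, columns=df_cols["Id"], index=df_rows["Id"])

by_cols = word_overlap(df1, df2, denom="cols")  # same rule as the snippet above
by_rows = word_overlap(df1, df2, denom="rows")  # divide by the df2 word count instead
print(by_cols.round(2))
print(by_rows.round(2))
```

With `denom="cols"` this reproduces the output shown above (0.67 for column 3, row 1). With `denom="rows"` that cell becomes 1.0, as in the question, but column 2, row 1 becomes 0.5 rather than 0.33, so neither single rule matches every expected value.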
Upvotes: 1