Reputation: 1477
I have two dataframes of unequal length and would like to compare the similarity of the strings in df2 against those in df1. Is it possible to apply the Jaro-Winkler distance method to calculate the string similarity across the two dataframes using a map/lambda function?
df1
Behavioral disorders
Behçet disease
AV-Block
df2
Behavioral disorder
Behçet syndrome
The desired output is:
name_left             name_right           score
Behavioral disorders  Behavioral disorder  0.933333
Behçet disease        Behçet syndrome      0.865342
The scores shown above are hypothetical. Any help is highly appreciated.
Upvotes: 0
Views: 1627
Reputation: 260290
Assuming you want, for each name in df2, the df1 name with the maximum score, and that the column in both input dataframes is called "name":
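# pip install jaro-winkler
# https://pypi.org/project/jaro-winkler/
import pandas as pd
from jaro import jaro_winkler_metric as jw

# For each name in df2, keep the df1 name with the highest Jaro-Winkler score.
# Note that max() needs the scoring function passed as key=...
pd.DataFrame([[n2, *max([(n1, jw(n1, n2)) for n1 in df1['name']],
                        key=lambda x: x[1])]
              for n2 in df2['name']],
             index=df2.index,
             columns=['name_right', 'name_left', 'score']
             )[['name_left', 'name_right', 'score']]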
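If it helps to see what the score is doing without installing anything, here is a rough pure-Python sketch of the same "best match per df2 name" idea. The helper names `jaro`, `jaro_winkler` and `best_matches` are made up for this illustration and are not part of the jaro-winkler package. It assumes the usual Winkler settings (prefix weight 0.1, prefix capped at 4 characters), so the scores may differ slightly from what a given library returns.

```python
def jaro(s1, s2):
    """Plain Jaro similarity in [0, 1] (illustrative helper, not a library API)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix (assumed defaults: p=0.1, 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


def best_matches(left_names, right_names):
    """For each right name, return (best left name, right name, score)."""
    rows = []
    for right in right_names:
        left, score = max(((l, jaro_winkler(l, right)) for l in left_names),
                          key=lambda pair: pair[1])
        rows.append((left, right, score))
    return rows


# Sample data taken from the question
df1_names = ["Behavioral disorders", "Behçet disease", "AV-Block"]
df2_names = ["Behavioral disorder", "Behçet syndrome"]
rows = best_matches(df1_names, df2_names)
```

The resulting `rows` can then be handed to `pd.DataFrame(rows, columns=['name_left', 'name_right', 'score'])` to get the layout asked for in the question. This compares every df2 name against every df1 name, so it is only practical for small to medium inputs.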
Upvotes: 0