Reputation: 354
I have a training dataset for eg.
Letter Word
A Apple
B Bat
C Cat
D Dog
E Elephant
and I need to check the dataframe such as
AD Apple Dog
AE Applet Elephant
DC Dog Cow
EB Elephant Bag
AED Apple Elephant Dog
D Door
ABC All Bat Cat
the instances AD,AE,EB
are almost accurate (Apple and Applet are considered closer to each other, similar for Bat and Bag) but DC
doesn't match.
Output Required:
Letters Words Status
AD Apple Dog Accept
AE Applet Elephant Accept
DC Dog Cow Reject
EB Elephant Bag Accept
AED Apple Elephant Dog Accept
D Door Reject
ABC All Bat Cat Accept
ABC
accepted because 2 of 3 words match.
The words accepted need to be matched 70% (Fuzzy Match). yet, threshold subject to change. How can I find these matches using Python.
Upvotes: 0
Views: 1949
Reputation: 120409
You can use thefuzz
to solve your problem:
# Python env: pip install thefuzz
# Conda env: conda install thefuzz
from thefuzz import fuzz
THRESHOLD = 70
df2['Others'] = (df2['Letters'].agg(list).explode().reset_index()
.merge(df1, left_on='Letters', right_on='Letter')
.groupby('index')['Word'].agg(' '.join))
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')
Output:
>>> df2
Letters Words Others Ratio Status
0 AD Apple Dog Apple Dog 100 Accept
1 AE Applet Elephant Apple Elephant 97 Accept
2 DC Dog Cow Dog Cat 71 Accept
3 EB Elephant Bag Elephant Bat 92 Accept
4 AED Apple Elephant Dog Apple Dog Elephant 78 Accept
5 D Door Dog 57 Reject
6 ABC All Bat Cat Apple Cat Bat 67 Reject
Upvotes: 1