spd

Reputation: 354

Fuzzy String Matching using Python

I have a training dataset, e.g.:

Letter    Word
A         Apple
B         Bat
C         Cat
D         Dog
E         Elephant

and I need to check a dataframe such as:

AD    Apple Dog
AE    Applet Elephant
DC    Dog Cow
EB    Elephant Bag
AED   Apple Elephant Dog  
D     Door                
ABC   All Bat Cat         

The instances AD, AE and EB are close enough to count as matches (Apple and Applet are considered close to each other, as are Bat and Bag), but DC doesn't match.

Output Required:

Letters    Words               Status
AD         Apple Dog           Accept
AE         Applet Elephant     Accept
DC         Dog Cow             Reject
EB         Elephant Bag        Accept
AED        Apple Elephant Dog  Accept
D          Door                Reject
ABC        All Bat Cat         Accept

ABC is accepted because 2 of 3 words match.

Accepted words need to match at 70% or better (fuzzy match), though the threshold is subject to change. How can I find these matches using Python?

Upvotes: 0

Views: 1949

Answers (1)

Corralien

Reputation: 120409

You can use thefuzz to solve your problem:

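# Python env: pip install thefuzz
# Conda env: conda install thefuzz
import numpy as np
from thefuzz import fuzz

THRESHOLD = 70

# df1: reference table with columns 'Letter' and 'Word'
# df2: table to check, with columns 'Letters' and 'Words'

# Split each 'Letters' value into single letters, look each one up in df1,
# then join the reference words back into one string per original row.
df2['Others'] = (df2['Letters'].map(list).explode().reset_index()
                     .merge(df1, left_on='Letters', right_on='Letter')
                     .groupby('index')['Word'].agg(' '.join))

# Score the whole 'Words' string against the reference string (0-100).
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')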

Output:

>>> df2
  Letters               Words              Others  Ratio  Status
0      AD           Apple Dog           Apple Dog    100  Accept
1      AE     Applet Elephant      Apple Elephant     97  Accept
2      DC             Dog Cow             Dog Cat     71  Accept
3      EB        Elephant Bag        Elephant Bat     92  Accept
4     AED  Apple Elephant Dog  Apple Dog Elephant     78  Accept
5       D                Door                 Dog     57  Reject
6     ABC         All Bat Cat       Apple Cat Bat     67  Reject
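
Note that this whole-string ratio accepts DC and rejects ABC, which differs from the Status column requested in the question. If the intent is to score each word against the word for its letter and accept a row when most of its words match, a per-word check may be closer. Below is a minimal sketch of that idea, not a definitive implementation. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio (scores can differ between the two), and the names REFERENCE, WORD_THRESHOLD, word_score and status are made up for illustration. The "more than half of the words" rule is an assumption inferred from the DC and ABC rows. The cutoff is set to 60 rather than 70 because "Bat" vs "Bag" scores about 67, so a per-word cutoff of 70 would reject EB.

```python
from difflib import SequenceMatcher

# Reference table from the question (Letter -> Word).
REFERENCE = {"A": "Apple", "B": "Bat", "C": "Cat", "D": "Dog", "E": "Elephant"}

# Assumed per-word cutoff: 60 rather than 70 so that "Bag" (about 67) passes.
WORD_THRESHOLD = 60


def word_score(expected, actual):
    """Similarity of two words on a 0-100 scale, ignoring case."""
    return 100 * SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def status(letters, words, threshold=WORD_THRESHOLD):
    """Accept when more than half of the words match their letter's word."""
    tokens = words.split()
    if len(tokens) != len(letters):
        return "Reject"  # assumption: one word per letter, in the same order
    hits = sum(
        letter in REFERENCE and word_score(REFERENCE[letter], token) >= threshold
        for letter, token in zip(letters, tokens)
    )
    return "Accept" if 2 * hits > len(tokens) else "Reject"


rows = [
    ("AD", "Apple Dog"),
    ("AE", "Applet Elephant"),
    ("DC", "Dog Cow"),
    ("EB", "Elephant Bag"),
    ("AED", "Apple Elephant Dog"),
    ("D", "Door"),
    ("ABC", "All Bat Cat"),
]
results = {letters: status(letters, words) for letters, words in rows}
for letters, words in rows:
    print(f"{letters:<5}{words:<20}{results[letters]}")
```

With a dataframe, the same function could be applied row by row, e.g. df2.apply(lambda x: status(x['Letters'], x['Words']), axis=1). Because each word is compared with the word for the letter in the same position, the result does not depend on the order in which a merge returns rows.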

Upvotes: 1
