Nirvik Banerjee

Reputation: 335

Looking for a quicker way of fuzzy string matching

I am using fuzzywuzzy in Python for fuzzy string matching. I have a list of names, HKCP_list, which I match against a pandas column row by row to get the best possible match for each row. The code is given below:

from fuzzywuzzy import fuzz, process

def search_func(row):
    # Best (name, score) pair from HKCP_list for this value
    chk = process.extract(row, HKCP_list, scorer=fuzz.token_sort_ratio)[0]
    return chk

wc_df['match']=wc_df['concat_name'].map(search_func)

The wc_df dataframe contains the column 'concat_name' which needs to be matched with every name in the list HKCP_list. The above code took around 2 hours to run with 6K names in the list and 11K names in the column 'concat_name'.

I have to rerun this on another data set where there are 89K names in the list and 120K names in the column. To speed up the process, I got an idea from the following question on Stack Overflow:

Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column

One of the comments on the answer to that question advises comparing only names that share the same 1st letter. The 'concat_name' column that I am comparing is a derived column, obtained by concatenating the 'first_name' and 'last_name' columns in the dataframe. Since I am using a token sort score, I compare the 1st letter of both first_name and last_name with the 1st letter of each element in the list. Given below is the code:

wc_df['first_name_1stletter'] = wc_df['first_name'].str[0]
wc_df['last_name_1stletter'] = wc_df['last_name'].str[0]

import time
start_time=time.time()
def match_func(row):
    CP_subset=[x for x in HKCP_list if x[0]==row['first_name_1stletter'] or x[0]==row['last_name_1stletter']]
    return CP_subset
wc_df['list_to_match']=wc_df.apply(match_func,axis=1)
end_time=time.time()
print(end_time-start_time)

The above step took 1600 seconds with the 6K x 11K data. The 'list_to_match' column contains, for each concat_name, the list of names it should be compared against. I would now have to take each list_to_match value and run the fuzzy string matching on it with the process.extract method. Is there a more elegant and faster way of doing this in the same step as above?

PS: Editing this to add an example of what the list and the dataframe column look like.

HKCp_list=['jeff bezs','michael blomberg','bill gtes','tim coook','elon musk']
concat_name=['jeff bezos','michael bloomberg','bill gates','tim cook','elon musk','donald trump','kim jong un', 'narendra modi','michael phelps']
first_name=['jeff','michael','bill','tim','elon','donald','kim','narendra','michael']
last_name=['bezos','bloomberg','gates','cook','musk','trump','jong un', 'modi','phelps']
import pandas as pd
df=pd.DataFrame({'first_name':first_name,'last_name':last_name,'concat_name':concat_name})

Each row of the 'concat_name' column in df has to be compared against the elements of HKCp_list.

PS: editing today to reflect the ":" and the row in the 2nd snippet of code I missed yesterday

Regards, Nirvik

Upvotes: 1

Views: 1200

Answers (2)

Nirvik Banerjee

Reputation: 335

Given below is the code I used to build the list of candidate names dynamically for each row and match against it in the same step:

import fuzzywuzzy
from fuzzywuzzy import fuzz,process

wc_df['first_name_1stletter'] = wc_df['first_name'].str[0]
wc_df['last_name_1stletter'] = wc_df['last_name'].str[0]

import time
start_time=time.time()
def match_func(row):

    CP_subset=[x for x in HKCP_list if x[0]==row['first_name_1stletter'] or x[0]==row['last_name_1stletter']]
    if len(CP_subset)>0:
        chk=process.extract(row['concat_name'],CP_subset,scorer=fuzz.token_sort_ratio)[0]
    else:
        chk = "No item to match"

    return chk

wc_df['match']=wc_df.apply(match_func,axis=1)

end_time=time.time()
print(end_time-start_time)

The above code took around 2600 seconds for the 6K x 11K comparisons, compared with around 7000 seconds for the 1st snippet of code posted in the question.
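
The list comprehension above rescans the whole of HKCP_list for every row. One possible refinement, sketched below, is to group the names by 1st letter once, so that each row only needs two dictionary lookups. This is a sketch under assumptions, not a measured result: build_index, best_match and token_sort_score are hypothetical names, token_sort_score is a standard-library stand-in for fuzz.token_sort_ratio (so its scores will not match fuzzywuzzy's exactly), and first and last names are assumed to be non-empty.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Stand-in for fuzz.token_sort_ratio: sort the tokens, then compare."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


def build_index(names):
    """Group names by their 1st letter (done once, not once per row)."""
    index = defaultdict(list)
    for name in names:
        if name:
            index[name[0]].append(name)
    return index


def best_match(first_name, last_name, index):
    """Return (name, score) for the best candidate, or None if there is none."""
    concat_name = f"{first_name} {last_name}"
    # Same candidate rule as above: 1st letter of the first or the last name.
    # A set avoids listing the same bucket twice when both letters are equal.
    letters = {first_name[0], last_name[0]}
    candidates = [name for letter in letters for name in index.get(letter, [])]
    if not candidates:
        return None
    best = max(candidates, key=lambda name: token_sort_score(concat_name, name))
    return best, token_sort_score(concat_name, best)


HKCP_list = ['jeff bezs', 'michael blomberg', 'bill gtes', 'tim coook', 'elon musk']
index = build_index(HKCP_list)
print(best_match('jeff', 'bezos', index))
```

With a dataframe this could be applied row by row in the same way as match_func, passing row['first_name'] and row['last_name']. Whether it is faster on the real data would need to be timed.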

Upvotes: 1

Erfan

Reputation: 42916

You can try this function I wrote in another answer. I'm not 100% sure how it holds up in terms of speed, so you can try it for yourself:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Make dataframe out of list
HKCp = pd.DataFrame({'names':HKCp_list})

# Use fuzzy_merge function (defined in the other answer, not reproduced here)
fuzzy_merge(df, HKCp, 'concat_name', 'names')

Output

  first_name  last_name        concat_name           matches
0       jeff      bezos         jeff bezos         jeff bezs
1    michael  bloomberg  michael bloomberg  michael blomberg
2       bill      gates         bill gates         bill gtes
3        tim       cook           tim cook         tim coook
4       elon       musk          elon musk         elon musk
5     donald      trump       donald trump                  
6        kim    jong un        kim jong un                  
7   narendra       modi      narendra modi                  
8    michael     phelps     michael phelps                  

Note that you can play with the threshold argument: lowering it also returns less exact matches.
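
fuzzy_merge itself is not reproduced in this answer. As a rough illustration of what a threshold does, here is a minimal sketch that uses only the standard library. The names similarity and match_with_threshold are hypothetical, and difflib's ratio stands in for the fuzzywuzzy scorer, so the scores (and the threshold value that works well) will differ from those of fuzzy_merge. The list of choices is assumed to be non-empty.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score from 0 to 100; a stand-in for a fuzzywuzzy scorer."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def match_with_threshold(name, choices, threshold=90):
    """Return the best-scoring choice, or '' if it scores below threshold."""
    best = max(choices, key=lambda choice: similarity(name, choice))
    return best if similarity(name, best) >= threshold else ''


HKCp_list = ['jeff bezs', 'michael blomberg', 'bill gtes', 'tim coook', 'elon musk']
concat_name = ['jeff bezos', 'michael bloomberg', 'donald trump']

for name in concat_name:
    print(repr(name), '->', repr(match_with_threshold(name, HKCp_list)))
```

A high threshold leaves names without a close counterpart blank, as in the output above; a lower one fills in more rows at the cost of weaker matches.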

Upvotes: 1
