Reputation: 335
I am using fuzzywuzzy in Python for fuzzy string matching. I have a set of names in a list named HKCP_list, which I match against a pandas column row by row to get the best possible match. The code is given below:
import fuzzywuzzy
from fuzzywuzzy import fuzz, process
def search_func(row):
    # Best (name, score) pair from HKCP_list for this concat_name value
    chk = process.extract(row, HKCP_list, scorer=fuzz.token_sort_ratio)[0]
    return chk
wc_df['match'] = wc_df['concat_name'].map(search_func)
The wc_df dataframe contains the column 'concat_name' which needs to be matched with every name in the list HKCP_list. The above code took around 2 hours to run with 6K names in the list and 11K names in the column 'concat_name'.
I have to rerun this on another data set where there are 89K names in the list and 120K names in the column. To speed up the process, I got an idea from the following question on Stack Overflow:
Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column
In one of the comments in the answer in the above, it has been advised to compare names that have the same 1st letter. The 'concat_name' column that I am comparing with is a derived column obtained by concatenating 'first_name' and 'last_name' columns in the dataframe. Hence I am using the following function to match the 1st letter (since this is a token sort score that I am considering, I am comparing the 1st letter of both the first_name and last_name with the elements in the list). Given below is the code:
wc_df['first_name_1stletter'] = wc_df['first_name'].str[0]
wc_df['last_name_1stletter'] = wc_df['last_name'].str[0]
import time
start_time=time.time()
def match_func(row):
    CP_subset=[x for x in HKCP_list if x[0]==row['first_name_1stletter'] or x[0]==row['last_name_1stletter']]
    return CP_subset
wc_df['list_to_match']=wc_df.apply(match_func,axis=1)
end_time=time.time()
print(end_time-start_time)
The above step took 1600 seconds with the 6K x 11K data. The 'list_to_match' column contains the list of names to be compared for each concat_name. From here I would again have to take each list_to_match entry and run the fuzzy string matching on it with the process.extract method. Is there a more elegant and faster way of doing this in the same step as above?
PS: Editing this to add an example of what the list and the dataframe column look like.
HKCp_list=['jeff bezs','michael blomberg','bill gtes','tim coook','elon musk']
concat_name=['jeff bezos','michael bloomberg','bill gates','tim cook','elon musk','donald trump','kim jong un', 'narendra modi','michael phelps']
first_name=['jeff','michael','bill','tim','elon','donald','kim','narendra','michael']
last_name=['bezos','bloomberg','gates','cook','musk','trump','jong un', 'modi','phelps']
import pandas as pd
df=pd.DataFrame({'first_name':first_name,'last_name':last_name,'concat_name':concat_name})
Each row of the 'concat_name' column in df has to be compared against the elements of HKCp_list.
PS: editing today to reflect the ":" and the row in the 2nd snippet of code I missed yesterday
Regards, Nirvik
Upvotes: 1
Views: 1200
Reputation: 335
Given below is the code I used to build the comparison list dynamically for each row and run the match in the same step:
import fuzzywuzzy
from fuzzywuzzy import fuzz,process
wc_df['first_name_1stletter'] = wc_df['first_name'].str[0]
wc_df['last_name_1stletter'] = wc_df['last_name'].str[0]
import time
start_time=time.time()
def match_func(row):
    CP_subset=[x for x in HKCP_list if x[0]==row['first_name_1stletter'] or x[0]==row['last_name_1stletter']]
    if len(CP_subset)>0:
        chk=process.extract(row['concat_name'],CP_subset,scorer=fuzz.token_sort_ratio)[0]
    else:
        chk = "No item to match"
    return chk
wc_df['match']=wc_df.apply(match_func,axis=1)
end_time=time.time()
print(end_time-start_time)
The above code for 6K X 11K comparisons took around 2600 seconds instead of the 7000 seconds as per the 1st snippet of the code posted in the question.
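The list comprehension above still scans the whole of HKCP_list once per row. A minimal sketch of one possible refinement is to bucket the list by first letter a single time and then look up at most two buckets per row. The helper names (build_letter_index, best_match, token_sort_score) are hypothetical, and the scorer is a standard-library stand-in so the sketch runs without fuzzywuzzy installed. Its scores will not exactly equal fuzzywuzzy's. Any function taking two strings and returning a score, such as fuzz.token_sort_ratio, could be passed as scorer instead. I have not timed this on the full data.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Stdlib stand-in for a token-sort scorer: sort tokens, then compare (0-100)."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


def build_letter_index(names):
    """Group names by their first character; done once for the whole list."""
    index = defaultdict(list)
    for name in names:
        if name:  # skip empty strings
            index[name[0]].append(name)
    return index


def best_match(first_name, last_name, concat_name, index, scorer=token_sort_score):
    """Return the best (name, score) among names sharing a first letter, or None."""
    letters = {first_name[:1], last_name[:1]}
    candidates = [name for letter in letters for name in index.get(letter, [])]
    if not candidates:
        return None
    scored = ((name, scorer(concat_name, name)) for name in candidates)
    return max(scored, key=lambda pair: pair[1])


# Sample list from the question
HKCp_list = ['jeff bezs', 'michael blomberg', 'bill gtes', 'tim coook', 'elon musk']
index = build_letter_index(HKCp_list)
print(best_match('jeff', 'bezos', 'jeff bezos', index))
```

On a dataframe this could be applied row by row, for example with wc_df.apply(lambda r: best_match(r['first_name'], r['last_name'], r['concat_name'], index), axis=1). Note that first-letter blocking only narrows the candidates: a row can still receive a low-scoring match from the same bucket, so a minimum score cut-off may be worth adding.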
Upvotes: 1
Reputation: 42916
You can try the fuzzy_merge function I wrote in another answer. I'm not 100% sure how it holds up in terms of speed, so try it for yourself:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Make dataframe out of list
HKCp = pd.DataFrame({'names':HKCp_list})
# Use fuzzy_merge function (defined in the other answer, not shown here)
fuzzy_merge(df, HKCp, 'concat_name', 'names')
Output:
   first_name  last_name        concat_name           matches
0        jeff      bezos         jeff bezos         jeff bezs
1     michael  bloomberg  michael bloomberg  michael blomberg
2        bill      gates         bill gates         bill gtes
3         tim       cook           tim cook         tim coook
4        elon       musk          elon musk         elon musk
5      donald      trump       donald trump
6         kim    jong un        kim jong un
7    narendra       modi      narendra modi
8     michael     phelps     michael phelps
Note that you can adjust the threshold argument to allow less exact matches.
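The definition of fuzzy_merge is not reproduced here, so the following is only a rough plain-Python sketch of the thresholding idea, not that function itself. The names matches_above and similarity, and the default values for threshold and limit, are assumptions chosen for illustration. The scorer is a standard-library stand-in rather than fuzzywuzzy's formula. Candidates scoring below the threshold are dropped, which is why rows such as 'donald trump' end up with an empty match.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stdlib stand-in scorer on a 0-100 scale (not fuzzywuzzy's exact formula)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def matches_above(query, choices, threshold=90, limit=2, scorer=similarity):
    """Return up to `limit` choices scoring at least `threshold`, best first."""
    scored = sorted(((scorer(query, c), c) for c in choices), reverse=True)
    return [choice for score, choice in scored[:limit] if score >= threshold]


# Sample list from the question
HKCp_list = ['jeff bezs', 'michael blomberg', 'bill gtes', 'tim coook', 'elon musk']
for name in ['jeff bezos', 'elon musk', 'donald trump']:
    print(name, '->', ', '.join(matches_above(name, HKCp_list)))
```

Lowering threshold lets weaker matches through, and raising it keeps only near-exact ones.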
Upvotes: 1