Reputation: 583
Here is some sample data:
name        age  gender  school
Michael Z   21   Male    Lasalle
Lisa M      22   Female  Ateneo
James T     21   Male    UP
Michael Z.  23   Male    TUP
Here is the expected output I need:
name        age  gender  similar_name  on_lasalle  on_ateneo  on_up  on_tup
Michael Z   21   Male    Michael Z.    True        False      False  True
Lisa M      22   Female                False       True       False  False
James T     21   Male                  False       False      True   False
I've been trying to use fuzzywuzzy in my Python script. The data I am getting comes from BigQuery; I convert it to a dataframe to clean some things up, and after that I convert the dataframe to a list of dictionaries.
Notice in the data above that Michael Z. from TUP was appended to Michael Z from Lasalle, since their names have a 100% similarity score using fuzz.token_set_ratio.
What I want is to find all rows with similar names and append them to the dictionary currently being looked at (including their school).
Here is the code and the loop that finds similar rows based on names:
data_dict_list = data_df.to_dict('records')

for x in range(0, len(data_dict_list)):
    for y in range(x, len(data_dict_list)):
        if not data_dict_list[x]['is_duplicate']:
            similarity = fuzz.token_set_ratio(data_dict_list[x]['name'], data_dict_list[y]['name'])
            if similarity >= 90:
                data_dict_list[x]['similar_names'].update({'similar_name': data_dict_list[y]['name']})
                ...
                data_dict_list[x]['is_duplicate'] = True
The runtime of this script is very slow, since I sometimes get 100,000+ rows and it loops through all of that data.
How can I speed this up?
Suggestions using pandas are much appreciated, as I am having a hard time figuring out how to loop over data with it.
Upvotes: 1
Views: 392
Reputation: 3265
As a first step you can simply replace the import of fuzzywuzzy with rapidfuzz:
from rapidfuzz import fuzz
which should already improve the performance quite a bit. You can improve it further by comparing complete lists of strings in rapidfuzz in the following way:
>>> import pandas as pd
>>> from rapidfuzz import process, fuzz
>>> df = pd.DataFrame(data={'name': ['test', 'tests']})
>>> process.cdist(df['name'], df['name'], scorer=fuzz.token_set_ratio, score_cutoff=90)
array([[100,   0],
       [  0, 100]], dtype=uint8)
which returns a matrix of results in which all elements with a score below 90 are set to 0. For large datasets you can enable multithreading using the workers argument:
process.cdist(df['name'], df['name'], workers=-1, scorer=fuzz.token_set_ratio, score_cutoff=90)
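From there you still need to map the score matrix back onto your rows. Here is a rough, untested sketch of how that post-processing might look for the sample data in the question; the way duplicates are skipped and the on_<school> column naming are my assumptions, not part of rapidfuzz:

import pandas as pd
import numpy as np
from rapidfuzz import process, fuzz

# Sample data from the question
df = pd.DataFrame({
    'name':   ['Michael Z', 'Lisa M', 'James T', 'Michael Z.'],
    'age':    [21, 22, 21, 23],
    'gender': ['Male', 'Female', 'Male', 'Male'],
    'school': ['Lasalle', 'Ateneo', 'UP', 'TUP'],
})

# Score every name against every other name in one vectorized call;
# entries below the cutoff come back as 0.
scores = process.cdist(df['name'], df['name'],
                       scorer=fuzz.token_set_ratio,
                       score_cutoff=90, workers=-1)
np.fill_diagonal(scores, 0)  # ignore self-matches

schools = df['school'].unique()
rows = []
seen = set()
for i in range(len(df)):
    if i in seen:                       # already reported as someone's duplicate
        continue
    matches = np.nonzero(scores[i])[0]  # indices of rows with similar names
    seen.update(matches)
    row = df.loc[i, ['name', 'age', 'gender']].to_dict()
    row['similar_name'] = ', '.join(df.loc[matches, 'name'])
    matched_schools = set(df.loc[matches, 'school']) | {df.loc[i, 'school']}
    for school in schools:
        row[f'on_{school.lower()}'] = school in matched_schools
    rows.append(row)

result = pd.DataFrame(rows)
print(result)

Since cdist scores all names against each other in a single call (parallelized with workers=-1), the expensive Python-level double loop disappears; only the cheap post-processing loop over the matrix rows remains.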
Upvotes: 2