Tenserflu

Reputation: 583

Fastest way to detect and append duplicates based on a specific column in a dataframe

Here is some sample data:

name         age       gender     school
Michael Z    21        Male       Lasalle
Lisa M       22        Female     Ateneo
James T      21        Male       UP
Michael Z.   23        Male       TUP

Here is the expected output I need:

name       age  gender  similar_name  on_lasalle  on_ateneo  on_up  on_tup
Michael Z  21   Male    Michael Z.    True        False      False  True
Lisa M     22   Female                False       True       False  False
James T    21   Male                  False       False      True   False

I've been trying to use fuzzywuzzy in my Python script. The data I am getting comes from BigQuery; I convert it to a dataframe to clean some things up, and after that I convert the dataframe to a list of dictionaries.

Notice in the data above that Michael Z. from TUP was appended to Michael Z from Lasalle, since their names have a 100% similarity rate using fuzz.token_set_ratio.

What I want is to find all rows with similar names and append them (including their school) to the dictionary currently being looked at.

Here is the code with the loop that finds similar rows based on name:

from fuzzywuzzy import fuzz

data_dict_list = data_df.to_dict('records')

for x in range(len(data_dict_list)):
    # Start at x + 1 so a row is never compared with itself.
    for y in range(x + 1, len(data_dict_list)):
        if not data_dict_list[y]['is_duplicate']:
            similarity = fuzz.token_set_ratio(data_dict_list[x]['name'], data_dict_list[y]['name'])
            if similarity >= 90:
                data_dict_list[x]['similar_names'].update({'similar_name': data_dict_list[y]['name']})
                ...
                # Mark the matching row so it is skipped later on.
                data_dict_list[y]['is_duplicate'] = True

The runtime of this script is very slow: I sometimes get 100,000+ rows, which means roughly five billion pairwise comparisons with this nested loop.

How can I speed this up?

A pandas-based suggestion would be much appreciated, as I am having a hard time figuring out how to loop over the data in it.

Upvotes: 1

Views: 392

Answers (1)

maxbachmann

Reputation: 3265

As a first step, you can simply replace the fuzzywuzzy import with rapidfuzz:

from rapidfuzz import fuzz

which should already improve the performance quite a bit. You can improve it further by comparing complete lists of strings with rapidfuzz in the following way:

>>> import pandas as pd
>>> from rapidfuzz import process, fuzz
>>> df = pd.DataFrame(data={'name': ['test', 'tests']})
>>> process.cdist(df['name'], df['name'], scorer=fuzz.token_set_ratio, score_cutoff=90)
array([[100,   0],
       [  0, 100]], dtype=uint8)

which returns a matrix of results in which all elements with a score below 90 are set to 0. For large datasets you can enable multithreading using the workers argument:

process.cdist(df['name'], df['name'], workers=-1, scorer=fuzz.token_set_ratio, score_cutoff=90)
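
To go from that score matrix to your expected output, you can take the indices of the matching pairs above the diagonal and merge each matched row into the row it duplicates. Here is a minimal sketch, assuming columns like those in your sample data (the one-hot school columns and the merging logic are illustrative, not part of rapidfuzz):

import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df = pd.DataFrame({
    'name':   ['Michael Z', 'Lisa M', 'James T', 'Michael Z.'],
    'age':    [21, 22, 21, 23],
    'gender': ['Male', 'Female', 'Male', 'Male'],
    'school': ['Lasalle', 'Ateneo', 'UP', 'TUP'],
})

# One-hot encode each row's own school: on_lasalle, on_ateneo, ...
schools = pd.get_dummies(df['school'].str.lower()).add_prefix('on_').astype(bool)
df = pd.concat([df.drop(columns='school'), schools], axis=1)

# All-pairs similarity in one call; scores below 90 come back as 0.
scores = process.cdist(df['name'], df['name'],
                       scorer=fuzz.token_set_ratio,
                       score_cutoff=90, workers=-1)

# Keep only pairs above the diagonal so a row never matches itself
# and each (i, j) pair is seen exactly once.
pairs = np.argwhere(np.triu(scores, k=1) >= 90)

school_cols = [c for c in df.columns if c.startswith('on_')]
df['similar_name'] = ''
duplicates = set()
for i, j in pairs:
    df.loc[i, 'similar_name'] = df.loc[j, 'name']
    # Fold the duplicate's school flags into the row we keep.
    df.loc[i, school_cols] = df.loc[i, school_cols] | df.loc[j, school_cols]
    duplicates.add(j)

result = df.drop(index=list(duplicates)).reset_index(drop=True)
print(result)

Note that the full matrix is the main memory cost here: for 100,000 names it holds 100,000 × 100,000 uint8 scores, i.e. about 10 GB, so on very large inputs you may want to call cdist on chunks of rows rather than the whole column at once.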

Upvotes: 2
