c_c
c_c

Reputation: 35

Python: Return Pandas DataFrame from FuzzyWuzzy ExtractOne()

I have two Pandas DataFrames (person names), one small (200+ rows) and another one pretty big (100k+ rows). They both have similar header but the big one has an unique ID too, as following:

Small: LST_NM, FRST_NM, CITY
Big: LST_NM, FRST_NM, CITY, UNIQUE_ID

Small: df2 = pd.DataFrame([['Doe','John','New York'], ['Obama', 'Barack', 'New York']], columns = ['FRST_NM', 'LST_NM', 'CITY_NM'])

Small

Big: df = pd.DataFrame([['Doe','John','New York', 'N1'], ['Obama', 'Barack Hussein', 'New York', 'N2'], ['Obama', 'Michelle', 'Chicago', 'N3'], ['Trump', 'Donald', 'New York', 'N4']], columns = ['FRST_NM', 'LST_NM', 'CITY_NM', 'UNIQUE_ID'])

Big

I use the below code:

import itertools
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import time
import pandas as pd
import multiprocessing as mp
from unidecode import unidecode
import re

#read the CSV files;
df = pd.read_csv("BIG.csv", encoding="utf-8")
df2 = pd.read_csv("SMALL.csv", encoding="utf-8")

#create function to clean the columns
def clean_column(column):
    column = unidecode(column)
    column = re.sub('\n', ' ', column)
    column = re.sub('-', ' ', column)
    column = re.sub('/', ' ', column)
    column = re.sub("'", '', column)
    column = re.sub(",", '', column)
    column = re.sub(":", ' ', column)
    column = re.sub('  +', ' ', column)
    column = column.strip().strip('"').strip("'").lower().strip()
    if not column :
        column = None
    return column

#Normalize, create FULL_NM by combining FRST_NM / LST_NM and then create MIN_CITY as the first three chars from CITY_NM:
df['FULL_NM'] = (df['LST_NM'] + ' ' + df['FRST_NM']).apply(lambda x: fuzz._process_and_sort(clean_column(x), True, True))
df['MIN_CITY'] = (df['CITY_NM']).astype(str).apply(lambda x: clean_column(x[:3]))

df2['FULL_NM'] = (df2['LST_NM'] + ' ' + df2['FRST_NM']).apply(lambda x: fuzz._process_and_sort(clean_column(x), True, True))
df2['MIN_CITY'] = (df2['CITY_NM']).astype(str).apply(lambda x: clean_column(x[:3]))

#create match1 function; it uses the FULL_NM as lookup field
def match1(x, choices, scorer, cutoff):
        match = process.extractOne(x['FULL_NM'], choices=choices.loc[choices['MIN_CITY'] == x['MIN_CITY'],'FULL_NM'], 
                                   scorer=scorer, 
                                   score_cutoff=cutoff)
        if match:
            return match[0]

#and finally... create the MATCH_NM column by applying match1 function as following:
df2['MATCH_NAME'] = df2.apply(match1, args=(df, fuzz.token_set_ratio, 80), axis=1)

I want to lookup info from the big one into the small one, bringing the UNIQUE_ID. To speed up the process, I created smaller chunks (using first three letters of City). This new column (created in both DataFrames) is named MIN_CITY.

The code above works well but it brings only the matched name (MATCH_NAME). I don't want to reverse (bring from Small into Big and then filter). How could I get the UNIQUE_ID from process.ExtractOne()? I need to mention that I am very new into Python / Pandas / FuzzyWuzzy.

Upvotes: 1

Views: 1820

Answers (1)

Nihal Sangeeth
Nihal Sangeeth

Reputation: 5545

Please look into fuzzymatcher:

from fuzzymatcher import link_table, fuzzy_left_join

left_on = ['FULL_NM']
right_on = ['FULL_NM']

fuzzy_left_join(df2, df, left_on, right_on)

Gives a table:

best_match_score    __id_left   __id_right  FRST_NM_left    LST_NM_left CITY_NM_left    FULL_NM_left    MIN_CITY_left   MATCH_NAME  FRST_NM_right   LST_NM_right    CITY_NM_right   UNIQUE_ID   FULL_NM_right   MIN_CITY_right
0   0.167755    0_left  0_right Doe John    New York    doe john    new doe john    Doe John    New York    N1  doe john    new
1   0.081166    1_left  1_right Obama   Barack  New York    barack obama    new barack hussein obama    Obama   Barack Hussein  New York    N2  barack hussein obama    new

Here are some examples.

If you have to insist on using fuzzy_wuzzy.process.extractOne, you can match and find the unique_id from the matched name like:

def match1(x, choices, scorer, cutoff):
        match = process.extractOne(x['FULL_NM'], choices=choices.loc[choices['MIN_CITY'] == x['MIN_CITY'],'FULL_NM'], 
                                   scorer=scorer, 
                                   score_cutoff=cutoff)
        if match:
            return choices[choices['FULL_NM'] == 'doe john'].UNIQUE_ID[0]

Upvotes: 2

Related Questions