Reputation: 35
I have two Pandas DataFrames (person names), one small (200+ rows) and another one pretty big (100k+ rows). They both have similar header but the big one has an unique ID too, as following:
Small: LST_NM, FRST_NM, CITY
Big: LST_NM, FRST_NM, CITY, UNIQUE_ID
Small: df2 = pd.DataFrame([['Doe','John','New York'], ['Obama', 'Barack', 'New York']], columns = ['FRST_NM', 'LST_NM', 'CITY_NM'])
Big: df = pd.DataFrame([['Doe','John','New York', 'N1'], ['Obama', 'Barack Hussein', 'New York', 'N2'], ['Obama', 'Michelle', 'Chicago', 'N3'], ['Trump', 'Donald', 'New York', 'N4']], columns = ['FRST_NM', 'LST_NM', 'CITY_NM', 'UNIQUE_ID'])
I use the below code:
import itertools
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import time
import pandas as pd
import multiprocessing as mp
from unidecode import unidecode
import re
#read the CSV files;
df = pd.read_csv("BIG.csv", encoding="utf-8")
df2 = pd.read_csv("SMALL.csv", encoding="utf-8")
#create function to clean the columns
def clean_column(column):
column = unidecode(column)
column = re.sub('\n', ' ', column)
column = re.sub('-', ' ', column)
column = re.sub('/', ' ', column)
column = re.sub("'", '', column)
column = re.sub(",", '', column)
column = re.sub(":", ' ', column)
column = re.sub(' +', ' ', column)
column = column.strip().strip('"').strip("'").lower().strip()
if not column :
column = None
return column
#Normalize, create FULL_NM by combining FRST_NM / LST_NM and then create MIN_CITY as the first three chars from CITY_NM:
df['FULL_NM'] = (df['LST_NM'] + ' ' + df['FRST_NM']).apply(lambda x: fuzz._process_and_sort(clean_column(x), True, True))
df['MIN_CITY'] = (df['CITY_NM']).astype(str).apply(lambda x: clean_column(x[:3]))
df2['FULL_NM'] = (df2['LST_NM'] + ' ' + df2['FRST_NM']).apply(lambda x: fuzz._process_and_sort(clean_column(x), True, True))
df2['MIN_CITY'] = (df2['CITY_NM']).astype(str).apply(lambda x: clean_column(x[:3]))
#create match1 function; it uses the FULL_NM as lookup field
def match1(x, choices, scorer, cutoff):
match = process.extractOne(x['FULL_NM'], choices=choices.loc[choices['MIN_CITY'] == x['MIN_CITY'],'FULL_NM'],
scorer=scorer,
score_cutoff=cutoff)
if match:
return match[0]
#and finally... create the MATCH_NM column by applying match1 function as following:
df2['MATCH_NAME'] = df2.apply(match1, args=(df, fuzz.token_set_ratio, 80), axis=1)
I want to lookup info from the big one into the small one, bringing the UNIQUE_ID. To speed up the process, I created smaller chunks (using first three letters of City). This new column (created in both DataFrames) is named MIN_CITY.
The code above works well but it brings only the matched name (MATCH_NAME). I don't want to reverse (bring from Small into Big and then filter). How could I get the UNIQUE_ID from process.ExtractOne()? I need to mention that I am very new into Python / Pandas / FuzzyWuzzy.
Upvotes: 1
Views: 1820
Reputation: 5545
Please look into fuzzymatcher:
from fuzzymatcher import link_table, fuzzy_left_join
left_on = ['FULL_NM']
right_on = ['FULL_NM']
fuzzy_left_join(df2, df, left_on, right_on)
Gives a table:
best_match_score __id_left __id_right FRST_NM_left LST_NM_left CITY_NM_left FULL_NM_left MIN_CITY_left MATCH_NAME FRST_NM_right LST_NM_right CITY_NM_right UNIQUE_ID FULL_NM_right MIN_CITY_right
0 0.167755 0_left 0_right Doe John New York doe john new doe john Doe John New York N1 doe john new
1 0.081166 1_left 1_right Obama Barack New York barack obama new barack hussein obama Obama Barack Hussein New York N2 barack hussein obama new
Here are some examples.
If you have to insist on using fuzzy_wuzzy.process.extractOne, you can match and find the unique_id from the matched name like:
def match1(x, choices, scorer, cutoff):
match = process.extractOne(x['FULL_NM'], choices=choices.loc[choices['MIN_CITY'] == x['MIN_CITY'],'FULL_NM'],
scorer=scorer,
score_cutoff=cutoff)
if match:
return choices[choices['FULL_NM'] == 'doe john'].UNIQUE_ID[0]
Upvotes: 2