Reputation: 1022
I have this code that matches a word from df2 to df1 column account_name and returns it's associated category. The code flow goes as follows:
>>> import pandas as pd
>>> data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
>>> df1 = pd.DataFrame(data)
>>> df1
account_name category
0 prepaid admin
1 postpaid admin
2 books admin
3 stationary admin
4 software it
5 printer it
6 mouse it
>>> data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses']}
>>> df2 =pd.DataFrame(data2)
>>> df2
account_name
0 stationary costs
1 prepaid expenses
2 postpaid expenses
3 mouse
4 software expenses
>>> d = df1.set_index('account_name')['category'].to_dict()
>>> f = lambda x: next(iter(d[y] for y in x.split() if y in d))
>>> df2['category'] = df2['account_name'].apply(f)
>>> df2
account_name category
0 stationary costs admin
1 prepaid expenses admin
2 postpaid expenses admin
3 mouse it
4 software expenses it
This, however, has some flaws, it breaks down when there is a word in df2 that doesn't exactly have a match in df1['account_name'] for e.g. if I have a word 'rent' in df2['account_name'] it will break. Also, if I have a partial word in df2['account_name'] like "mous" instead of the full word "mouse" then it will also break.
What I would like to do is return the third column in df2 with match percentage which is basically based on the partial or full match word to word (more like fuzzy matching) and return the corresponding category if the match percentage is greater than 70% also a column with a match percentage.
Here's the expected output
>>> df2
account_name category match_perc
stationary costs admin 71
prepaid expenses admin 72
postpaid expenses admin 72
mouse it 100
software expenses it 73
chair n/a 0
table n/a 0
mous it 90
Upvotes: 1
Views: 62
Reputation: 445
Using the code taken from this link
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
df1 = pd.DataFrame(data)
df1
data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses', 'chair', 'table', 'mous']}
df2 =pd.DataFrame(data2)
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
:param df_1: the left table to join
:param df_2: the right table to join
:param key1: key column of the left table
:param key2: key column of the right table
:param threshold: how close the matches should be to return a match, based on Levenshtein distance
:param limit: the amount of matches that will get returned, these are sorted high to low
:return: dataframe with boths keys and matches
"""
s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m
m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
score = df_1['matches'].apply(lambda x: ', '.join([str(i[1]) for i in x if i[1] >= threshold]))
df_1['matches'] = m2
df_1['match_perc'] = score
return df_1
df2 = fuzzy_merge(df2, df1, 'account_name', 'account_name', 70)
df2 = pd.merge(df2, df1[['account_name', 'category']], left_on = 'matches', right_on = 'account_name', how = 'left')
df2 = df2.rename(columns = {'account_name_x':'account_name'}).drop(['account_name_y', 'matches'], axis = 1)
This should work as per your requirement.
>>> df2
account_name match_perc category
0 stationary costs 90 admin
1 prepaid expenses 90 admin
2 postpaid expenses 90 admin
3 mouse 100 it
4 software expenses 90 it
5 chair NaN
6 table NaN
7 mous 89 it
Upvotes: 1