Pankaj Kaundal
Pankaj Kaundal

Reputation: 1022

return matched words with % matched column

I have this code that matches a word from df2 to df1 column account_name and returns it's associated category. The code flow goes as follows:

>>> import pandas as pd
>>> data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
>>> df1 = pd.DataFrame(data)
>>> df1
  account_name category
0      prepaid    admin
1     postpaid    admin
2        books    admin
3   stationary    admin
4     software       it
5      printer       it
6        mouse       it
>>> data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses']}
>>> df2 =pd.DataFrame(data2)
>>> df2
        account_name
0   stationary costs
1   prepaid expenses
2  postpaid expenses
3              mouse
4  software expenses
>>> d = df1.set_index('account_name')['category'].to_dict()
>>> f = lambda x: next(iter(d[y] for y in x.split() if y in d))
>>> df2['category'] = df2['account_name'].apply(f)
>>> df2
        account_name category
0   stationary costs    admin
1   prepaid expenses    admin
2  postpaid expenses    admin
3              mouse       it
4  software expenses       it

This, however, has some flaws, it breaks down when there is a word in df2 that doesn't exactly have a match in df1['account_name'] for e.g. if I have a word 'rent' in df2['account_name'] it will break. Also, if I have a partial word in df2['account_name'] like "mous" instead of the full word "mouse" then it will also break.

What I would like to do is return the third column in df2 with match percentage which is basically based on the partial or full match word to word (more like fuzzy matching) and return the corresponding category if the match percentage is greater than 70% also a column with a match percentage.

Here's the expected output

>>> df2
account_name    category    match_perc
stationary costs    admin   71
prepaid expenses    admin   72
postpaid expenses   admin   72
mouse               it      100
software expenses   it      73
chair               n/a     0
table               n/a     0
mous                it      90

Upvotes: 1

Views: 62

Answers (1)

DumbCoder
DumbCoder

Reputation: 445

Using the code taken from this link

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
df1 = pd.DataFrame(data)
df1

data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses', 'chair', 'table', 'mous']}
df2 =pd.DataFrame(data2)

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with boths keys and matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    score = df_1['matches'].apply(lambda x: ', '.join([str(i[1]) for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    df_1['match_perc'] = score
    return df_1

df2 = fuzzy_merge(df2, df1, 'account_name', 'account_name', 70)
df2 = pd.merge(df2, df1[['account_name', 'category']], left_on = 'matches', right_on = 'account_name', how = 'left')

df2 = df2.rename(columns = {'account_name_x':'account_name'}).drop(['account_name_y', 'matches'], axis = 1)

This should work as per your requirement.

>>> df2
        account_name match_perc category
0   stationary costs         90    admin
1   prepaid expenses         90    admin
2  postpaid expenses         90    admin
3              mouse        100       it
4  software expenses         90       it
5              chair                 NaN
6              table                 NaN
7               mous         89       it

Upvotes: 1

Related Questions