Reputation: 1022
I have two data frames as
data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
df1 = pd.DataFrame(data)
df1>
account_name category
0 prepaid admin
1 postpaid admin
2 books admin
3 stationary admin
4 software it
5 printer it
6 mouse it
data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses']}
df2 =pd.DataFrame(data2)
df2>
account_name
0 stationary costs
1 prepaid expenses
2 postpaid expenses
3 mouse
4 software expenses
What I would like to do is partial match the df2['account_name']
column in the df1['account_name']
column and if there is any match in any of the words in that column then return its corresponding category in df2
. It could be partial and full match both
df2>
account_name category
stationary costs admin
prepaid expenses admin
postpaid expenses admin
dell mouse purchased it
software subs it
Any idea how to do it?
Upvotes: 1
Views: 225
Reputation: 863651
If need mapping multiple words together is possible use Series.str.extract
by keys of dict and matching by Series.map
:
d = df1.set_index('account_name')['category']
pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)
If possible split values by spaces and map each word separately use:
d = df1.set_index('account_name')['category']
f = lambda x: next(iter(d[y] for y in x.split() if y in d))
df2['category'] = df2['account_name'].apply(f)
print (df2)
account_name category
0 stationary costs admin
1 prepaid expenses admin
2 postpaid expenses admin
3 mouse it
4 software expenses it
Testing if return NaN
for not matched values:
data = {'account_name':[ 'postpaid', 'books', 'stationary','software','printer', 'mouse'],
'category':['admin','admin','admin','it','it','it']}
df1 = pd.DataFrame(data)
data2 = {'account_name':['stationary costs', 'prepaid expenses',
'postpaid expenses', 'mouse', 'software expenses']}
df2 =pd.DataFrame(data2)
d = df1.set_index('account_name')['category']
pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)
print (df2)
account_name keyword
0 stationary costs admin
1 prepaid expenses NaN
2 postpaid expenses admin
3 mouse it
4 software expenses it
Upvotes: 1