Pankaj Kaundal
Pankaj Kaundal

Reputation: 1022

Match a word in pandas data frame column and return its value

I have two data frames as

        data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
    
    df1 = pd.DataFrame(data)
    
        df1>        
    account_name          category
0   prepaid               admin
1   postpaid              admin
2   books                 admin
3   stationary            admin
4   software              it
5   printer               it
6   mouse                 it
    
    data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses']}
    
    df2 =pd.DataFrame(data2)
    
        df2>   
                account_name       
            0   stationary costs    
            1   prepaid expenses    
            2   postpaid expenses   
            3   mouse   
            4   software expenses

What I would like to do is partial match the df2['account_name'] column in the df1['account_name'] column and if there is any match in any of the words in that column then return its corresponding category in df2. It could be partial and full match both

df2>

    account_name            category
    stationary costs        admin
    prepaid expenses        admin
    postpaid expenses       admin
    dell mouse purchased    it
    software subs           it

Any idea how to do it?

Upvotes: 1

Views: 225

Answers (1)

jezrael
jezrael

Reputation: 863651

If need mapping multiple words together is possible use Series.str.extract by keys of dict and matching by Series.map:

d = df1.set_index('account_name')['category']

pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)

If possible split values by spaces and map each word separately use:

d = df1.set_index('account_name')['category']

f = lambda x: next(iter(d[y] for y in x.split() if y in d))
df2['category'] = df2['account_name'].apply(f)
print (df2)
        account_name category
0   stationary costs    admin
1   prepaid expenses    admin
2  postpaid expenses    admin
3              mouse       it
4  software expenses       it

Testing if return NaN for not matched values:

data = {'account_name':[ 'postpaid', 'books', 'stationary','software','printer', 'mouse'],
              'category':['admin','admin','admin','it','it','it']}
    
df1 = pd.DataFrame(data)


data2 = {'account_name':['stationary costs', 'prepaid expenses',
                         'postpaid expenses', 'mouse', 'software expenses']}
    
df2 =pd.DataFrame(data2)


d = df1.set_index('account_name')['category']

pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)
print (df2)
        account_name keyword
0   stationary costs   admin
1   prepaid expenses     NaN
2  postpaid expenses   admin
3              mouse      it
4  software expenses      it

Upvotes: 1

Related Questions