pandas if full string contained in another pandas dataframe

Question

I'd like to categorise parts using dataframes.

Simplifying the problem to try and show the issue:

data = {'col1': ['engine','blue engine cover','spark plug',
        'rear panel','black rear panel', 'blue engine']}
desc_df = pd.DataFrame(data=data)

catg = {'bodywork': ['engine cover','side panel','rear panel'],'underhood':['engine','spark plug','oil filter'],
   'Glass':['Windscreen','window','demister']}

catg_df = pd.DataFrame(data=catg)

catg_df


   Glass         bodywork       underhood
0 Windscreen     engine cover   engine 
1 window         side panel     spark plug 
2 demister       rear panel     oil filter 

desc_df

     col1
0   engine 
1 blue engine cover 
2 spark plug 
3 rear panel 
4 black rear panel 
5 blue engine

I would like to end up with :

  col1                Category
0 engine              underhood 
1 blue engine cover   underhood 
2 spark plug          underhood 
3 rear panel          bodywork 
4 black rear panel    bodywork 
5 blue engine         underhood

The closest I have come up with is:

d=catg_df.apply('|'.join).to_dict()

desc_df['Category'] = desc_df['col1'].apply(lambda x : ''.join([z if pd.Series(x).str.contains(y).values else '' for z,y in d.items()]))

But I end up with finding both "engine" and "engine cover" in the string: desc_df

col1                   Category
0 engine              underhood 
1 blue engine cover   bodyworkunderhood 
2 spark plug          underhood 
3 rear panel          bodywork 
4 black rear panel    bodywork 
5 blue engine         underhood

Is there some method I could use to perhaps if it finds "engine Cover" first then categorises using this category and does not move onto "engine".

jpp · Accepted Answer

You can solve this problem by iterating your dictionary:

from collections import OrderedDict

d = OrderedDict([(k, '|'.join(catg_df[k].tolist())) for k in catg_df.columns[::-1]])

for k, v in d.items():
    desc_df.loc[desc_df['col1'].str.contains(v), 'Category'] = k

Result

print(desc_df)

                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood

Explanation

For each item in your dictionary, check your str.contains condition versus regex value and assign key to 'Category' column.
Use collections.OrderedDict to give priority to columns.
In this case, it is possible to reverse the iteration order of columns in construction of d.

pandas if full string contained in another pandas dataframe

Answers (2)

Related Questions