Reputation: 79
I'd like to categorise parts using dataframes.
Simplifying the problem to try and show the issue:
data = {'col1': ['engine','blue engine cover','spark plug',
'rear panel','black rear panel', 'blue engine']}
desc_df = pd.DataFrame(data=data)
catg = {'bodywork': ['engine cover','side panel','rear panel'],'underhood':['engine','spark plug','oil filter'],
'Glass':['Windscreen','window','demister']}
catg_df = pd.DataFrame(data=catg)
catg_df
Glass bodywork underhood
0 Windscreen engine cover engine
1 window side panel spark plug
2 demister rear panel oil filter
desc_df
col1
0 engine
1 blue engine cover
2 spark plug
3 rear panel
4 black rear panel
5 blue engine
I would like to end up with :
col1 Category
0 engine underhood
1 blue engine cover underhood
2 spark plug underhood
3 rear panel bodywork
4 black rear panel bodywork
5 blue engine underhood
The closest I have come up with is:
d=catg_df.apply('|'.join).to_dict()
desc_df['Category'] = desc_df['col1'].apply(lambda x : ''.join([z if pd.Series(x).str.contains(y).values else '' for z,y in d.items()]))
But I end up with finding both "engine" and "engine cover" in the string: desc_df
col1 Category
0 engine underhood
1 blue engine cover bodyworkunderhood
2 spark plug underhood
3 rear panel bodywork
4 black rear panel bodywork
5 blue engine underhood
Is there some method I could use to perhaps if it finds "engine Cover" first then categorises using this category and does not move onto "engine".
Upvotes: 4
Views: 80
Reputation: 18208
One way may be to use difflib
to get closest value and lambda
:
First creating a mapper:
from difflib import get_close_matches
mapper = {val:k for k, v in catg_df.to_dict('list').items() for val in v}
print(mapper)
So, mapper would be as:
{'Windscreen': 'Glass',
'demister': 'Glass',
'engine': 'underhood',
'engine cover': 'bodywork',
'oil filter': 'underhood',
'rear panel': 'bodywork',
'side panel': 'bodywork',
'spark plug': 'underhood',
'window': 'Glass'}
Now, using lambda
with difflib
to find the closest value:
# avoid calling mapper.keys() in lambda
keys = mapper.keys()
desc_df['Category'] = desc_df['col1'].apply(lambda row: mapper[get_close_matches(row, keys)[0]])
Result:
col1 Category
0 engine underhood
1 blue engine cover bodywork
2 spark plug underhood
3 rear panel bodywork
4 black rear panel bodywork
5 blue engine underhood
Upvotes: 3
Reputation: 164623
You can solve this problem by iterating your dictionary:
from collections import OrderedDict
d = OrderedDict([(k, '|'.join(catg_df[k].tolist())) for k in catg_df.columns[::-1]])
for k, v in d.items():
desc_df.loc[desc_df['col1'].str.contains(v), 'Category'] = k
Result
print(desc_df)
col1 Category
0 engine underhood
1 blue engine cover bodywork
2 spark plug underhood
3 rear panel bodywork
4 black rear panel bodywork
5 blue engine underhood
Explanation
str.contains
condition versus regex value and assign key to 'Category' column.collections.OrderedDict
to give priority to columns.d
.Upvotes: 2