Reputation: 329
I realize the title is a bit fuzzy but let me illustrate.
I have a dataframe that (simplified) looks like this:
Date | transaction | amount
01-01-2020 | something keywordA something | 50
01-01-2020 | something something keywordB | 152
02-01-2020 | keywordA something else | 200
etc
Then the first thing I do is add an empty column called 'category' like this:
df['Category'] = ''
Now I then also have a dict of lists like this:
categories={'category A':['keywordA','keywordB'], 'category B':['keywordC']}
etc
Now what I want to do is see if the transaction column for each row contains one of the keywords and if it contains eg. keywordA I want the category column to be filled in with 'category A'.
So first of all I am not sure if using a dict of lists is the way to go here. Secondly I could probably go row by row, do a contains type function and compare to every item in the dict but from what I have read iterating over each row is normally bad practice when you are using pandas and just intuitively this will scale very badly.
So.. any hints as to what direction I should be looking at ?
Upvotes: 3
Views: 61
Reputation: 75080
From what i understand , you can do this with series.str.extract
and series.map
after reversing the dictionary:
d = {a:k for k,v in categories.items() for a in v}
#{'keywordA': 'category A', 'keywordB': 'category A', 'keywordC': 'category B'}
pat = r'\b(?:{})\b'.format('|'.join(d.keys()))
df['Category'] = df['transaction'].str.extract('('+pat+')',expand=False).map(d)
print(df)
Date transaction amount Category
0 2020-01-01 something keywordA something 50 category A
1 2020-01-01 something something keywordB 152 category A
2 2020-02-01 keywordA something else 200 category A
Upvotes: 5