Verify if dataframe contains content in dict of lists

Question

I realize the title is a bit fuzzy but let me illustrate.

I have a dataframe that (simplified) looks like this:

Date | transaction | amount

01-01-2020 | something keywordA something | 50

01-01-2020 | something something keywordB | 152

02-01-2020 | keywordA something else | 200

etc

Then the first thing I do is add an empty column called 'category' like this:

df['Category'] = ''

Now I then also have a dict of lists like this:

categories={'category A':['keywordA','keywordB'], 'category B':['keywordC']}

etc

Now what I want to do is see if the transaction column for each row contains one of the keywords and if it contains eg. keywordA I want the category column to be filled in with 'category A'.

So first of all I am not sure if using a dict of lists is the way to go here. Secondly I could probably go row by row, do a contains type function and compare to every item in the dict but from what I have read iterating over each row is normally bad practice when you are using pandas and just intuitively this will scale very badly.

So.. any hints as to what direction I should be looking at ?

anky · Accepted Answer

From what i understand , you can do this with series.str.extract and series.map after reversing the dictionary:

d = {a:k for k,v in categories.items() for a in v}
#{'keywordA': 'category A', 'keywordB': 'category A', 'keywordC': 'category B'}
pat = r'\b(?:{})\b'.format('|'.join(d.keys()))
df['Category'] = df['transaction'].str.extract('('+pat+')',expand=False).map(d)

print(df)

        Date                   transaction  amount    Category
0 2020-01-01  something keywordA something      50  category A
1 2020-01-01  something something keywordB     152  category A
2 2020-02-01       keywordA something else     200  category A

Verify if dataframe contains content in dict of lists

Answers (1)

Related Questions