Reputation: 25
I want to filter out some rows in a dataframe where the Categories column includes one of the items from my list.
It works with exact matches, but not if there is more than one category associated with a record.
For example:
ID | Categories | Date | Users |
---|---|---|---|
1 | A | 01.01.2020 | alfa |
2 | B | 01.01.2020 | beta |
3 | C | 01.01.2020 | nick |
4 | A, C | 01.01.2020 | ray |
5 | A, B | 01.01.2020 | john |
6 | B, C | 01.01.2020 | smith |
So I tried to use the following code:
exclude_list = ['A', 'C']
df = df[~df['Categories'].isin(exclude_list)]
this gives me
ID | Categories | Date | Users |
---|---|---|---|
2 | B | 01.01.2020 | beta |
4 | A, C | 01.01.2020 | ray |
5 | A, B | 01.01.2020 | john |
6 | B, C | 01.01.2020 | smith |
The rows where both A and C are present are still showing in the DataFrame. I want to show only the rows where neither A nor C appears in Categories.
I also tried variations such as ',A', 'A' and 'A,', but no luck.
Upvotes: 1
Views: 291
Reputation: 862511
Use Series.str.contains
to test by substrings, if that is acceptable:
exclude_list = ['A', 'C']
df1 = df[~df['Categories'].str.contains('|'.join(exclude_list))]
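A minimal runnable sketch of this approach, rebuilding the sample DataFrame from the table in the question (the joined pattern `'A|C'` is treated as a regex, so this works as long as the labels contain no regex special characters):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Categories": ["A", "B", "C", "A, C", "A, B", "B, C"],
    "Date": ["01.01.2020"] * 6,
    "Users": ["alfa", "beta", "nick", "ray", "john", "smith"],
})

exclude_list = ['A', 'C']
# Keep rows whose Categories string contains none of the excluded labels
df1 = df[~df['Categories'].str.contains('|'.join(exclude_list))]
print(df1)
```

Only the row with Categories == 'B' survives.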
Another idea is to explode the values split by ', '
and test with isin,
then aggregate with any
to flag indices where at least one value matches:
exclude_list = ['A', 'C']
df1 = df[~df['Categories'].str.split(', ').explode().isin(exclude_list).groupby(level=0).any()]
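A runnable sketch of the explode idea with the same sample data; `explode` keeps the original index on every split value, so grouping by `level=0` collapses the boolean mask back to one value per original row (in older pandas the same aggregation was written `.any(level=0)`, which has since been removed):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Categories": ["A", "B", "C", "A, C", "A, B", "B, C"],
    "Date": ["01.01.2020"] * 6,
    "Users": ["alfa", "beta", "nick", "ray", "john", "smith"],
})

exclude_list = ['A', 'C']
# Split each string into a list, explode to one row per category,
# test membership, then collapse back to one boolean per original index
mask = (df['Categories'].str.split(', ')
          .explode()
          .isin(exclude_list)
          .groupby(level=0)
          .any())
df1 = df[~mask]
print(df1)
```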
A last possible solution, with apply, so maybe slower on real data:
df1 = df[~df['Categories'].apply(lambda x: any([y in exclude_list for y in x.split(', ')]))]
print (df1)
ID Categories Date Users
1 2 B 01.01.2020 beta
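One caveat worth noting: str.contains matches substrings, so excluding 'A' would also hit a hypothetical multi-letter category such as 'AB'. The split-based variants above avoid that. As an alternative sketch (not from the original answer), set disjointness expresses the same exact-token test compactly:

```python
import pandas as pd

# Sample data from the question (columns trimmed for brevity)
df = pd.DataFrame({
    "Categories": ["A", "B", "C", "A, C", "A, B", "B, C"],
    "Users": ["alfa", "beta", "nick", "ray", "john", "smith"],
})

exclude = {'A', 'C'}
# Keep rows whose split categories share no element with the exclude set
df1 = df[df['Categories'].str.split(', ').map(lambda cats: exclude.isdisjoint(cats))]
print(df1)
```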
Upvotes: 3