Reputation: 16199
I know how to remove rows from a single-column ('From') pandas DataFrame where the row contains a string, e.g. given df and someString:
df = df[~df.From.str.contains(someString)]
Now I wish to do something similar, but this time I wish to remove any row that contains any of the strings in a separate list. Were I not using pandas, I would use a for loop and the if ... not ... in approach. But how do I take advantage of pandas' own functionality to achieve this? Given the list of items to remove, ignorethese, extracted from a file of comma-separated strings, EMAILS_TO_IGNORE, I tried:
with open(EMAILS_TO_IGNORE) as emails:
    ignorethese = emails.read().split(', ')
df = df[~df.From.isin(ignorethese)]
Am I convoluting matters by first decomposing the file into a list? Given that it is a plain text file of comma-separated values, can I bypass this with something simpler?
Upvotes: 4
Views: 2957
Reputation: 90899
Series.str.contains supports regular expressions, so you can create a regex from your list of emails to ignore by using | to OR them together, and then pass that to contains. Example -
df[~df.From.str.contains('|'.join(ignorethese))]
Demo -
In [109]: df
Out[109]:
From
0 Grey Caulfu <[email protected]>
1 Deren Torculas <[email protected]>
2 Charlto Youna <[email protected]>
In [110]: ignorelist = ['[email protected]','[email protected]']
In [111]: ignorere = '|'.join(ignorelist)
In [112]: df[~df.From.str.contains(ignorere)]
Out[112]:
From
2 Charlto Youna <[email protected]>
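For contrast, here is a minimal sketch of why the isin attempt from the question leaves the frame unchanged: isin compares each whole cell value against the list, whereas str.contains searches inside the cell. The names and addresses below are made up for illustration and are not from the original data.

```python
import pandas as pd

# Hypothetical sample data (not from the original post).
df = pd.DataFrame({"From": [
    "Alice A <alice@example.com>",
    "Carol C <carol@example.net>",
]})
ignorethese = ["alice@example.com"]

# isin tests whether the entire cell equals one of the list items,
# so a cell holding "Name <address>" never matches a bare address.
exact = df["From"].isin(ignorethese)

# str.contains runs a regex search inside each cell instead.
substring = df["From"].str.contains("|".join(ignorethese))

print(exact.tolist())
print(substring.tolist())
```

isin would only be the right tool if the From column held bare addresses that matched the list entries exactly.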
Please note, as mentioned in the documentation, it uses re.search().
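Because the joined string is treated as a regular expression, characters such as . or + in an address act as regex metacharacters. A hedged sketch of one way to guard against that is to escape each entry with re.escape before joining. The sample frame and the raw string standing in for the file contents are assumptions for illustration only.

```python
import re
import pandas as pd

# Hypothetical sample data; the addresses are made up for illustration.
df = pd.DataFrame({"From": [
    "Alice A <alice@example.com>",
    "Bob B <bob@example.org>",
    "Carol C <carol@example.net>",
]})

# Stand-in for the text read from the EMAILS_TO_IGNORE file.
raw = "alice@example.com, bob@example.org\n"

# Split on commas, strip surrounding whitespace/newlines, drop empty entries.
ignorethese = [s.strip() for s in raw.split(",") if s.strip()]

# Escape each entry so it matches literally, then OR the entries together.
pattern = "|".join(re.escape(s) for s in ignorethese)

filtered = df[~df["From"].str.contains(pattern)]
print(filtered)
```

Stripping each entry also avoids a trailing newline from the file ending up inside the last address. Note that if the list is empty, the pattern is an empty string, which matches every row, so it may be worth skipping the filter in that case.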
Upvotes: 4