Pyderman

Reputation: 16199

Removing rows in a pandas DataFrame where the row contains a string present in a list?

I know how to remove rows from a single-column ('From') pandas DataFrame where the row contains a given string, e.g. given df and someString:

df = df[~df.From.str.contains(someString)]

Now I want to do something similar, but this time remove any row that contains any of the strings in a list. Were I not using pandas, I would use a for loop with the if ... not in ... approach. But how do I take advantage of pandas' own functionality to achieve this? Given the list of items to remove, ignorethese, read from EMAILS_TO_IGNORE (a file of comma-separated strings), I tried:

with open(EMAILS_TO_IGNORE) as emails:
    ignorethese = emails.read().split(', ')
    df = df[~df.From.isin(ignorethese)]

Am I convoluting matters by first decomposing the file into a list? Given that it is a plain text file of comma-separated values, can I bypass this with something simpler?

Upvotes: 4

Views: 2957

Answers (1)

Anand S Kumar

Reputation: 90899

Series.str.contains supports regular expressions, so you can build a regex from your list of emails to ignore by joining them with | (which ORs them), and then pass that to contains. Example -

df[~df.From.str.contains('|'.join(ignorethese))]

Demo -

In [109]: df
Out[109]:
                                         From
0         Grey Caulfu <[email protected]>
1  Deren Torculas <[email protected]>
2    Charlto Youna <[email protected]>

In [110]: ignorelist = ['[email protected]','[email protected]']

In [111]: ignorere = '|'.join(ignorelist)

In [112]: df[~df.From.str.contains(ignorere)]
Out[112]:
                                       From
2  Charlto Youna <[email protected]>

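Please note that, as mentioned in the documentation, str.contains uses re.search() when the pattern is treated as a regex, so the pattern can match anywhere in the string.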
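Because the joined string is treated as a regex, characters such as . in an email address act as wildcards rather than literals. Below is a minimal sketch of one way to guard against that, assuming the entries are meant to be matched literally. The sample addresses, the raw string standing in for the file contents, and the variable names pattern, mask and filtered are invented for illustration and are not from the question.

```python
import re
import pandas as pd

# Hypothetical sample data; these addresses are invented placeholders.
df = pd.DataFrame({
    "From": [
        "Alice Example <alice@example.com>",
        "Bob Example <bob@example.org>",
        "Carol Example <carol@example.net>",
        "Dave Example <alice@example-com.net>",
    ]
})

# Stand-in for the text read from the comma-separated file (assumption).
raw = "alice@example.com, bob@example.org\n"

# Split on commas and strip whitespace so a trailing newline or an
# inconsistent space after a comma does not end up in the pattern.
ignorethese = [item.strip() for item in raw.split(",") if item.strip()]

# Escape each entry so regex metacharacters such as '.' match literally,
# then OR the escaped entries together.
pattern = "|".join(re.escape(item) for item in ignorethese)

# na=False treats missing values as "no match" so the mask stays boolean.
mask = df["From"].str.contains(pattern, na=False)
filtered = df[~mask]
print(filtered)
```

Without re.escape, the unescaped pattern would also drop the last row here, since . would match the - in example-com. Note also that an empty ignorethese produces an empty pattern, which matches every row, so it may be worth checking for that case before filtering.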

Upvotes: 4
