sudonym

Reputation: 4018

Scalable solution for str.contains with list of strings in pandas

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.

Currently, I do it like this:

reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]

This does not scale to thousands of words. However, when I do:

df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]

I get the error: first argument must be string or compiled pattern.

Following this solution, I tried:

reference_list = "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]

This doesn't raise an exception, but it doesn't match all the words either.

How can I effectively use str.contains with a list of words?

Upvotes: 10

Views: 14797

Answers (1)

cs95

Reputation: 402844

For a scalable solution, do the following -

  1. join the contents of words by the regex OR pipe |
  2. pass this to str.contains
  3. use the result to filter df1

To index the 0th column, don't use df1[0] (as this might be considered ambiguous). It would be better to use loc or iloc (see below).

words = ["words", "to", "remove"]
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]

Note: This will also work if words is a Series.
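
As a self-contained sketch of the three steps above, the following runs the same filter on a small df1 whose rows are made up for illustration. It also adds two things that are not in the snippet above and are my own assumptions about what you may want: re.escape on each keyword, so that any regex metacharacters in a keyword are matched literally, and na=False, so that missing values are treated as non-matches instead of producing NaN in the mask.

```python
import re

import pandas as pd

# Hypothetical sample data, for illustration only.
df1 = pd.DataFrame({0: [
    "some words here",
    "nothing relevant",
    "tomato soup",
    "please remove me",
    "go to town",
]})
words = ["words", "to", "remove"]

# Step 1: escape each keyword and join with the regex OR pipe.
# The non-capturing group keeps the \b word boundaries applied to every alternative.
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, words)))

# Step 2: build a boolean mask; na=False maps missing values to False.
mask = df1.iloc[:, 0].str.contains(pattern, na=False)

# Step 3: keep only the rows that do not match.
filtered = df1[~mask]
print(filtered)
```

Because of the \b word boundaries, "tomato soup" is kept even though it contains the substring "to", while "go to town" is dropped.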


Alternatively, if your 0th column contains single words only (not sentences), you can call isin on that column, which should be faster -

df1 = df1[~df1.iloc[:, 0].isin(words)]
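
A minimal sketch of the isin variant, again on made-up sample data. Note that isin compares whole cell values for exact equality, so it only fits the single-word case described above.

```python
import pandas as pd

# Hypothetical single-word column, for illustration only.
df1 = pd.DataFrame({0: ["words", "keep", "to", "tomato", "remove"]})
words = ["words", "to", "remove"]

# True where the whole cell value equals one of the keywords.
mask = df1.iloc[:, 0].isin(words)

# Keep only the rows whose value is not in the keyword list.
filtered = df1[~mask]
print(filtered)
```

Here "tomato" is kept because it is not equal to any keyword; no regex is involved.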

Upvotes: 19
