Filter a specific word (with variations) in a Pandas Series

Question

I have a large dataframe in which one column contains several variations of a single word. I'd like to filter rows based on the specific word I'm looking for. A sample dataframe is shown below. Here, I'd like to keep the rows that have the word "create" in the "Resolution" column, but not the rows where it only appears as part of a longer word such as "re-create" or "recreate".

Note: I'm only looking for a regex solution that can be used with str.contains.

In [4]: df = pd.DataFrame({"Resolution":["create profile", "recreate profile", "re-create profile", "created profile",
   ...: "re-created profile", "closed outlook and recreated profile", "purged outlook processes and created new profile"],
   ...: "Product":["Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook"]})

In [5]: df
Out[5]:
                                         Resolution  Product
0                                    create profile  Outlook
1                                  recreate profile  Outlook
2                                 re-create profile  Outlook
3                                   created profile  Outlook
4                                re-created profile  Outlook
5              closed outlook and recreated profile  Outlook
6  purged outlook processes and created new profile  Outlook

My attempt:

I have been able to filter on "recreate" and "re-create" (tense doesn't matter, so "recreated" and "re-created" should match too):

In [13]: df[df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
Out[13]:
                             Resolution  Product
1                      recreate profile  Outlook
2                     re-create profile  Outlook
4                    re-created profile  Outlook
5  closed outlook and recreated profile  Outlook

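Question: How do I modify the regex so that it returns only the rows containing "create" as a word of its own (including "created"), and not the rows where it is part of "recreate" or "re-create"? The expected result is something like this: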

                                         Resolution  Product
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook

jezrael · Accepted Answer

Add ~ to invert the condition:

df = df[~df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
print (df)
                                         Resolution  Product
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook
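
Note that inverting the mask keeps every row that does not match the "recreate"/"re-create" pattern, including rows that never mention "create" at all. If a positive match is preferred, one possible alternative (a sketch, not part of the answer above) is a negative lookbehind that rejects "create" when it is directly preceded by a word character or a hyphen. The pattern below is an assumption about what counts as a variation, and unlike the pattern in the question it requires "profile" to appear after "create":

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Resolution": [
        "create profile",
        "recreate profile",
        "re-create profile",
        "created profile",
        "re-created profile",
        "closed outlook and recreated profile",
        "purged outlook processes and created new profile",
    ],
    "Product": ["Outlook"] * 7,
})

# (?<![\w-])    -> "create" must not be directly preceded by a word
#                  character or a hyphen (rules out "recreate", "re-create")
# (?=.*profile) -> "profile" must appear somewhere after "create"
# This pattern is an illustrative assumption, not the answer's method.
pattern = r"(?<![\w-])create(?=.*profile)"

mask = df["Resolution"].str.contains(pattern, regex=True)
result = df[mask]
print(result)
```

With the sample data this should keep rows 0, 3 and 6. A plain \bcreate would not be enough here, because the hyphen in "re-create" is a non-word character and so a word boundary exists right before "create".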
