Reputation: 5463
I have a large dataframe that has several variations of a single word in one of its columns. I'd like to filter rows based on the specific word I'm looking for. A sample dataframe is below. Here, I'd like to keep rows that have the word "create" in the "Resolution" column, but not rows where it only appears as part of a longer word such as "re-create" or "recreate".
Note: I'm only looking for a regex solution that can be used in str.contains.
In [4]: df = pd.DataFrame({"Resolution":["create profile", "recreate profile", "re-create profile", "created profile",
   ...: "re-created profile", "closed outlook and recreated profile", "purged outlook processes and created new profile"],
   ...: "Product":["Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook"]})
In [5]: df
Out[5]:
                                         Resolution  Product
0                                    create profile  Outlook
1                                  recreate profile  Outlook
2                                 re-create profile  Outlook
3                                   created profile  Outlook
4                                re-created profile  Outlook
5              closed outlook and recreated profile  Outlook
6  purged outlook processes and created new profile  Outlook
My attempt:
I have been able to filter on "recreate" and "re-create" (past tense doesn't matter):
In [13]: df[df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
Out[13]:
                             Resolution  Product
1                      recreate profile  Outlook
2                     re-create profile  Outlook
4                    re-created profile  Outlook
5  closed outlook and recreated profile  Outlook
Question: How do I modify the regex to get only the rows where "create" appears as its own word (including "created") and not as part of "recreate" or "re-create"? Something like this:
                                         Resolution  Product
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook
Upvotes: 1
Views: 78
Reputation: 862591
Add ~ to invert the condition:
df = df[~df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
print(df)
                                         Resolution  Product
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook
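If you would rather match "create" directly instead of inverting the other match, one possible approach is a negative lookbehind. The sketch below is not part of the original answer. It assumes that "create" counts as a standalone word whenever it is not immediately preceded by a word character or a hyphen, and it keeps the (?=.*profile) lookahead from the question. The names pattern, mask and result are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Resolution": [
        "create profile",
        "recreate profile",
        "re-create profile",
        "created profile",
        "re-created profile",
        "closed outlook and recreated profile",
        "purged outlook processes and created new profile",
    ],
    "Product": ["Outlook"] * 7,
})

# (?<![\w-])    -> "create" must not be preceded by a word character or a
#                  hyphen, which rules out "recreate" and "re-create"
# (?=.*profile) -> "profile" must appear somewhere after "create"
pattern = r"(?<![\w-])create(?=.*profile)"

# The pattern has no capturing groups, so str.contains takes it as-is
mask = df["Resolution"].str.contains(pattern, regex=True)
result = df[mask]
print(result)
```

On the sample data this selects rows 0, 3 and 6. Unlike the inverted condition, it would also leave out rows that do not mention "create" at all, which may matter on the full dataframe.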
Upvotes: 1