Reputation: 5143
I've done some searching and can't figure out how to filter a dataframe by
df["col"].str.contains(word)
However, I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame method?
Upvotes: 292
Views: 449696
Reputation: 21
Another simple way is to use the query method combined with an f-string and not:
df.query(f"not {col}.str.contains('{word}')")
where col holds the column name and word holds the pattern.
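A minimal runnable sketch of this approach; the sample frame is my assumption, and note that str accessors inside query() generally require the python engine:
import pandas as pd

df = pd.DataFrame({"col": ["apple pie", "banana", "cherry pie"]})
col, word = "col", "pie"

# Keep only rows whose "col" does NOT contain "pie".
result = df.query(f"not {col}.str.contains('{word}')", engine="python")
print(result)  # only the "banana" row remains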
Upvotes: 2
Reputation: 11
To add clarity to the top answer, the general pattern for dropping all columns whose names contain a specific word is:
# Remove any column with "word" in the name
new_df = df.loc[:, ~df.columns.str.contains("word")]
# Filter multiple words
new_df = df.loc[:, ~df.columns.str.contains("word1|word2")]
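A quick illustrative check of the column-name filter; the sample frame is an assumption:
import pandas as pd

df = pd.DataFrame({"password": [1], "password_hash": [2], "username": [3]})

# Drop every column whose *name* contains "password".
new_df = df.loc[:, ~df.columns.str.contains("password")]
print(new_df.columns.tolist())  # ['username']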
Upvotes: 1
Reputation: 375565
You can use the invert (~) operator, which acts like a not for boolean data:
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by the right-hand side.
contains also accepts a regular expression...
If the above throws a ValueError or TypeError, it is likely because you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
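A minimal runnable demonstration of the ~ operator and na=False together; the sample data is an illustrative assumption:
import pandas as pd

df = pd.DataFrame({"col": ["foo bar", "baz", None, "bar qux"]})

# na=False makes the None/NaN cell count as "no match",
# so the inverted mask keeps that row instead of raising an error.
print(df[~df["col"].str.contains("bar", na=False)])
# the "baz" and None rows remain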
Upvotes: 556
Reputation: 490
Somehow .contains didn't work for me, but .isin, as mentioned by @kenan in the answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), did. Additionally, if you want to scan the entire dataframe and remove rows that contain a specific word (or set of words) in any column, just use the loop below:
for col in df.columns:
    df = df[~df[col].isin(['string', 'or a comma-separated list of strings'])]
Just remove ~ to instead keep only the rows that contain the word.
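A compact runnable version of that loop on made-up data. Note that isin matches whole cell values exactly, not substrings:
import pandas as pd

df = pd.DataFrame({"a": ["drop", "keep", "keep"],
                   "b": ["keep", "drop", "keep"]})

# Remove any row where some column's value is exactly "drop".
for col in df.columns:
    df = df[~df[col].isin(["drop"])]
print(df)  # only the last row survives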
Upvotes: 2
Reputation: 49
To complement the above question, if someone wants to remove all the rows with strings, one could do:
df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
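A brief illustrative sketch on mixed-type data; the sample values are assumptions:
import pandas as pd

df = pd.DataFrame({"col_name": [1, "text", 2.5, "more text"]})

# Keep only rows whose cell is NOT a Python str.
df_new = df[~df['col_name'].apply(lambda x: isinstance(x, str))]
print(df_new)  # the rows holding 1 and 2.5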
Upvotes: 1
Reputation: 1614
The answers above should cover most cases; I am adding a framework to find multiple words and filter them out of the DataFrame.
Here:
'word1','word2','word3','word4' = list of patterns to search
df = DataFrame
column_a = a column name from DataFrame df
values_to_remove = ['word1','word2','word3','word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
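One caveat worth a sketch: str.contains treats the joined pattern as a regex, so patterns containing metacharacters should be escaped first. The sample data below is an assumption:
import re
import pandas as pd

df = pd.DataFrame({"column_a": ["price (usd)", "plain text", "WORD2 here"]})
values_to_remove = ["(usd)", "word2"]

# re.escape protects literal strings that contain regex metacharacters.
pattern = '|'.join(re.escape(v) for v in values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
print(result)  # only "plain text" remains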
Upvotes: 21
Reputation: 2310
To negate your query, use ~. Using query has the advantage of returning the valid observations of df directly:
df.query('~col.str.contains("word").values')
Upvotes: 10
Reputation: 1054
You can use apply and a lambda:
df[df["col"].apply(lambda x: word not in x)]
Or if you want to define a more complex rule, you can combine conditions with and:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
Upvotes: 22
Reputation: 1497
I was having trouble with the not (~) operator as well, so here's another way from another StackOverflow thread:
df[df["col"].str.contains('this|that') == False]
(Note that contains returns NaN for missing values, and NaN == False is False, so this filter drops rows with missing values as well.)
Upvotes: 94
Reputation: 71580
In addition to nanselm2's answer, you can use 0 instead of False:
df["col"].str.contains(word) == 0
Upvotes: 3
Reputation: 2853
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df
    first  second   third
0  myword  myword     NaN
1  myword     NaN  myword
2  myword  myword     NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first, then retried the command with no problem.
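A short sketch of the fillna workaround on the frame above (na=False, as in the accepted answer, achieves the same result):
# Replace NaN with an empty string so str.contains returns a clean boolean mask.
mask = ~df["second"].fillna("").str.contains("myword")
print(df[mask])  # row 1, whose "second" cell was NaN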
Upvotes: 8