Reputation: 349
I want to select rows based on whether a column's value contains (i.e., has as a substring) any string from a list of strings.
For example,
values = ['dog', 'cat', 'ant']
df = pd.DataFrame({'col1': ['dog', 'cat', 'fox', 'monkey', 'antelope'], 'col2': [3, 4, 1, 6, 9]})
I know that if I want to compare vs one substring, I can:
df[df['col1'].str.contains('dog')]
And if I knew the full values (as opposed to just a substring), I could do:
df.loc[df['col1'].isin(values)]
However, I'm not sure how to combine the two functions.
I was thinking I could loop over the list:
def func(data):
    for x in values:
        if x in data:
            return True
    return False

df['include'] = df.apply(func)
But this doesn't work (my column is just NaN values), and honestly it seems like there is probably a better way.
Upvotes: 1
Views: 773
Reputation: 11374
A bit late but this should work ;)
df = df[df['col1'].str.contains('|'.join(values), case=False, na=False)]
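To spell out why this works: '|'.join(values) builds the regex 'dog|cat|ant', and str.contains treats its pattern as a regular expression by default, so a row matches if any of the alternatives appears as a substring. Below is a minimal sketch using the data from the question. The re.escape step is my own addition, on the assumption that the list might someday contain regex metacharacters such as '.' or '+'. The names pattern, mask, and filtered are just illustrative. The last line shows what I think the loop in the question was aiming for: calling apply on the column rather than on the whole DataFrame.

```python
import re

import pandas as pd

values = ['dog', 'cat', 'ant']
df = pd.DataFrame({'col1': ['dog', 'cat', 'fox', 'monkey', 'antelope'],
                   'col2': [3, 4, 1, 6, 9]})

# Escape each value so regex metacharacters are matched literally
# (an added precaution, not part of the one-liner above), then join
# with '|' to form the alternation 'dog|cat|ant'.
pattern = '|'.join(re.escape(v) for v in values)

# Boolean mask: True where col1 contains any of the substrings.
# case=False ignores case; na=False treats missing values as non-matches.
mask = df['col1'].str.contains(pattern, case=False, na=False)
filtered = df[mask]

# Loop-style alternative: apply the check to each string in the column.
# Unlike the regex version, this one is case-sensitive.
df['include'] = df['col1'].apply(lambda s: any(v in s for v in values))

print(filtered)
```

Note that 'antelope' is kept because it contains 'ant'. If you only want whole-word matches, you would need a different pattern.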
Upvotes: 0