Reputation: 21625
I have a dataframe with two columns: foo, which contains a string of text, and bar, which contains a search term string. For each row in my dataframe, I want to check whether the search term appears in the text string, with word boundaries.
For example:
import pandas as pd
import numpy as np
import re
df = pd.DataFrame({'foo':["the dog is blue", "the cat isn't orange"], 'bar':['dog', 'cat is']})
df
      bar                   foo
0     dog       the dog is blue
1  cat is  the cat isn't orange
Essentially, I want to vectorize the following operations:
re.search(r"\bdog\b", "the dog is blue") is not None # True
re.search(r"\bcat is\b", "the cat isn't orange") is not None # False
What's a fast way to do this, considering I'm working with a few hundred thousand rows? I tried using the str.contains method but couldn't quite get it.
Upvotes: 0
Views: 643
Reputation: 8449
df.apply(lambda x: re.search(r'\b{0}\b'.format(x.bar), x.foo) is not None, axis='columns')
df.apply applies a function along the rows or columns of a DataFrame; with axis='columns' the function is called once per row. See the documentation for more: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
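One caveat with building the pattern directly from the search term: if the term can contain regex metacharacters (such as "."), they will be interpreted as regex syntax. A minimal sketch of the same row-wise approach with re.escape applied first; the helper name contains_term and the third sample row are made up for illustration:

```python
import re
import pandas as pd

# The first two rows come from the question; the third is a hypothetical
# row added to show a search term containing a regex metacharacter (".").
df = pd.DataFrame({
    'foo': ["the dog is blue", "the cat isn't orange", "build 315 passed"],
    'bar': ['dog', 'cat is', '3.5'],
})

def contains_term(row):
    # re.escape makes the search term match literally, so "3.5" no longer
    # matches "315" through the "any character" meaning of ".".
    pattern = r'\b{0}\b'.format(re.escape(row.bar))
    return re.search(pattern, row.foo) is not None

result = df.apply(contains_term, axis='columns')
print(result.tolist())
```

Without re.escape, the third row would be reported as a match, because the unescaped pattern \b3.5\b matches "315".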
Upvotes: 1
Reputation: 85442
You can apply your function to each row:
df.apply(lambda x: re.search(r'\b' + x.bar + r'\b', x.foo) is not None, axis=1)
Result:
0 True
1 False
dtype: bool
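Since the question mentions a few hundred thousand rows, it may be worth trying a plain list comprehension over the two columns instead of a row-wise apply. This is only a sketch, not a benchmarked result: it avoids constructing a Series object for every row, which is usually where apply with axis=1 spends much of its time. The column name match is an arbitrary choice, and re.escape is added on the assumption that search terms should be matched literally.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'foo': ["the dog is blue", "the cat isn't orange"],
    'bar': ['dog', 'cat is'],
})

# Iterate over the two columns in parallel and run one regex search per row.
# re.escape treats the search term as literal text (an assumption about intent).
df['match'] = [
    re.search(r'\b' + re.escape(term) + r'\b', text) is not None
    for term, text in zip(df['bar'], df['foo'])
]

print(df['match'].tolist())
```

Timing both versions on a sample of the real data is the safest way to confirm which one is faster for a given dataset.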
Upvotes: 1