Reputation: 721
Not sure if this is a 'filtering with pandas' question or one of text analysis, however:
Given a df,
d = {
"item": ["a", "b", "c", "d"],
"report": [
"john rode the subway through new york",
"sally says she no longer wanted any fish, but",
"was not submitted",
"the doctor proceeded to call washington and new york",
],
}
df = pd.DataFrame(data=d)
df
Resulting in
item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
c, "was not submitted"
d, "the doctor proceeded to call washington and new york"
And a list of terms to match:
terms = ["new york", "fish"]
How would you reduce the the df to have the following rows, based on whether a substring in terms
is found in column report
and so that item
is preserved?
item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
d, "the doctor proceeded to call washington and new york"
Upvotes: 1
Views: 103
Reputation: 8778
Try this:
Using a word boundary with your regex will ensure that "fish" will get matched, but "fishy" will not (as an example)
m = df['report'].str.contains(r'\b{}\b'.format(r'\b|\b'.join(terms)))
df2 = df.loc[m]
Output:
item report
0 a john rode the subway through new york
1 b sally says she no longer wanted any fish, but
3 d the doctor proceeded to call washington and ne...
Upvotes: 2
Reputation: 25528
Another possible solution, which is based on numpy
:
strings = np.array(df['report'], dtype=str)
substrings = np.array(terms)
index = np.char.find(strings[:, None], substrings)
mask = (index >= 0).any(axis=1)
df.loc[mask]
Output:
item report
0 a john rode the subway through new york
1 b sally says she no longer wanted any fish, but
3 d the doctor proceeded to call washington and ne...
Upvotes: 1
Reputation: 8837
Try this:
df[df['report'].apply(lambda x: any(term in x for term in terms))]
Output:
item report
0 a john rode the subway through new york
1 b sally says she no longer wanted any fish, but
3 d the doctor proceeded to call washington and ne...
Upvotes: 1
Reputation: 3133
Pulling from another answer here:
You can change your terms
into a regex-usable single string (that is, |
delimited) and then use df.Series.str.contains
.
term_str = '|'.join(terms) # makes a string of 'new york|fish'
df[df['report'].str.contains(term_str)]
Upvotes: 1