Benjamin
Benjamin

Reputation: 721

Filtering a pandas dataframe based presence of substrings in column

Not sure if this is a 'filtering with pandas' question or one of text analysis, however:

Given a df,

d = {
    "item": ["a", "b", "c", "d"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
}
df = pd.DataFrame(data=d)
df

Resulting in

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
c, "was not submitted"
d, "the doctor proceeded to call washington and new york"

And a list of terms to match:

terms = ["new york", "fish"]

How would you reduce the the df to have the following rows, based on whether a substring in terms is found in column report and so that item is preserved?

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
d, "the doctor proceeded to call washington and new york"

Upvotes: 1

Views: 103

Answers (4)

rhug123
rhug123

Reputation: 8778

Try this:

Using a word boundary with your regex will ensure that "fish" will get matched, but "fishy" will not (as an example)

m = df['report'].str.contains(r'\b{}\b'.format(r'\b|\b'.join(terms)))

df2 = df.loc[m]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...

Upvotes: 2

PaulS
PaulS

Reputation: 25528

Another possible solution, which is based on numpy:

strings = np.array(df['report'], dtype=str)
substrings = np.array(terms)

index = np.char.find(strings[:, None], substrings)
mask = (index >= 0).any(axis=1)

df.loc[mask]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...

Upvotes: 1

Sunderam Dubey
Sunderam Dubey

Reputation: 8837

Try this:

df[df['report'].apply(lambda x: any(term in x for term in terms))]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...

Upvotes: 1

scotscotmcc
scotscotmcc

Reputation: 3133

Pulling from another answer here:

You can change your terms into a regex-usable single string (that is, | delimited) and then use df.Series.str.contains.

term_str = '|'.join(terms) # makes a string of 'new york|fish'
df[df['report'].str.contains(term_str)]

Upvotes: 1

Related Questions