user13748410

Reputation: 13

Fastest way to find a word in a pandas dataframe column

I have a dataframe like this:

name  sentence
Tom   The cat is on the table.
Bob   One might say that caterpillars are majestic

I want to get as a result a dataframe like this:

name  sentence                                      contains_cat
Tom   The cat is on the table.                      True
Bob   One might say that caterpillars are majestic  False

The column "contains_cat" should be True only if the corresponding row of the "sentence" column contains cat as a whole word (not as part of caterpillar, for example).

I wrote code that does this by searching for strings such as " cat " or " cat.". Is it possible to speed this up, given that I want to do this for large dataframes and for many words?

import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
              'sentence': ['The cat is on the table.', 'One might say that caterpillars are majestic']})
df['contains_cat'] = False

string_to_find = [' cat ',
                  'Cat ',
                  ' cat.']
for ii in range(0,len(string_to_find)):
    df1 = pd.DataFrame({'dummy': [string_to_find[ii]] * len(df)})
    df['contains_cat'] = df['contains_cat'] | \
                         [x[0] in x[1] for x in zip(df1['dummy'], df['sentence'])]

print(df)

Upvotes: 0

Views: 1945

Answers (1)

Tim Biegeleisen

Reputation: 522084

Use str.contains:

df["contains_cat"] = df["sentence"].str.contains(r'\bcat\b')

Note that the regex pattern \bcat\b matches cat only as a whole word, not as part of a larger word such as caterpillar. str.contains treats the pattern as a regular expression by default (regex=True). The match is case-sensitive by default; pass case=False to also match Cat at the start of a sentence.
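
Since the question also asks about many words, here is a minimal sketch of how the same word-boundary approach might be extended. The word list and the extra column names (contains_table, contains_any) are assumptions made up for illustration and are not part of the original question. re.escape is used so that any regex metacharacters in a word are treated literally.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "name": ["Tom", "Bob"],
    "sentence": [
        "The cat is on the table.",
        "One might say that caterpillars are majestic",
    ],
})

# Hypothetical list of words to look for (not from the original question).
words = ["cat", "table"]

# One boolean column per word, matching whole words only.
# case=False also catches "Cat" at the start of a sentence.
for word in words:
    pattern = rf"\b{re.escape(word)}\b"
    df[f"contains_{word}"] = df["sentence"].str.contains(pattern, case=False, regex=True)

# Alternatively, a single combined pattern answers "does the sentence contain
# any of the words?" in one pass. The non-capturing group (?:...) avoids the
# pandas warning about match groups.
any_pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
df["contains_any"] = df["sentence"].str.contains(any_pattern, case=False, regex=True)

print(df)
```

Whether the combined pattern or the per-word loop is faster likely depends on the number of words and the size of the dataframe, so it is worth timing both on real data. Also note that \b only behaves as a word boundary next to letters, digits, or underscores, so words that begin or end with punctuation would need a different delimiter.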

Upvotes: 1
