Reputation: 13
I have a dataframe like this:
| name | sentence |
| --- | --- |
| Tom | The cat is on the table. |
| Bob | One might say that caterpillars are majestic |
I want to get as a result a dataframe like this:
| name | sentence | contains_cat |
| --- | --- | --- |
| Tom | The cat is on the table. | True |
| Bob | One might say that caterpillars are majestic | False |
So the column "contains_cat" has to be True only if the corresponding row of the "sentence" column contains exactly the word "cat" (not "caterpillar", for example).
I wrote code that does this by searching for strings like " cat " or " cat.". Is it possible to speed this up, considering that I'd like to do this for large dataframes and for many words?
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
                   'sentence': ['The cat is on the table.',
                                'One might say that caterpillars are majestic']})
df['contains_cat'] = False
string_to_find = [' cat ', 'Cat ', ' cat.']

# For each search string, check every sentence and OR the result
# into the existing "contains_cat" column.
for ii in range(len(string_to_find)):
    df1 = pd.DataFrame({'dummy': [string_to_find[ii]] * len(df)})
    df['contains_cat'] = df['contains_cat'] | \
        [x[0] in x[1] for x in zip(df1['dummy'], df['sentence'])]

print(df)
Upvotes: 0
Views: 1945
Reputation: 522084
Use str.contains:
df["contains_cat"] = df["sentence"].str.contains(r'\bcat\b')
Note that the regex pattern \bcat\b will only match the exact word cat, not cat as a substring of a larger word such as caterpillar. Regex search is enabled by default with str.contains.
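Since you mention wanting to do this for many words, here is a minimal sketch of one way to extend the same idea, assuming a hypothetical list named words (that name and the generated contains_<word> columns are illustrative, not part of the answer above): wrap each word in \b...\b, escape it with re.escape in case it contains regex metacharacters, and pass case=False so capitalized occurrences like "Cat" are also matched.

import re
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Bob'],
                   'sentence': ['The cat is on the table.',
                                'One might say that caterpillars are majestic']})

# Hypothetical list of words to search for; one boolean column is added per word.
words = ['cat', 'table']

for w in words:
    # \b ... \b restricts the match to whole words, re.escape guards against
    # regex metacharacters, and case=False makes the match case-insensitive.
    df['contains_' + w] = df['sentence'].str.contains(
        r'\b' + re.escape(w) + r'\b', case=False)

print(df)

This replaces the hand-built list of padded substrings with one vectorized str.contains call per word, and the case-insensitive match also covers the capitalized 'Cat ' entry from your original string_to_find list.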
Upvotes: 1