Reputation: 33
I have a dataframe with a list of sentences in one column and am trying to create a new column equal to the number of times a list of strings show up.
For example, the relevant dataframe looks like
book['sentences']
0 The brown dog jumped over the big moon
1 The brown fox slid under the brown log
I'm trying to count the number of times "brown", "over", and "log" show up in each sentences (i.e. the new column would be equal to 2 and 3).
I know I can do this with str.count, but only for one string at a time and then I would have to add them up
book['count_brown'] = book['sentences'].str.count('brown')
book['count_over'] = book['sentences'].str.count('over')
book['count_log'] = book['sentences'].str.count('log')
book['count'] = book['count_brown']+book['count_over']+book['count_log']
My list of strings I am searching for is over 300 words long so even with a loop it doesn't seem optimal. Is there a better way to do this?
Upvotes: 2
Views: 1522
Reputation: 30605
With the help of nltk
Frequency distribution you can do that very easily i.e
import nltk
lst = ['brown', 'over', 'log']
ndf = df['sentences'].apply(nltk.tokenize.word_tokenize).apply(nltk.FreqDist).apply(pd.Series)[lst].fillna(0)
Output:
brown over log 0 1.0 1.0 0.0 1 2.0 0.0 1.0
For sum
ndf['count'] = ndf.sum(1)
brown over log count 0 1.0 1.0 0.0 2.0 1 2.0 0.0 1.0 3.0
Upvotes: 1
Reputation: 402493
Similar to piRSquared's solution, but uses get_dummies
and sum
for the counts.
df
sentences
0 The brown dog jumped over the big moon
1 The brown fox slid under the brown log
words = ['brown', 'over', 'log']
df = df.sentences.str.extractall('({})'.format('|'.join(words)))\
.iloc[:, 0].str.get_dummies().sum(level=0)
df
brown log over
0 1 0 1
1 2 1 0
If you want row-wise counts of all words in a single column, just sum along the first axis.
df.sum(1)
0 2
1 3
dtype: int64
Upvotes: 1
Reputation: 294258
Ganky!
lst = ['brown', 'over', 'log']
book['sentences'].str.extractall(
'({})'.format('|'.join(lst))
).groupby(level=0)[0].value_counts().unstack(fill_value=0)
0 brown log over
0 1 0 1
1 2 1 0
Upvotes: 3