ctim

Reputation: 33

Count the number of times multiple substrings appear in dataframe column

I have a dataframe with a column of sentences, and I'm trying to create a new column equal to the number of times the strings in a list appear in each sentence.

For example, the relevant dataframe looks like

book['sentences']
0 The brown dog jumped over the big moon
1 The brown fox slid under the brown log

I'm trying to count the number of times "brown", "over", and "log" show up in each sentence (i.e. the new column would equal 2 and 3).

I know I can do this with str.count, but only for one string at a time, and then I would have to add the results up:

book['count_brown'] = book['sentences'].str.count('brown')
book['count_over'] = book['sentences'].str.count('over')
book['count_log'] = book['sentences'].str.count('log')
book['count'] = book['count_brown']+book['count_over']+book['count_log']

The list of strings I'm searching for is over 300 words long, so even with a loop this doesn't seem optimal. Is there a better way to do this?
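For reference, str.count accepts a regular expression, so the asker's own approach collapses into a single pass by joining all the words into one alternation pattern. A minimal sketch (the words list here is hypothetical; with 300 real words you would want re.escape on each, and \b word boundaries if substring hits like "log" inside "logging" are unwanted):

```python
import re

import pandas as pd

book = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
]})

words = ['brown', 'over', 'log']
# Escape each word, then join into one alternation pattern: brown|over|log.
# str.count applies the regex per row and counts non-overlapping matches.
pattern = '|'.join(re.escape(w) for w in words)
book['count'] = book['sentences'].str.count(pattern)
```

This replaces the 300 separate str.count calls and the final addition with one vectorized pass over the column.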

Upvotes: 2

Views: 1522

Answers (3)

Bharath M Shetty

Reputation: 30605

With the help of nltk's frequency distribution (nltk.FreqDist) you can do this very easily, i.e.

import nltk
import pandas as pd

lst = ['brown', 'over', 'log']
# Tokenize each sentence, count token frequencies, then expand the
# frequency dicts into columns and keep only the words of interest.
ndf = (df['sentences']
       .apply(nltk.tokenize.word_tokenize)
       .apply(nltk.FreqDist)
       .apply(pd.Series)[lst]
       .fillna(0))

Output:

   brown  over  log
0    1.0   1.0  0.0
1    2.0   0.0  1.0

For the row sums:

ndf['count'] = ndf.sum(1)
   brown  over  log  count
0    1.0   1.0  0.0    2.0
1    2.0   0.0  1.0    3.0

Upvotes: 1

cs95

Reputation: 402493

Similar to piRSquared's solution, but uses get_dummies and sum for the counts.

df
                                sentences
0  The brown dog jumped over the big moon
1  The brown fox slid under the brown log

words = ['brown', 'over', 'log']
df = (df.sentences.str.extractall('({})'.format('|'.join(words)))
        .iloc[:, 0].str.get_dummies().sum(level=0))
df
   brown  log  over
0      1    0     1
1      2    1     0

If you want row-wise counts of all words in a single column, just sum along the first axis.

df.sum(1)
0    2
1    3
dtype: int64 

Upvotes: 1

piRSquared

Reputation: 294258

Use str.extractall with groupby and value_counts:

lst = ['brown', 'over', 'log']

book['sentences'].str.extractall(
    '({})'.format('|'.join(lst))
).groupby(level=0)[0].value_counts().unstack(fill_value=0)

0  brown  log  over
0      1    0     1
1      2    1     0

Upvotes: 3
