Pyd

Reputation: 6159

How to count occurrences of a list of keywords in a DataFrame column in Python

 my_list=["one","is"]

 df
 Out[6]:
        Name    Story
   0    Kumar   Kumar is one of the great player in his team
   1    Ravi    Ravi is a good poet
   2    Ram     Ram drives well

If any of the items in my_list is present in the "Story" column, I need the number of occurrences of each item.

 my_desired_output

 new_df
 word     count
 one       1
 is        2

I managed to extract the rows that contain any of the items in my_list using

mask=df["Story"].str.contains('|'.join(my_list),na=False) but now I am trying to get the count of each word in my_list.

Upvotes: 1

Views: 92

Answers (1)

jezrael

Reputation: 862761

You can first use str.split with stack to get a Series of words:

a = df['Story'].str.split(expand=True).stack()
print (a)
0  0     Kumar
   1        is
   2       one
   3        of
   4       the
   5     great
   6    player
   7        in
   8       his
   9      team
1  0      Ravi
   1        is
   2         a
   3      good
   4      poet
2  0       Ram
   1    drives
   2      well
dtype: object

Then filter by boolean indexing with isin, count with value_counts, and to get a DataFrame add rename_axis and reset_index:

# use a new name so the original df (and its Story column) is not overwritten
new_df = a[a.isin(my_list)].value_counts().rename_axis('word').reset_index(name='count')
print (new_df)
  word  count
0   is      2
1  one      1
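
For reference, the steps above can be put together into one runnable sketch. It rebuilds the sample data from the question; the names words, counts and new_df are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["Kumar", "Ravi", "Ram"],
    "Story": [
        "Kumar is one of the great player in his team",
        "Ravi is a good poet",
        "Ram drives well",
    ],
})
my_list = ["one", "is"]

# One word per row: split on whitespace, then stack the columns into a Series
words = df["Story"].str.split(expand=True).stack()

# Keep only the keywords, then count how often each one appears
counts = words[words.isin(my_list)].value_counts()

# Turn the counts into a two-column DataFrame
new_df = counts.rename_axis("word").reset_index(name="count")
print(new_df)
```

Because the text is split on whitespace, only whole tokens are counted: "is" does not match inside "his", whereas the str.contains mask from the question matches substrings as well. The comparison is case-sensitive and does not strip punctuation.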

Another solution: create a list of all words with str.split, flatten it with chain.from_iterable, count with Counter, and finally create the DataFrame with the constructor:

from collections import Counter
from itertools import chain
import pandas as pd

my_list=["one","is"]

a = list(chain.from_iterable(df['Story'].str.split().values.tolist()))
print (a)
['Kumar', 'is', 'one', 'of', 'the', 'great', 'player', 
 'in', 'his', 'team', 'Ravi', 'is', 'a', 'good', 'poet', 'Ram', 'drives', 'well']

b = Counter([x for x in a if x in my_list])
print (b)
Counter({'is': 2, 'one': 1})

new_df = pd.DataFrame({'word':list(b.keys()),'count':list(b.values())}, columns=['word','count'])
print (new_df)
  word  count
0  one      1
1   is      2
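
If the result should follow the order of my_list and report 0 for keywords that never occur, a variant of the Counter approach might look like the sketch below. It assumes the sample data from the question and no missing values in Story; wanted and new_df are illustrative names.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["Kumar", "Ravi", "Ram"],
    "Story": [
        "Kumar is one of the great player in his team",
        "Ravi is a good poet",
        "Ram drives well",
    ],
})
my_list = ["one", "is"]

# A set makes the membership test cheap for long keyword lists
wanted = set(my_list)

# Flatten the per-row word lists and count only the wanted words
# (assumes Story has no NaN values; str.split would yield NaN for those rows)
counts = Counter(
    w for w in chain.from_iterable(df["Story"].str.split()) if w in wanted
)

# A Counter returns 0 for missing keys, so every keyword gets a row
new_df = pd.DataFrame({
    "word": my_list,
    "count": [counts[w] for w in my_list],
})
print(new_df)
```

Building the frame from my_list rather than from the Counter keys keeps the row order independent of the order in which the words happen to appear in the text.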

Upvotes: 1
