How to calculate most frequently occurring words in pandas dataframe column by year?

Question

I have a pandas dataframe that contains a column 'reviews' and a column 'year'. I would like to view the top 100 most frequently occurring words in the reviews column, but filtered by year. So, I want to know the top 100 from 2002, 2003, 2004, and so on, all the way up to 2017.

import pandas as pd
from nltk.corpus import stopwords

df=pd.read_csv('./reviews.csv')

stop = stopwords.words('english')

commonwords = pd.Series(' '.join(df['reviews']).lower().split()).value_counts()[:100]

print(commonwords)

df.to_csv('commonwords.csv', index=False)

The above code works but it only gives me the top 100 most frequently occurring words across all years.

jezrael · Accepted Answer

You can use:

df = pd.DataFrame({'reviews':['He writer in me great great me',
                        'great ambience the coffee was great',
                        'great coffee'],
                   'year':[2002,2004,2004]})
print (df)

                               reviews  year
0       He writer in me great great me  2002
1  great ambience the coffee was great  2004
2                         great coffee  2004

#change for 100 for top100 in real data
N = 3
df1 =  (df.set_index('year')['reviews']
          .str.lower()
          .str.split(expand=True)
          .stack()
          .groupby(level=0)
          .value_counts()
          .groupby(level=0)
          .head(N)
          .rename_axis(('year','words'))
          .reset_index(name='count'))

print (df1)
   year     words  count
0  2002     great      2
1  2002        me      2
2  2002        he      1
3  2004     great      3
4  2004    coffee      2
5  2004  ambience      1

Explanation:

Convert values to lowercase by Series.str.lower with Series.str.split for DataFrame
Reshape by DataFrame.stack for MultiIndex Series
count values per groups with SeriesGroupBy.value_counts, values are sorted
get top N values by GroupBy.head
Data cleaning - DataFrame.rename_axis with DataFrame.reset_index

How to calculate most frequently occurring words in pandas dataframe column by year?

Answers (2)

Related Questions