Reputation: 101
I have a pandas dataframe that contains a column 'reviews' and a column 'year'. I would like to view the top 100 most frequently occurring words in the reviews column, but filtered by year. So, I want to know the top 100 from 2002, 2003, 2004, and so on, all the way up to 2017.
import pandas as pd
from nltk.corpus import stopwords
df=pd.read_csv('./reviews.csv')
stop = stopwords.words('english')
commonwords = pd.Series(' '.join(df['reviews']).lower().split()).value_counts()[:100]
print(commonwords)
df.to_csv('commonwords.csv', index=False)
The above code works but it only gives me the top 100 most frequently occurring words across all years.
Upvotes: 2
Views: 1883
Reputation: 863451
You can use:
df = pd.DataFrame({'reviews':['He writer in me great great me',
'great ambience the coffee was great',
'great coffee'],
'year':[2002,2004,2004]})
print (df)
reviews year
0 He writer in me great great me 2002
1 great ambience the coffee was great 2004
2 great coffee 2004
#change for 100 for top100 in real data
N = 3
df1 = (df.set_index('year')['reviews']
.str.lower()
.str.split(expand=True)
.stack()
.groupby(level=0)
.value_counts()
.groupby(level=0)
.head(N)
.rename_axis(('year','words'))
.reset_index(name='count'))
print (df1)
year words count
0 2002 great 2
1 2002 me 2
2 2002 he 1
3 2004 great 3
4 2004 coffee 2
5 2004 ambience 1
Explanation:
Series.str.lower
with Series.str.split
for DataFrame
DataFrame.stack
for MultiIndex Series
SeriesGroupBy.value_counts
, values are sortedN
values by GroupBy.head
DataFrame.rename_axis
with DataFrame.reset_index
Upvotes: 2
Reputation: 7332
Before creating the commonwords data-frame, you can create another dataframe using groupby operation like so df.groupby(['year', 'reviews'])
. Then use the reset_index operation this so that you can use it for filtering the top 100.
Other than resetting index, you can also refer to the answers in this question for further idea's.
Upvotes: 1