Reputation: 27
I have an excel file with many rows and columns. I want to do the following. First, I want to filter the rows based on a text match. Second, I want to choose a particular column and generate word frequency for ALL THE WORDS in that column. Third, I want to graph the word and frequency.
I have figured out the first part. My question is how to apply Counter() on a dataframe. If I just use Counter(df), it returns an error. So, I used the following code to convert each row into a list and then applied Counter. When I do this, I get the word frequency for each row separately (if I use counter within the for loop, else I get the word frequency for just one row). However, I want a word count for all the rows put together. Appreciate any inputs. Thanks! The following is an example data.
product review
a Great Product
a Delivery was fast
a Product received in good condition
a Fast delivery but useless product
b Dont recommend
b I love it
b Please dont buy
b Second purchase
My desired output is like this: for product a - (product,3),(delivery,2)(fast,2) etc..my current output is like (great,1), (product,1) for the first row.
This is the code I used.
strdata = column.values.tolist()
tokens = [tokenizer.tokenize(str(i)) for i in strdata]
cleaned_list = []
for m in tokens:
stopped = [i for i in m if str(i).lower() not in stop_words]
stemmed = [stemmer.stem(i) for i in stopped]
cleaned_list.append(stopped) #append stemmed words to list
count = Counter(stemmed)
print(count.most_common(10))
Upvotes: 0
Views: 454
Reputation: 6226
You can use the following function. The idea is
byvar
. Combine every words in yvar
as a list.Counter
and, if you want, select the most commonword
and count
in a new dataframe)
:from collections import Counter
import pandas as pd
def count_words_by(data, yvar, byvar):
cw = pd.DataFrame({'counter' : data
.groupby(byvar)
.apply(lambda s: ' '.join(s[yvar]).split())
.apply(lambda s: Counter(s))
# .apply(lambda s: s.most_common(10)) #uncomment this line if you want the top 10 words
.explode()}
)
cw[['word','count']] = pd.DataFrame(cw['counter'].tolist(), index=cw.index)
cw_red = cw[['word','count']].reset_index()
return cw_red
count_words_by(data = df, yvar = "review", byvar = "product")
where I assume you start from there:
product review
a Great Product
a Delivery was fast
a Product received in good condition
a Fast delivery but useless product
b Dont recommend
b I love it
b Please dont buy
b Second purchase
Upvotes: 0
Reputation: 30052
Firstly, using groupby concatenate strings from same group.
Secondly, apply Counter()
on joined strings.
joined = df.groupby('product', as_index=False).agg({'review' : ' '.join})
joined['count'] = joined.apply(lambda x: collections.Counter(x['review'].split(' ')), axis=1)
# print(joined)
product review count
0 a Great Product Delivery was fast Product receiv... {'Great': 1, 'Product': 2, 'Delivery': 1, 'was...
1 b Dont recommend I love it Please dont buy Secon... {'Dont': 1, 'recommend': 1, 'I': 1, 'love': 1,...
Upvotes: 1