Using Counter() function in python

I have an excel file with many rows and columns. I want to do the following. First, I want to filter the rows based on a text match. Second, I want to choose a particular column and generate word frequency for ALL THE WORDS in that column. Third, I want to graph the word and frequency.

I have figured out the first part. My question is how to apply Counter() on a dataframe. If I just use Counter(df), it returns an error. So, I used the following code to convert each row into a list and then applied Counter. When I do this, I get the word frequency for each row separately (if I use counter within the for loop, else I get the word frequency for just one row). However, I want a word count for all the rows put together. Appreciate any inputs. Thanks! The following is an example data.

product      review
a            Great Product
a            Delivery was fast 
a            Product received in good condition
a            Fast delivery but useless product
b            Dont recommend
b            I love it
b            Please dont buy
b            Second purchase

My desired output is like this: for product a - (product,3),(delivery,2)(fast,2) etc..my current output is like (great,1), (product,1) for the first row.

This is the code I used.

strdata = column.values.tolist() 
tokens = [tokenizer.tokenize(str(i)) for i in strdata] 

cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words] 
    stemmed = [stemmer.stem(i) for i in stopped] 
    cleaned_list.append(stopped) #append stemmed words to list
    count = Counter(stemmed)
    print(count.most_common(10))

Upvotes: 0

Views: 454

Answers (2)

linog
linog

Reputation: 6226

You can use the following function. The idea is

  1. Group your data by byvar. Combine every words in yvar as a list.
  2. Apply Counter and, if you want, select the most common
  3. Explode to get a long formatted dataframe (easier to analyze afterwards)
  4. just keep the relevent columns (word and count in a new dataframe) :
from collections import Counter
import pandas as pd

def count_words_by(data, yvar, byvar):
    cw = pd.DataFrame({'counter' : data
        .groupby(byvar)
        .apply(lambda s: ' '.join(s[yvar]).split())
        .apply(lambda s: Counter(s))
#        .apply(lambda s: s.most_common(10)) #uncomment this line if you want the top 10 words
        .explode()}
    )
    cw[['word','count']] = pd.DataFrame(cw['counter'].tolist(), index=cw.index)
    cw_red = cw[['word','count']].reset_index()

    return cw_red

count_words_by(data = df, yvar = "review", byvar = "product")

where I assume you start from there:

product      review
a            Great Product
a            Delivery was fast 
a            Product received in good condition
a            Fast delivery but useless product
b            Dont recommend
b            I love it
b            Please dont buy
b            Second purchase

Upvotes: 0

Ynjxsjmh
Ynjxsjmh

Reputation: 30052

Firstly, using groupby concatenate strings from same group.

Secondly, apply Counter() on joined strings.

joined = df.groupby('product', as_index=False).agg({'review' : ' '.join})
joined['count'] = joined.apply(lambda x: collections.Counter(x['review'].split(' ')), axis=1)
# print(joined)

  product                                             review                                              count
0       a  Great Product Delivery was fast Product receiv...  {'Great': 1, 'Product': 2, 'Delivery': 1, 'was...
1       b  Dont recommend I love it Please dont buy Secon...  {'Dont': 1, 'recommend': 1, 'I': 1, 'love': 1,...

Upvotes: 1

Related Questions