How to iterate list and group by word frequency

Question

I'm trying to count the number of times each word appears in a column loaded from a CSV file. I've tried:

df['Size'] = df['Interests'].str.extract('([\S]*[\w])')
sizes = df.groupby('Size').size()

Here, Interests is the column I'm analyzing. However, that code does not work: it only picks up the first word of each row. For instance, if my Interests column contains the following entries:

Apple, Banana, Pear, Peach
Banana, Orange
Strawberry, Apple, Banana
Mango, Pear, Orange

Then sizes will contain the following:

Apple        1
Banana       1
Strawberry   1
Mango        1

Instead of

Apple        2
Banana       3
Strawberry   1
Mango        1
Pear         2
Peach        1
Orange       2

How can I fix this? I've tried putting it in a loop, but I get errors. For instance, if I do:

for i in df['Interests']:
      df['Size'] = i.str.extract('([\S]*[\w])')
sizes = df.groupby('Size').size()

I get the error 'float' object has no attribute 'str'.

I've also tried:

for i in range(df['Interests']):

But I get: TypeError: 'Series' object cannot be interpreted as an integer

Any suggestions on how to fix this? Thank you.

Michael · Accepted Answer

I don't think that the basic pandas methods are going to be enough for this problem, since it seems like you want to count words within entries rather than simply counting entries matching some criteria. You'll probably need to write something that iterates through entries and then words within entries. Accumulating the results in a dictionary seems sensible to me. Here's an example:

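from collections import defaultdict

counts = defaultdict(int)
# dropna() skips empty cells, which pandas reads as float NaN
# (the source of the "'float' object has no attribute" error)
for entry in df['Interests'].dropna():
    for word in entry.split(','):
        # Strip the space left after each comma; do any other massaging here,
        # e.g. word.lower() if you want to be case-insensitive
        word = word.strip()
        counts[word] += 1

# counts now maps each word in the column to the number of times it appears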
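
To check the idea without the CSV file, here is a minimal self-contained sketch of the same approach. The helper name count_words and the rows list are made up for illustration (the list stands in for df['Interests'], with one missing value added), and it uses collections.Counter instead of defaultdict, which is just a convenience:

```python
from collections import Counter

def count_words(entries):
    """Count comma-separated words across an iterable of entries."""
    counts = Counter()
    for entry in entries:
        # Skip missing values (pandas represents empty cells as float NaN)
        if not isinstance(entry, str):
            continue
        # Strip whitespace around each word and ignore empty pieces
        counts.update(w.strip() for w in entry.split(',') if w.strip())
    return counts

# Sample rows from the question, plus one missing value (hypothetical stand-in data)
rows = [
    "Apple, Banana, Pear, Peach",
    "Banana, Orange",
    float("nan"),
    "Strawberry, Apple, Banana",
    "Mango, Pear, Orange",
]
counts = count_words(rows)
print(counts.most_common())
```

With the sample rows, Banana comes out as 3 and Apple as 2, matching the output you wanted. Since iterating over a Series yields its values, count_words(df['Interests']) should work on the real column too, and pd.Series(counts) would turn the result back into a Series. That said, if your pandas version has Series.explode, a chain along the lines of df['Interests'].str.split(',').explode().str.strip().value_counts() may give the same counts without an explicit loop.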
