shorttriptomars

Reputation: 325

How to iterate list and group by word frequency

I'm trying to count how many times each word appears in a column loaded from a CSV file. I've tried:

df['Size'] = df['Interests'].str.extract('([\S]*[\w])')
sizes = df.groupby('Size').size()

Where Interests is the column I'm analyzing. However, that code does not work: it only picks up the first word of each row. For instance, if my Interests column contains the following entries:

Apple, Banana, Pear, Peach
Banana, Orange
Strawberry, Apple, Banana
Mango, Pear, Orange

Then sizes will contain the following:

Apple        1
Banana       1
Strawberry   1
Mango        1

Instead of

Apple        2
Banana       3
Strawberry   1
Mango        1
Pear         2
Peach        1
Orange       2

How can I fix this? I've tried putting it in a loop, but I get errors. For instance, if I do:

for i in df['Interests']:
      df['Size'] = i.str.extract('([\S]*[\w])')
sizes = df.groupby('Size').size()

I get the error 'float' object has no attribute 'str'.

I've also tried: for i in range(df['Interests']):

But get: TypeError: 'Series' object cannot be interpreted as an integer

Any suggestions on how to fix this? Thank you.

Upvotes: 0

Views: 215

Answers (2)

Andreas

Reputation: 9197

You can use collections.Counter, which gives you a dictionary-like object holding the frequencies of the items in a list. Since each row contains a string representation of a list of words, first split the text into those lists.

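from collections import Counter

# Split each row into a list of words, then count the words within that row.
# na_action='ignore' leaves missing values (NaN) as they are instead of raising.
df['Size'] = df['Interests'].str.split(", ").map(Counter, na_action='ignore')
print(df['Size'])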
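
Note that the snippet above stores one Counter per row, while the question asks for totals over the whole column. One way to get there is to feed every row's words into a single Counter. This is only a minimal sketch: the DataFrame is rebuilt from the sample rows in the question, the extra missing row is my assumption (the 'float' error in the question suggests the real column has some), and the name `total` is made up for illustration.

```python
from collections import Counter

import pandas as pd

# Sample rows copied from the question; the trailing None is an assumed missing value.
df = pd.DataFrame({"Interests": [
    "Apple, Banana, Pear, Peach",
    "Banana, Orange",
    "Strawberry, Apple, Banana",
    "Mango, Pear, Orange",
    None,
]})

total = Counter()
# dropna() skips missing rows; each remaining row is split into a list of words.
for words in df["Interests"].dropna().str.split(", "):
    total.update(words)  # add this row's words to the running totals

print(total)
```

With the sample rows, Banana ends up with a count of 3 and Apple, Pear and Orange with 2 each. `total.most_common()` returns the words sorted by frequency if you need them in order.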

Upvotes: 1

Michael

Reputation: 2414

I don't think that the basic pandas methods are going to be enough for this problem, since it seems like you want to count words within entries rather than simply counting entries matching some criteria. You'll probably need to write something that iterates through entries and then words within entries. Accumulating the results in a dictionary seems sensible to me. Here's an example:

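from collections import defaultdict

counts = defaultdict(int)
# dropna() skips missing values, which are floats (NaN) and have no .split()
for entry in df['Interests'].dropna():
    for word in entry.split(','):
        # Strip the space left after each comma so ' Banana' and 'Banana' count as the same word.
        # Perform any other massaging required here, e.g. lowercasing if you want to be case-insensitive
        counts[word.strip()] += 1

# counts now maps each word in the entire column to the number of times it appears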
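
For what it's worth, if your pandas version has Series.explode (0.25 or newer), the same tally can probably be done without an explicit loop: split each entry, explode the lists into one word per row, strip whitespace, and count. Treat this as a sketch rather than a drop-in fix; the DataFrame below is rebuilt from the sample rows in the question, and the missing row is an assumption on my part.

```python
import pandas as pd

# Sample rows copied from the question; the trailing None is an assumed missing value.
df = pd.DataFrame({"Interests": [
    "Apple, Banana, Pear, Peach",
    "Banana, Orange",
    "Strawberry, Apple, Banana",
    "Mango, Pear, Orange",
    None,
]})

sizes = (
    df["Interests"]
    .str.split(",")   # each entry becomes a list of words (missing stays missing)
    .explode()        # one word per row
    .str.strip()      # drop the space left after each comma
    .value_counts()   # tally the words; missing values are excluded by default
)

print(sizes)
```

The result is a Series indexed by word, so `sizes["Banana"]` gives 3 for the sample data.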

Upvotes: 1
