KRUPALI RAO

Reputation: 35

How to find the most common word in an entire column of strings in Python

I'm new to Python and learning it for data analysis.

I have a dataset containing a column named tags. Each YouTube tag value is a string containing various words. I need to find the most commonly used words in the entire column.

Dataset name: youtube_df

column name: tags

tags_split = youtube_df.tags.head(3)
tags_split

import re
from collections import Counter

for t in tags_split:
   #print(t)
   split_strng = re.findall(r"[\w]+",t)
   print(split_strng)
   counter = Counter(split_strng)
   most_common = counter.most_common(3)
   print(most_common)

Output

['Eminem', 'Walk', 'On', 'Water', 'Aftermath', 'Shady', 'Interscope', 'Rap']
[('Eminem', 1), ('Walk', 1), ('On', 1)]
['plush', 'bad', 'unboxing', 'unboxing', 'fan', 'mail', 'idubbbztv', 'idubbbztv2', 'things', 
'best', 'packages', 'plushies', 'chontent', 'chop']
[('unboxing', 2), ('plush', 1), ('bad', 1)]
['racist', 'superman', 'rudy', 'mancuso', 'king', 'bach', 'racist', 'superman', 'love', 'rudy', 
'mancuso', 'poo', 'bear', 'black', 'white', 'official', 'music', 'video', 'iphone', 'x', 'by', 
'pineapple', 'lelepons', 'hannahstocking', 'rudymancuso', 'inanna', 'anwar', 'sarkis', 'shots', 
'shotsstudios', 'alesso', 'anitta', 'brazil', 'Getting', 'My', 'Driver', 's', 'License', 'Lele', 
'Pons']
[('racist', 2), ('superman', 2), ('rudy', 2)]

I want to count how many times each word is used across the entire column, so that I can tell which word is the most commonly used in tags. The code above only counts the words within each row.

Can anyone suggest the best way to do this? I would really appreciate any help.

Upvotes: 0

Views: 253

Answers (2)

the-veloper

Reputation: 302

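As far as I understand, you want a single Counter covering all of the tags in tags_split, rather than a new Counter for each row.

Check out the Counter.update() method in the Python standard library. It adds counts to an existing counter instead of replacing them.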

import re
from collections import Counter

tags_split = youtube_df.tags.head(3)

counter = Counter() # Initializing a counter variable

for tag in tags_split:
   split_strng = re.findall(r"\w+",tag)
   counter.update(split_strng)

most_common = counter.most_common(3)
print(most_common)
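
For readers without the dataset, here is a minimal self-contained sketch of the same idea. The list sample_tags is a made-up stand-in for youtube_df.tags, and count_words is a hypothetical helper name; neither comes from the question.

```python
import re
from collections import Counter

# Hypothetical stand-in for youtube_df.tags: one tag string per row.
sample_tags = [
    "Eminem Walk On Water Rap",
    "plush unboxing unboxing fan mail",
    "rap music video unboxing",
]


def count_words(rows):
    """Accumulate word counts across every row in `rows`."""
    counter = Counter()
    for row in rows:
        # update() adds this row's words to the running totals.
        counter.update(re.findall(r"\w+", row))
    return counter


counts = count_words(sample_tags)
print(counts.most_common(1))  # most frequent word across all rows
print(counts["unboxing"])     # count for one particular word
```

To cover the whole column rather than the first three rows, the loop can iterate over youtube_df.tags instead of youtube_df.tags.head(3). If the column may contain missing values, dropping them first (for example with youtube_df.tags.dropna()) avoids re.findall raising a TypeError on a non-string entry. Note that the counts are case-sensitive, so "Rap" and "rap" are tallied separately.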

Upvotes: 1

IoaTzimas

Reputation: 10624

You can try this:

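import re
from collections import Counter

m = []  # every word from every row
for t in tags_split:
   split_strng = re.findall(r"[\w]+", t)
   m.extend(split_strng)

l = Counter(m)
# (word, count) pair with the highest count
most_common = max(l.items(), key=lambda x: x[1])
print(most_common)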
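
As a rough check of this approach, the sketch below builds one Counter from all rows and compares the max() lookup with Counter.most_common(1). The sample_tags list is invented for illustration and is not the real dataset.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical stand-in for tags_split: one tag string per row.
sample_tags = [
    "Eminem Walk On Water Rap",
    "plush unboxing unboxing fan mail",
    "rap music video unboxing",
]

# Flatten the per-row word lists into one stream and count it in one pass.
words = chain.from_iterable(re.findall(r"\w+", t) for t in sample_tags)
counts = Counter(words)

# Two ways to get the single most frequent (word, count) pair.
top_by_max = max(counts.items(), key=lambda pair: pair[1])
top_by_method = counts.most_common(1)[0]

print(top_by_max)
print(top_by_method)
```

Both lookups give the same pair here. most_common(n) has the advantage of also returning the top n words when more than one is needed.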

Upvotes: 0
