Reputation: 133
I am using Python with pandas.
I have a dataframe like this:
       date      id             my_column
0  31.07.20  128909         ['hey', 'hi']
1  31.07.20  128914                ['hi']
3  31.07.20  853124  ['hi', 'hello', 'hey']
4  30.07.20  123456               ['hey']
...
The dataframe is over 1,000,000 rows long. I want the 10 most common words in the my_column column.
Appreciate any help.
Upvotes: 1
Views: 690
Reputation: 862511
Use Series.explode with Series.value_counts. By default value_counts sorts the counts in descending order, so the top 10 words are the first 10 index values:
out = df['my_column'].explode().value_counts().index[:10].tolist()
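As a minimal sketch, here is the same idea run on the four sample rows from the question. The small dataframe is rebuilt by hand for illustration, and the names counts, top_counts and top_words are my own. Using head(10) instead of index[:10] keeps the counts next to the words, which may be handy for checking the result:

```python
import pandas as pd

# Sample rows copied from the question; the real frame has over 1,000,000 rows.
df = pd.DataFrame({
    "date": ["31.07.20", "31.07.20", "31.07.20", "30.07.20"],
    "id": [128909, 128914, 853124, 123456],
    "my_column": [["hey", "hi"], ["hi"], ["hi", "hello", "hey"], ["hey"]],
})

# explode() gives every list element its own row; value_counts() tallies the
# elements and sorts the tallies in descending order.
counts = df["my_column"].explode().value_counts()

top_counts = counts.head(10)            # words together with their counts
top_words = top_counts.index.tolist()   # words only, most frequent first
```

With this sample, 'hey' and 'hi' each appear 3 times and 'hello' once, so 'hello' comes last in top_words.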
Alternatively, use a pure Python solution to flatten the lists and count the top 10:
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(df['my_column']))
out = [a for a, b in c.most_common(10)]
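One thing to check before running either version: both assume my_column holds real Python lists. If the frame was loaded from a CSV, the cells may instead be strings that only look like lists. That is an assumption about the data, not something the question states. A hedged sketch of the Counter approach that handles that hypothetical case with ast.literal_eval (the names parsed, top and words are illustrative):

```python
import ast
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical case: the lists were read from a CSV and are stored as text.
df = pd.DataFrame({
    "my_column": ["['hey', 'hi']", "['hi']", "['hi', 'hello', 'hey']", "['hey']"],
})

# Parse a cell only when it is a string; cells that are already lists pass through.
parsed = df["my_column"].map(
    lambda v: ast.literal_eval(v) if isinstance(v, str) else v
)

# Flatten all lists into one stream of words and count them.
c = Counter(chain.from_iterable(parsed))

top = c.most_common(10)              # list of (word, count) pairs
words = [word for word, _ in top]    # words only, most frequent first
```

most_common(10) returns fewer than 10 pairs when there are fewer than 10 distinct words, as in this sample.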
Upvotes: 4