Reputation: 1424
All the other answers I could find specifically referred to aggregating across all of the nested lists within a list of lists, whereas I'm looking to aggregate separately for each list.
I currently have a list of lists:
master_list = [['a','a','b','b','b','c','c','c'], ['d','d','d','a','a','a','c','c','c'], ['c','c','c','a','a','f','f','f']]
I want to return a dictionary or Counter() object for each list with a loop:
counter1 = {'a': 2, 'b': 3, 'c': 3}
counter2 = {'d': 3, 'a': 3, 'c': 3}
counter3 = {'c': 3, 'a': 2, 'f': 3}
Currently, I'm returning something that looks like this using a loop - it's not exactly what I want as it's all lumped together and I'm having trouble accessing the counter objects separately:
Input:
from collections import Counter
count = Counter()
for lists in master_list:
    for words in lists:
        count[words] += 1
Output:
Counter({'a': 2, 'b': 3, 'c': 3})
Counter({'d': 3, 'a': 3, 'c': 3})
Counter({'c': 3, 'a': 2, 'f': 3})
The problem with the above is that I can't figure out how to grab each Counter individually. I'm trying to create a pandas dataframe for each one of these dictionaries/Counter objects, and I need to do this programmatically because my "master_list" has hundreds of lists within it. I want a dataframe that shows the frequency of the elements for each separate list, so in the end I would have a separate dataframe and Counter object for every list within "master_list".
Currently I have something that returns only 1 dataframe:
Input:
table = pandas.DataFrame(count.items())
table.columns = ['Word', 'Frequency']
table.sort_values(by=['Frequency'], ascending = [False])
Output:
Word Frequency
the 542
and 125
or 45
. .
. .
. .
. .
Any insight would be appreciated. Any tips on handling Counter() objects separately would also be appreciated.
Upvotes: 1
Views: 1650
Reputation: 210882
IMO, this question is a good chance to show the real power of pandas. Instead of counting the boring [a,a,b,b,b,c,c,c], [d,d,d,a,a,a,c,c,c], [c,c,c,a,a,f,f,f], let's count the frequency of words in real books. I've chosen the following three: 'Faust', 'Hamlet', 'Macbeth'.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import defaultdict
import string
import requests
import pandas as pd
books = {
'Faust': 'http://www.gutenberg.org/cache/epub/2229/pg2229.txt',
'Hamlet': 'http://www.gutenberg.org/cache/epub/2265/pg2265.txt',
'Macbeth': 'http://www.gutenberg.org/cache/epub/2264/pg2264.txt',
}
# prepare translate table, which will remove all punctuations and digits
chars2remove = list(string.punctuation + string.digits)
transl_tab = str.maketrans(dict(zip(chars2remove, list(' ' * len(chars2remove)))))
# replace 'carriage return' and 'new line' characters with spaces
transl_tab[10] = ' '
transl_tab[13] = ' '
def tokenize(s):
    return s.translate(transl_tab).lower().split()
def get_data(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        return r.text
    else:
        r.raise_for_status()
# generate DF containing words from books
# (each book is downloaded and tokenized once, then reused below)
d = defaultdict(list)
for name, url in books.items():
    d[name] = tokenize(get_data(url))
df = pd.concat([pd.DataFrame({'book': name, 'word': words})
                for name, words in d.items()], ignore_index=True)
# let's count the frequency
frequency = df.groupby(['book','word']) \
              .size() \
              .sort_values(ascending=False)
# output
print(frequency.head(30))
print('[Macbeth]: macbeth\t', frequency.loc['Macbeth', 'macbeth'])
print('[Hamlet]: nay\t', frequency.loc['Hamlet', 'nay'])
print('[Faust]: faust\t', frequency.loc['Faust', 'faust'])
Output:
book     word
Hamlet   the      1105
         and       919
Faust    und       918
Hamlet   to        760
Macbeth  the       759
Hamlet   of        698
Faust    ich       691
         die       668
         der       610
Macbeth  and       602
Hamlet   you       588
         i         560
         a         542
         my        506
Macbeth  to        460
Hamlet   it        439
Macbeth  of        426
Faust    nicht     426
Hamlet   in        409
Faust    das       403
         ein       399
         zu        380
Hamlet   that      379
Faust    in        365
         ist       363
Hamlet   is        346
Macbeth  i         344
Hamlet   ham       337
         this      328
         not       316
dtype: int64
[Macbeth]: macbeth 67
[Hamlet]: nay 27
[Faust]: faust 272
Upvotes: 1
Reputation: 8071
You can create a list and append the counters to it. (Also, you are using Counter, but still doing the counts yourself, which is unnecessary.)
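from collections import Counter
master_list = [['a','a','b','b','b','c','c','c'], ['d','d','d','a','a','a','c','c','c'], ['c','c','c','a','a','f','f','f']]
counters = []
for list_ in master_list:
    counters.append(Counter(list_))  # Counter does the counting itself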
Now you can address each separate list with counters[i].
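Since the question also asks for one dataframe per list, here is a rough sketch of how the list of counters could be turned into a list of dataframes. The name `tables` is hypothetical, the column names are taken from the question, and the elements are assumed to be strings.

```python
from collections import Counter

import pandas as pd

# Sample data from the question, with the elements written as strings (assumption).
master_list = [
    ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    ['d', 'd', 'd', 'a', 'a', 'a', 'c', 'c', 'c'],
    ['c', 'c', 'c', 'a', 'a', 'f', 'f', 'f'],
]

# One Counter per inner list.
counters = [Counter(list_) for list_ in master_list]

# One DataFrame per Counter. most_common() returns (element, count) pairs
# ordered from the most frequent to the least frequent.
# 'tables' is a hypothetical name used only in this sketch.
tables = [
    pd.DataFrame(counter.most_common(), columns=['Word', 'Frequency'])
    for counter in counters
]

print(tables[0])
```

Here tables[i] lines up with counters[i], so both the Counter and the dataframe for a given list can be reached through the same index.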
Upvotes: 1