mk8efz

Reputation: 1424

Creating a separate Counter() object and Pandas DataFrame for each list within a list of lists

All the other answers I could find specifically refer to aggregating across all of the nested lists within a list of lists, whereas I'm looking to aggregate separately for each list.

I currently have a list of lists:

master_list = [['a','a','b','b','b','c','c','c'], ['d','d','d','a','a','a','c','c','c'], ['c','c','c','a','a','f','f','f']]

I want to return a dictionary or Counter() object for each list with a loop:

counter1 = {'a': 2, 'b': 3, 'c': 3}
counter2 = {'d': 3, 'a': 3, 'c': 3}
counter3 = {'c': 3, 'a': 2, 'f': 3}

Currently, I'm returning something that looks like this using a loop - it's not exactly what I want as it's all lumped together and I'm having trouble accessing the counter objects separately:

Input:

from collections import Counter

count = Counter()
for lists in master_list:
    for words in lists:
        count[words] += 1


Output:

Counter({'a': 2, 'b': 3, 'c': 3})
Counter({'d': 3, 'a': 3, 'c': 3})
Counter({'c': 3, 'a': 2, 'f': 3})

The problem with the above is that I can't figure out how to grab each Counter individually, which I need because I'm trying to create a pandas DataFrame for each of these dictionaries/Counter objects. I'm trying to do this programmatically because my master_list has hundreds of lists in it, and I want a DataFrame that shows the frequency of the elements for each separate list. In the end I would have a separate DataFrame and Counter object for every list within master_list.

Currently I have something that returns only 1 dataframe:

Input:

import pandas

table = pandas.DataFrame(count.items())
table.columns = ['Word', 'Frequency']
table.sort_values(by='Frequency', ascending=False)


Output:

Word   Frequency
the    542
and    125
or     45
.      .
.      .
.      .
.      .

Any insight would be appreciated, as would any tips on handling Counter() objects separately.

Upvotes: 1

Views: 1650

Answers (2)

MaxU - stand with Ukraine

Reputation: 210882

IMO, this question can show the real power of pandas. Let's do the following: instead of counting the boring [a,a,b,b,b,c,c,c], [d,d,d,a,a,a,c,c,c], [c,c,c,a,a,f,f,f], we will count the frequency of words in real books. I've chosen the following three: 'Faust', 'Hamlet', 'Macbeth'.

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from collections import defaultdict
import string
import requests
import pandas as pd

books = {
  'Faust': 'http://www.gutenberg.org/cache/epub/2229/pg2229.txt',
  'Hamlet': 'http://www.gutenberg.org/cache/epub/2265/pg2265.txt',
  'Macbeth': 'http://www.gutenberg.org/cache/epub/2264/pg2264.txt',
}

# prepare a translation table that replaces all punctuation and digits with spaces
chars2remove = list(string.punctuation + string.digits)
transl_tab = str.maketrans(dict(zip(chars2remove, list(' ' * len(chars2remove)))))
# replace 'carriage return' and 'new line' characters with spaces
transl_tab[10] = ' '
transl_tab[13] = ' '


def tokenize(s):
    return s.translate(transl_tab).lower().split()

def get_data(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        return r.text
    else:
        r.raise_for_status()

# download and tokenize each book once
d = defaultdict(list)
for name, url in books.items():
    d[name] = tokenize(get_data(url))

# generate DF containing words from books, reusing the tokens fetched above
df = pd.concat([pd.DataFrame({'book': name, 'word': words})
                for name, words in d.items()], ignore_index=True)

# let's count the frequency
frequency = df.groupby(['book','word']) \
              .size() \
              .sort_values(ascending=False)

# output
print(frequency.head(30))
print('[Macbeth]: macbeth\t', frequency.loc['Macbeth', 'macbeth'])
print('[Hamlet]: nay\t', frequency.loc['Hamlet', 'nay'])
print('[Faust]: faust\t', frequency.loc['Faust', 'faust'])

Output:

book     word
Hamlet   the      1105
         and       919
Faust    und       918
Hamlet   to        760
Macbeth  the       759
Hamlet   of        698
Faust    ich       691
         die       668
         der       610
Macbeth  and       602
Hamlet   you       588
         i         560
         a         542
         my        506
Macbeth  to        460
Hamlet   it        439
Macbeth  of        426
Faust    nicht     426
Hamlet   in        409
Faust    das       403
         ein       399
         zu        380
Hamlet   that      379
Faust    in        365
         ist       363
Hamlet   is        346
Macbeth  i         344
Hamlet   ham       337
         this      328
         not       316
dtype: int64

[Macbeth]: macbeth      67
[Hamlet]: nay    27
[Faust]: faust   272
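For comparison, the same groupby idea can be applied directly to the small master_list from the question. This is only a sketch: it assumes the list elements are strings, and the column names 'list_id' and 'word' are made up for illustration.

```python
import pandas as pd

# Sample data from the question, assuming the elements are strings
master_list = [
    ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    ['d', 'd', 'd', 'a', 'a', 'a', 'c', 'c', 'c'],
    ['c', 'c', 'c', 'a', 'a', 'f', 'f', 'f'],
]

# Long-format frame: one row per (list index, word) pair
df = pd.DataFrame(
    [(i, word) for i, list_ in enumerate(master_list) for word in list_],
    columns=['list_id', 'word'],
)

# Frequency of each word within each list, as a Series with a MultiIndex
frequency = df.groupby(['list_id', 'word']).size()

# Counts for the second list only (index 1), indexed by word
second = frequency.loc[1]
print(second)
```

Keeping everything in one frame avoids juggling hundreds of separate objects; a single list's counts can still be pulled out with frequency.loc[i].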

Upvotes: 1

Paulo Almeida

Reputation: 8071

You can create a list and append the counters to it. (Also, you are using Counter, but still doing the counts yourself, which is unnecessary.)

from collections import Counter

master_list = [['a','a','b','b','b','c','c','c'], ['d','d','d','a','a','a','c','c','c'], ['c','c','c','a','a','f','f','f']]
counters = []
for list_ in master_list:
    counters.append(Counter(list_))

Now you can address each separate list with counters[i].
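The same pattern can be extended to build one DataFrame per list, which is what the question ultimately asks for. The following is a minimal sketch, assuming the elements are strings and pandas is installed; the list name `tables` is hypothetical, and the column names are taken from the question.

```python
from collections import Counter

import pandas as pd

# Sample data from the question, assuming the elements are strings
master_list = [
    ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    ['d', 'd', 'd', 'a', 'a', 'a', 'c', 'c', 'c'],
    ['c', 'c', 'c', 'a', 'a', 'f', 'f', 'f'],
]

# One Counter per inner list
counters = [Counter(list_) for list_ in master_list]

# One DataFrame per Counter; most_common() returns (word, count) pairs
# ordered from the most frequent to the least frequent
tables = [
    pd.DataFrame(counter.most_common(), columns=['Word', 'Frequency'])
    for counter in counters
]

# counters[i] and tables[i] both refer to master_list[i]
print(counters[1])
print(tables[1])
```

Because both lists are built in the same order, the index i ties each Counter and DataFrame back to its source list, which should scale to hundreds of lists without naming each object by hand.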

Upvotes: 1
