t__

Reputation: 25

Create New Dictionary from Old Dictionary Pandas DataFrame to calculate entropy

I'm starting to get okay with pandas, but am unsure how to tackle this issue.

I have a column of dictionaries in a pandas dataframe that I am trying to calculate the entropy of.

Each key in the dictionary denotes a cluster, and the values are the words in that cluster. Each row looks like this, though the number of clusters varies: some dictionaries have two clusters, while others have up to 10:

  {1: ["'stop'", "'avoid'", "'stifle'", "'not'", "'squelch'", "'contain'", "'cover'", "'suppress'"], 2: ["'hold'"], 3: ["'burke'"], 4: ["'hod'"]}

I want to calculate the entropy of each row, but I want all the words in a cluster to be treated as the same word. Ideally, the above example would end up looking like this:

{1: ["'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'stop'"], 2: ["'hold'"], 3: ["'burke'"], 4: ["'hod'"]}

Finally, I hope to take the values from all the clusters and lump them into one single list, like the one below, so I can run my entropy formula on it:

["'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'hold'", "'burke'", "'hod'"]

I am struggling to find a way to use pandas or more basic python to create new dictionaries with clusters that look like my second example and then turn those values into a list like my third example.

Upvotes: 0

Views: 138

Answers (1)

andrew_reece

Reputation: 21274

It's not clear how the entropy calculation fits into your specified input and output, but here's one way to get the output you want, using a mix of Pandas and basic Python.

import pandas as pd

data = {1: ["'stop'", "'avoid'", "'stifle'", "'not'", "'squelch'", 
            "'contain'", "'cover'", "'suppress'"], 
        2: ["'hold'"], 
        3: ["'burke'"], 
        4: ["'hod'"]}
s = pd.Series(data)

s
1    ['stop', 'avoid', 'stifle', 'not', 'squelch', ...
2                                             ['hold']
3                                            ['burke']
4                                              ['hod']
dtype: object

Take the first element of each list and repeat it once per word in that list, adding a space after each copy to split on later:

s2 = s.apply(lambda x: (x[0]+" ")*len(x))

s2
1    'stop' 'stop' 'stop' 'stop' 'stop' 'stop' 'sto...
2                                              'hold' 
3                                             'burke' 
4                                               'hod' 
dtype: object
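One caveat: building a space-separated string only works when no cluster word contains a space. A minimal sketch of a variant that keeps lists throughout, assuming the same kind of Series as `s` above (the names `s_rep` and `flat` are only illustrative):

```python
import pandas as pd

# Small stand-in for the Series `s` built above
data = {1: ["'stop'", "'avoid'", "'stifle'"], 2: ["'hold'"]}
s = pd.Series(data)

# Repeat the first word of each cluster once per member, keeping a list
s_rep = s.apply(lambda words: [words[0]] * len(words))

# Flatten the per-cluster lists into one list
flat = [word for words in s_rep for word in words]
print(flat)
```

With this variant no string splitting is needed afterwards, since `flat` is already a single list.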

Now pull out each element in each row and combine into one list:

slist = []
for valset in s2:
    # split() with no argument ignores the trailing space in each valset
    slist.extend(valset.split())

slist
["'stop'", "'stop'", "'stop'", "'stop'", "'stop'", "'stop'",
 "'stop'", "'stop'", "'hold'", "'burke'", "'hod'"]
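Since the question does not say which entropy formula is meant, the following is only a sketch that assumes Shannon entropy (in bits) over the word frequencies of the flattened list. The helper names `flatten_clusters` and `shannon_entropy` are hypothetical, and plain Python dictionaries stand in for the DataFrame rows:

```python
import math
from collections import Counter


def flatten_clusters(clusters):
    """Replace every word with its cluster's first word and flatten."""
    return [words[0] for words in clusters.values() for _ in words]


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two example rows: the one from the question and a balanced two-cluster one
rows = [
    {1: ["'stop'", "'avoid'", "'stifle'", "'not'", "'squelch'",
         "'contain'", "'cover'", "'suppress'"],
     2: ["'hold'"], 3: ["'burke'"], 4: ["'hod'"]},
    {1: ["'a'", "'b'"], 2: ["'c'", "'d'"]},
]

entropies = [shannon_entropy(flatten_clusters(row)) for row in rows]
print(entropies)
```

If the dictionaries live in a DataFrame column, the same two helpers could be applied row by row with `Series.apply` on that column.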

Upvotes: 1
