Reputation: 1965
I have a question on normalizing the counts in a grouped dataframe.
My data looks like this:
import pandas as pd
data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
{'system': 'S1', 'id': '2', 'output': []},
{'system': 'S1', 'id': '3', 'output': []},
{'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
{'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)
which looks like this in table format:
system id output
0 S1 1 [apple, pear]
1 S1 2 []
2 S1 3 []
3 S2 4 [apple, grape]
4 S2 5 [apple]
How can I get normalized counts per output per system?
It should look like this:
system output perc
S1 apple 0.33
S1 pear 0.33
S2 apple 1.0
S2 grape 0.5
Meaning that apple and pear each appear in a third of all S1 outputs, apple appears in all S2 outputs, and grape appears in half of the S2 outputs.
I tried to explode the outputs per system and get separate counts of IDs per system, but merging them loses the output column:
outputs = df.explode('output').groupby(['system', 'output']).count()
counts = df.groupby('system').agg('count').id
pd.merge(outputs, counts, on="system")
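For what it's worth, the merge-based attempt can be made to work by bringing the group keys back as columns before merging and then dividing; a sketch (the intermediate names outputs, totals, n and total are mine):

```python
import pandas as pd

data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
        {'system': 'S1', 'id': '2', 'output': []},
        {'system': 'S1', 'id': '3', 'output': []},
        {'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
        {'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)

# Count each output per system; reset_index keeps 'output' as a column
outputs = (df.explode('output')
             .groupby(['system', 'output'])
             .size()
             .reset_index(name='n'))

# Total number of ids per system (S1: 3, S2: 2)
totals = df.groupby('system')['id'].count().rename('total')

# Merge on the 'system' column, then normalize
merged = outputs.merge(totals, on='system')
merged['perc'] = merged['n'] / merged['total']
result = merged[['system', 'output', 'perc']]
```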
Upvotes: 1
Views: 237
Reputation: 30930
IIUC, use SeriesGroupBy.value_counts() with Series.value_counts() and Series.map():
new_df = (df.explode('output').groupby('system')['output'].value_counts()
            .reset_index(name='perc')
            .assign(perc=lambda x: x['perc']
                    .div(x['system'].map(df['system'].value_counts()))))
print(new_df)
system output perc
0 S1 apple 0.333333
1 S1 pear 0.333333
2 S2 apple 1.000000
3 S2 grape 0.500000
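The same normalization can also be sketched with pd.crosstab, dividing the per-system output counts by the per-system row totals (the intermediate names ct, totals and long_df are mine):

```python
import pandas as pd

data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
        {'system': 'S1', 'id': '2', 'output': []},
        {'system': 'S1', 'id': '3', 'output': []},
        {'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
        {'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)

e = df.explode('output')

# Wide table of counts: rows are systems, columns are outputs
# (crosstab drops the NaN rows produced by exploding empty lists)
ct = pd.crosstab(e['system'], e['output'])

# Divide each row by the number of ids in that system
totals = df['system'].value_counts()
perc = ct.div(totals, axis=0)

# Back to long format, dropping outputs that never occur in a system
long_df = perc.stack().reset_index(name='perc')
long_df = long_df[long_df['perc'] > 0].reset_index(drop=True)
```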
Timings for the sample dataframe:
%%timeit
new_df = (df.explode('output').groupby('system')['output'].value_counts()
            .reset_index(name='perc')
            .assign(perc=lambda x: x['perc']
                    .div(x['system'].map(df['system'].value_counts()))))
9.19 ms ± 64.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
(df.explode('output')
.groupby('system')
.apply(lambda x:x['output'].value_counts()/x['id'].nunique())
.reset_index()
)
12.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1
Reputation: 150785
For Pandas 0.25+, we can use explode:
(df.explode('output')
.groupby('system')
.apply(lambda x:x['output'].value_counts()/x['id'].nunique())
.reset_index()
)
Output:
system level_1 output
0 S1 pear 0.333333
1 S1 apple 0.333333
2 S2 apple 1.000000
3 S2 grape 0.500000
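If the default level_1 / output column names are undesirable, the Series can be named explicitly before resetting the index; a sketch (the exact default names vary across pandas versions, which is why both the values name and the index level names are set by hand here):

```python
import pandas as pd

data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
        {'system': 'S1', 'id': '2', 'output': []},
        {'system': 'S1', 'id': '3', 'output': []},
        {'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
        {'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)

res = (df.explode('output')
         .groupby('system')
         .apply(lambda x: x['output'].value_counts() / x['id'].nunique())
         .rename('perc')                       # name the values column
         .rename_axis(['system', 'output'])    # name both index levels
         .reset_index())
```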
Upvotes: 1