Reputation: 1965
I have a question on normalizing the counts in a grouped dataframe.
My data looks like this:
import pandas as pd
data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
{'system': 'S1', 'id': '2', 'output': []},
{'system': 'S1', 'id': '3', 'output': []},
{'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
{'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)
which looks like this in table format:
system id output
0 S1 1 [apple, pear]
1 S1 2 []
2 S1 3 []
3 S2 4 [apple, grape]
4 S2 5 [apple]
How can I get normalized counts per output per system?
It should look like this:
system output perc
S1 apple 0.33
S1 pear 0.33
S2 apple 1.0
S2 grape 0.5
Meaning that apple and pear each appear in a third of all S1 outputs, apple appears in all S2 outputs, and grape appears in half of the S2 outputs.
I tried to explode the outputs per system and get separate counts of IDs per system, but merging them loses the output column:
outputs = df.explode('output').groupby(['system', 'output']).count()
counts = df.groupby('system').agg('count').id
pd.merge(outputs, counts, on="system")
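For what it's worth, the merge-based attempt can be made to work by bringing the group keys back as columns before merging and then dividing; a sketch (the intermediate names outputs, totals, n and total are mine):

```python
import pandas as pd

data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
        {'system': 'S1', 'id': '2', 'output': []},
        {'system': 'S1', 'id': '3', 'output': []},
        {'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
        {'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)

# Count each output per system; reset_index keeps 'output' as a column
outputs = (df.explode('output')
             .groupby(['system', 'output'])
             .size()
             .reset_index(name='n'))

# Total number of ids per system (S1: 3, S2: 2)
totals = df.groupby('system')['id'].count().rename('total')

# Merge on the 'system' column, then normalize
merged = outputs.merge(totals, on='system')
merged['perc'] = merged['n'] / merged['total']
result = merged[['system', 'output', 'perc']]
```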
Upvotes: 1
Views: 237
Reputation: 30930
IIUC, use SeriesGroupBy.value_counts() with Series.value_counts() and Series.map():
new_df = (df.explode('output').groupby('system')['output'].value_counts()
            .reset_index(name='perc')
            .assign(perc=lambda x: x['perc']
                    .div(x['system'].map(df['system'].value_counts()))))
print(new_df)
system output perc
0 S1 apple 0.333333
1 S1 pear 0.333333
2 S2 apple 1.000000
3 S2 grape 0.500000
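The same normalization can also be sketched with pd.crosstab, dividing the per-system output counts by the per-system row totals (the intermediate names ct, totals and long_df are mine):

```python
import pandas as pd

data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
        {'system': 'S1', 'id': '2', 'output': []},
        {'system': 'S1', 'id': '3', 'output': []},
        {'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
        {'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)

e = df.explode('output')

# Wide table of counts: rows are systems, columns are outputs
# (crosstab drops the NaN rows produced by exploding empty lists)
ct = pd.crosstab(e['system'], e['output'])

# Divide each row by the number of ids in that system
totals = df['system'].value_counts()
perc = ct.div(totals, axis=0)

# Back to long format, dropping outputs that never occur in a system
long_df = perc.stack().reset_index(name='perc')
long_df = long_df[long_df['perc'] > 0].reset_index(drop=True)
```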
Timings for the sample dataframe:
%%timeit
new_df = (df.explode('output').groupby('system')['output'].value_counts()
            .reset_index(name='perc')
            .assign(perc=lambda x: x['perc']
                    .div(x['system'].map(df['system'].value_counts()))))
9.19 ms ± 64.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
(df.explode('output')
.groupby('system')
.apply(lambda x:x['output'].value_counts()/x['id'].nunique())
.reset_index()
)
12.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1
Reputation: 150785
For Pandas 0.25+, we can use explode:
(df.explode('output')
.groupby('system')
.apply(lambda x:x['output'].value_counts()/x['id'].nunique())
.reset_index()
)
Output:
system level_1 output
0 S1 pear 0.333333
1 S1 apple 0.333333
2 S2 apple 1.000000
3 S2 grape 0.500000
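If the default level_1 / output column names are undesirable, the Series can be named explicitly before resetting the index; a sketch (the exact default names vary across pandas versions, which is why both the values name and the index level names are set by hand here):

```python
import pandas as pd

data = [{'system': 'S1', 'id': '1', 'output': ['apple', 'pear']},
        {'system': 'S1', 'id': '2', 'output': []},
        {'system': 'S1', 'id': '3', 'output': []},
        {'system': 'S2', 'id': '4', 'output': ['apple', 'grape']},
        {'system': 'S2', 'id': '5', 'output': ['apple']}]
df = pd.DataFrame(data)

res = (df.explode('output')
         .groupby('system')
         .apply(lambda x: x['output'].value_counts() / x['id'].nunique())
         .rename('perc')                       # name the values column
         .rename_axis(['system', 'output'])    # name both index levels
         .reset_index())
```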
Upvotes: 1