Reputation: 8209
Let's say I have data of the format
option,subcase,prop1,prop2,prop3,...
given as a .csv
I now want to create statistics for each option
and additional statistics for each subcase
on its own.
If I just wanted to print it all and weren't interested in confidence intervals, it would probably look somewhat like this:
import numpy as np
import pandas as pd
import sys
df = pd.read_csv(sys.argv[1]) # note to self: argv[0] is script file content
options = df.option.unique()
option_data = {}
subcases = df.subcase.unique()
data = {}
for o in options:
option_data[o] = df[df.option.apply(lambda row: o in row)]
print(o)
print(pd.DataFrame.describe(option_data[o]))
for s in subcases:
label = o + '_' + s
data[label] = option_data[o][option_data[o].subcase.apply(lambda row: s in row)]
print(label)
print(pd.DataFrame.describe(data[label]))
print()
This is, however, terribly hard to read.
How do I best combine the data frames s.t. I end up with frames like
prop1 mean std min 25% ...
A
A_a
A_b
A_c
B
B_a
B_c
...
prop2 mean std min 25% ...
A
A_a
A_b
A_c
B
B_a
B_c
...
I mean, I could manually loop through all frames ... but there has to be something more efficient.
edit
E.g.
option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4
should produce two frames:
cost
,mean,std,min,25%,50%,75%,max
A,2.5,1.290994,1,1.75,2.5,3.25,4
A_sub1,3,1.414214,2,2.5,3,3.5,4
A_sub2,2,1.414214,1,1.5,2,2.5,3
B,8,4.760952,3,4.5,8,11.5,13
B_sub1,12,1.414214,11,11.5,12,12.5,13
B_sub2,4,1.414214,3,3.5,4,4.5,5
and
time
,mean,std,min,25%,50%,75%,max
A,3.25,3.40343,0,1.5,2.5,4.25,8
A_sub1,1.5,2.12132,0,0.75,1.5,2.25,3
A_sub2,5,4.242641,2,3.5,5,6.5,8
B,1.5,1.914854,0,0,1,2.5,4
B_sub1,0,0,0,0,0,0,0
B_sub2,3,1.414214,2,2.5,3,3.5,4
Where the entries for the A
and B
rows are calculated based on all their corresponding subcases' entries.
Upvotes: 2
Views: 464
Reputation: 210942
In [79]: df.groupby(['option','subcase']).describe()
Out[79]:
cost time
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.5 3.0 3.5 4.0 2.0 1.5 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 2.0 2.0 1.414214 1.0 1.5 2.0 2.5 3.0 2.0 5.0 4.242641 2.0 3.50 5.0 6.50 8.0
B sub1 2.0 12.0 1.414214 11.0 11.5 12.0 12.5 13.0 2.0 0.0 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 2.0 4.0 1.414214 3.0 3.5 4.0 4.5 5.0 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0
UPDATE:
In [97]: r = df.groupby(['option','subcase']).describe()
In [100]: t = df.groupby('option').describe().set_index(np.array([''] * df['option'].nunique()), append=True)
In [101]: r.append(t).sort_index()
Out[101]:
cost time
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
option subcase
A 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
B 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
Upvotes: 2
Reputation: 323366
By using pd.concat
df1=df.groupby('option').cost.describe()
df2=df.groupby(['option','subcase']).cost.describe()
df2.index=df2.index.map('_'.join)
pd.concat([df1,df2]).sort_index()
Out[256]:
count mean std min 25% 50% 75% max
A 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0
A_sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0
A_sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0
B 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0
B_sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0
B_sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0
Upvotes: 2