User1291
User1291

Reputation: 8209

transposing and combining panda dataframes

Let's say I have data of the format

option,subcase,prop1,prop2,prop3,...

given as a .csv

I now want to create statistics for each option and additional statistics for each subcase on its own.

If I just wanted to print it all and weren't interested in confidence intervals, it would probably look somewhat like this:

import numpy as np
import pandas as pd
import sys

df = pd.read_csv(sys.argv[1]) # note to self: argv[0] is script file content

options = df.option.unique()
option_data = {}

subcases = df.subcase.unique()
data = {}

for o in options:

    option_data[o] = df[df.option.apply(lambda row: o in row)]
    print(o)
    print(pd.DataFrame.describe(option_data[o]))

    for s in subcases:
        label = o + '_' + s
        data[label] = option_data[o][option_data[o].subcase.apply(lambda row: s in row)]        
        print(label)
        print(pd.DataFrame.describe(data[label]))

    print()

This is, however, terribly hard to read.

How do I best combine the data frames s.t. I end up with frames like

prop1    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...

prop2    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...

I mean, I could manually loop through all frames ... but there has to be something more efficient.

edit

E.g.

option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4

should produce two frames:

cost

,mean,std,min,25%,50%,75%,max
A,2.5,1.290994,1,1.75,2.5,3.25,4
A_sub1,3,1.414214,2,2.5,3,3.5,4
A_sub2,2,1.414214,1,1.5,2,2.5,3
B,8,4.760952,3,4.5,8,11.5,13
B_sub1,12,1.414214,11,11.5,12,12.5,13
B_sub2,4,1.414214,3,3.5,4,4.5,5

and

time

,mean,std,min,25%,50%,75%,max
A,3.25,3.40343,0,1.5,2.5,4.25,8
A_sub1,1.5,2.12132,0,0.75,1.5,2.25,3
A_sub2,5,4.242641,2,3.5,5,6.5,8
B,1.5,1.914854,0,0,1,2.5,4
B_sub1,0,0,0,0,0,0,0
B_sub2,3,1.414214,2,2.5,3,3.5,4 

Where the entries for the A and B rows are calculated based on all their corresponding subcases' entries.

Upvotes: 2

Views: 464

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210942

In [79]: df.groupby(['option','subcase']).describe()
Out[79]:
                cost                                                time
               count  mean       std   min   25%   50%   75%   max count mean       std  min   25%  50%   75%  max
option subcase
A      sub1      2.0   3.0  1.414214   2.0   2.5   3.0   3.5   4.0   2.0  1.5  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.5   2.0   2.5   3.0   2.0  5.0  4.242641  2.0  3.50  5.0  6.50  8.0
B      sub1      2.0  12.0  1.414214  11.0  11.5  12.0  12.5  13.0   2.0  0.0  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.5   4.0   4.5   5.0   2.0  3.0  1.414214  2.0  2.50  3.0  3.50  4.0

UPDATE:

In [97]: r = df.groupby(['option','subcase']).describe()

In [100]: t = df.groupby('option').describe().set_index(np.array([''] * df['option'].nunique()), append=True)

In [101]: r.append(t).sort_index()
Out[101]:
                cost                                                  time
               count  mean       std   min    25%   50%    75%   max count  mean       std  min   25%  50%   75%  max
option subcase
A                4.0   2.5  1.290994   1.0   1.75   2.5   3.25   4.0   4.0  3.25  3.403430  0.0  1.50  2.5  4.25  8.0
       sub1      2.0   3.0  1.414214   2.0   2.50   3.0   3.50   4.0   2.0  1.50  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.50   2.0   2.50   3.0   2.0  5.00  4.242641  2.0  3.50  5.0  6.50  8.0
B                4.0   8.0  4.760952   3.0   4.50   8.0  11.50  13.0   4.0  1.50  1.914854  0.0  0.00  1.0  2.50  4.0
       sub1      2.0  12.0  1.414214  11.0  11.50  12.0  12.50  13.0   2.0  0.00  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.50   4.0   4.50   5.0   2.0  3.00  1.414214  2.0  2.50  3.0  3.50  4.0

Upvotes: 2

BENY
BENY

Reputation: 323366

By using pd.concat

df1=df.groupby('option').cost.describe()
df2=df.groupby(['option','subcase']).cost.describe()

df2.index=df2.index.map('_'.join)
pd.concat([df1,df2]).sort_index()


Out[256]: 
        count  mean       std   min    25%   50%    75%   max
A         4.0   2.5  1.290994   1.0   1.75   2.5   3.25   4.0
A_sub1    2.0   3.0  1.414214   2.0   2.50   3.0   3.50   4.0
A_sub2    2.0   2.0  1.414214   1.0   1.50   2.0   2.50   3.0
B         4.0   8.0  4.760952   3.0   4.50   8.0  11.50  13.0
B_sub1    2.0  12.0  1.414214  11.0  11.50  12.0  12.50  13.0
B_sub2    2.0   4.0  1.414214   3.0   3.50   4.0   4.50   5.0

Upvotes: 2

Related Questions