transposing and combining panda dataframes

Question

Let's say I have data of the format

option,subcase,prop1,prop2,prop3,...

given as a .csv

I now want to create statistics for each option and additional statistics for each subcase on its own.

If I just wanted to print it all and weren't interested in confidence intervals, it would probably look somewhat like this:

import numpy as np
import pandas as pd
import sys

df = pd.read_csv(sys.argv[1]) # note to self: argv[0] is script file content

options = df.option.unique()
option_data = {}

subcases = df.subcase.unique()
data = {}

for o in options:

    option_data[o] = df[df.option.apply(lambda row: o in row)]
    print(o)
    print(pd.DataFrame.describe(option_data[o]))

    for s in subcases:
        label = o + '_' + s
        data[label] = option_data[o][option_data[o].subcase.apply(lambda row: s in row)]        
        print(label)
        print(pd.DataFrame.describe(data[label]))

    print()

This is, however, terribly hard to read.

How do I best combine the data frames s.t. I end up with frames like

prop1    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...

prop2    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...

I mean, I could manually loop through all frames ... but there has to be something more efficient.

edit

E.g.

option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4

should produce two frames:

cost

,mean,std,min,25%,50%,75%,max
A,2.5,1.290994,1,1.75,2.5,3.25,4
A_sub1,3,1.414214,2,2.5,3,3.5,4
A_sub2,2,1.414214,1,1.5,2,2.5,3
B,8,4.760952,3,4.5,8,11.5,13
B_sub1,12,1.414214,11,11.5,12,12.5,13
B_sub2,4,1.414214,3,3.5,4,4.5,5

and

time

,mean,std,min,25%,50%,75%,max
A,3.25,3.40343,0,1.5,2.5,4.25,8
A_sub1,1.5,2.12132,0,0.75,1.5,2.25,3
A_sub2,5,4.242641,2,3.5,5,6.5,8
B,1.5,1.914854,0,0,1,2.5,4
B_sub1,0,0,0,0,0,0,0
B_sub2,3,1.414214,2,2.5,3,3.5,4

Where the entries for the A and B rows are calculated based on all their corresponding subcases' entries.

MaxU - stand with Ukraine · Accepted Answer

In [79]: df.groupby(['option','subcase']).describe()
Out[79]:
                cost                                                time
               count  mean       std   min   25%   50%   75%   max count mean       std  min   25%  50%   75%  max
option subcase
A      sub1      2.0   3.0  1.414214   2.0   2.5   3.0   3.5   4.0   2.0  1.5  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.5   2.0   2.5   3.0   2.0  5.0  4.242641  2.0  3.50  5.0  6.50  8.0
B      sub1      2.0  12.0  1.414214  11.0  11.5  12.0  12.5  13.0   2.0  0.0  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.5   4.0   4.5   5.0   2.0  3.0  1.414214  2.0  2.50  3.0  3.50  4.0

UPDATE:

In [97]: r = df.groupby(['option','subcase']).describe()

In [100]: t = df.groupby('option').describe().set_index(np.array([''] * df['option'].nunique()), append=True)

In [101]: r.append(t).sort_index()
Out[101]:
                cost                                                  time
               count  mean       std   min    25%   50%    75%   max count  mean       std  min   25%  50%   75%  max
option subcase
A                4.0   2.5  1.290994   1.0   1.75   2.5   3.25   4.0   4.0  3.25  3.403430  0.0  1.50  2.5  4.25  8.0
       sub1      2.0   3.0  1.414214   2.0   2.50   3.0   3.50   4.0   2.0  1.50  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.50   2.0   2.50   3.0   2.0  5.00  4.242641  2.0  3.50  5.0  6.50  8.0
B                4.0   8.0  4.760952   3.0   4.50   8.0  11.50  13.0   4.0  1.50  1.914854  0.0  0.00  1.0  2.50  4.0
       sub1      2.0  12.0  1.414214  11.0  11.50  12.0  12.50  13.0   2.0  0.00  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.50   4.0   4.50   5.0   2.0  3.00  1.414214  2.0  2.50  3.0  3.50  4.0

transposing and combining panda dataframes

Answers (2)

Related Questions