ThePyGuy
ThePyGuy

Reputation: 18406

Pandas stack multiple columns to a single column

I have following DataFrame:

                    ETHNIC                       RACE        AGE       TRT01A
0   NOT HISPANIC OR LATINO                      WHITE  31.824778  Treatment B
1   NOT HISPANIC OR LATINO                      WHITE  31.381246      Placebo
2       HISPANIC OR LATINO                      WHITE  45.522245  Treatment A
3       HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN  42.910335  Treatment B
4   NOT HISPANIC OR LATINO                      WHITE  31.381246      Placebo
5   NOT HISPANIC OR LATINO                      WHITE  38.045175  Treatment B
6       HISPANIC OR LATINO                      WHITE  39.337440      Placebo
7   NOT HISPANIC OR LATINO                      WHITE  47.121150      Placebo
8   NOT HISPANIC OR LATINO                      WHITE  38.203970  Treatment A
9   NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN  22.926762      Placebo
10      HISPANIC OR LATINO                      WHITE  45.226557  Treatment B
11      HISPANIC OR LATINO                      WHITE  32.112252      Placebo

Just copy above dataframe to clipboard and run df=pd.read_clipboard('\s\s+') to get the dataframe into a variable.

out = (df.groupby(['TRT01A','ETHNIC', 'RACE'])['AGE']
       .agg(mean=np.mean, 
            n='count', 
            deviation=np.std,
            Q1=lambda x: np.percentile(x, 0.25)
            )
       .T.unstack().unstack(0)
       )

I performed some aggregates in the above dataframe, and transposed, and successively unstacked them to get the following result:

TRT01A                                                        Placebo  Treatment A  Treatment B
ETHNIC                 RACE                                                                    
HISPANIC OR LATINO     BLACK OR AFRICAN AMERICAN mean             NaN          NaN    42.910335
                                                 n                NaN          NaN     1.000000
                                                 deviation        NaN          NaN          NaN
                                                 Q1               NaN          NaN    42.910335
                       WHITE                     mean       35.724846    45.522245    45.226557
                                                 n           2.000000     1.000000     1.000000
                                                 deviation   5.108979          NaN          NaN
                                                 Q1         32.130315    45.522245    45.226557
NOT HISPANIC OR LATINO BLACK OR AFRICAN AMERICAN mean       22.926762          NaN          NaN
                                                 n           1.000000          NaN          NaN
                                                 deviation        NaN          NaN          NaN
                                                 Q1         22.926762          NaN          NaN
                       WHITE                     mean       36.627881    38.203970    34.934976
                                                 n           3.000000     1.000000     2.000000
                                                 deviation   9.087438          NaN     4.398485
                                                 Q1         31.381246    38.203970    31.840329

Now, I want to unstack all the indices to get the following structure (i.e. inserting NaN rows for all the index columns from first to second last, alongwith Level column denoting the level of the index):

                             Placebo  Treatment A  Treatment B  Level
HISPANIC OR LATINO               NaN          NaN          NaN      0 <---
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1 <---
mean                             NaN          NaN    42.910335      2
n                                NaN          NaN     1.000000      2
deviation                        NaN          NaN          NaN      2
Q1                               NaN          NaN    42.910335      2
WHITE                            NaN          NaN          NaN      1 <---
mean                       35.724846    45.522245    45.226557      2
n                           2.000000     1.000000     1.000000      2
deviation                   5.108979          NaN          NaN      2
Q1                         32.130315    45.522245    45.226557      2
NOT HISPANIC OR LATINO           NaN          NaN          NaN      0 <---
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1 <---
mean                       22.926762          NaN          NaN      2
n                           1.000000          NaN          NaN      2
deviation                        NaN          NaN          NaN      2
Q1                         22.926762          NaN          NaN      2
WHITE                            NaN          NaN          NaN      1 <---
mean                       36.627881    38.203970    34.934976      2
n                           3.000000     1.000000     2.000000      2
deviation                   9.087438          NaN     4.398485      2
Q1                         31.381246    38.203970    31.840329      2   

This question is identical to the previous question that I asked, but the problem is, there can be from 1 to 4 indices columns after aggregating, (i.e. aggregate may be applied on from 1 to 5 columns), and it's being difficult to use the same previous solution in this scenario.

Upvotes: 2

Views: 216

Answers (1)

jezrael
jezrael

Reputation: 862571

Use custom function with DataFrame.append first with custom DataFrame filled by default NaN values:

def f(x):
    names = pd.DataFrame(index=x.name, columns=x.columns).assign(Level=[0,1])
    #print (names)
    return names.append(x.reset_index(level=[0,1], drop=True).assign(Level=2))

out = out.groupby(level=[0,1], group_keys=False).apply(f)

And then remove duplicated 0 Levels:

out = out[~out.index.duplicated() | out['Level'].isin([1,2])]

print (out)
TRT01A                       Placebo  Treatment A  Treatment B  Level
HISPANIC OR LATINO               NaN          NaN          NaN      0
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1
mean                             NaN          NaN    42.910335      2
n                                NaN          NaN     1.000000      2
deviation                        NaN          NaN          NaN      2
Q1                               NaN          NaN    42.910335      2
WHITE                            NaN          NaN          NaN      1
mean                       35.724846    45.522245    45.226557      2
n                           2.000000     1.000000     1.000000      2
deviation                   5.108979          NaN          NaN      2
Q1                         32.130315    45.522245    45.226557      2
NOT HISPANIC OR LATINO           NaN          NaN          NaN      0
BLACK OR AFRICAN AMERICAN        NaN          NaN          NaN      1
mean                       22.926762          NaN          NaN      2
n                           1.000000          NaN          NaN      2
deviation                        NaN          NaN          NaN      2
Q1                         22.926762          NaN          NaN      2
WHITE                            NaN          NaN          NaN      1
mean                       36.627881    38.203970    34.934976      2
n                           3.000000     1.000000     2.000000      2
deviation                   9.087438          NaN     4.398485      2
Q1                         31.381246    38.203970    31.840329      2
    

Upvotes: 2

Related Questions