Ger
Ger

Reputation: 9756

How to add a row for each subindex in pandas multiindex dataframe?

Suppose I have the following dataframe :

import pandas as pd
df = pd.DataFrame(
    {
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        'sales': [pd.np.random.randint(100000, 999999) for _ in range(12)]
    }
)

Here it is :

    office_id   sales state
0           1  903325    CA
1           2  364594    WA
2           3  737728    CO
3           4  239378    AZ
4           5  833003    CA
5           6  501536    WA
6           1  920821    CO
7           2  879602    AZ
8           3  661818    CA
9           4  548888    WA
10          5  842459    CO
11          6  906791    AZ

Now I do a groupby operation on office_id and states :

df.groupby(["office_id", "state"]).aggregate({"sales": "sum"})

This lead to :

                  sales
office_id state
1         CA     903325
          CO     920821
2         AZ     879602
          WA     364594
3         CA     661818
          CO     737728
4         AZ     239378
          WA     548888
5         CA     833003
          CO     842459
6         AZ     906791
          WA     501536

Is it possible to add a row, for each office_id, with a new index total for example which is the sum over each state of the sales column ?

I can compute it by grouping by "office_id" and sum but I obtain a new DataFrame and I do not succeed in merging it.

Upvotes: 3

Views: 2318

Answers (3)

jezrael
jezrael

Reputation: 862791

You can reshape by Series.unstack, add new column total and then reshape back by DataFrame.stack, if need MultiIndex use Series.to_frame:

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
print (df1)
                     sales
office_id state           
1         CA      505047.0
          CO      724412.0
          total  1229459.0
2         AZ      402775.0
          WA      339803.0
          total   742578.0
3         CA      343655.0
          CO      833474.0
          total  1177129.0
4         AZ      574130.0
          WA      656577.0
          total  1230707.0
5         CA      122260.0
          CO      207717.0
          total   329977.0
6         AZ      262568.0
          WA      504491.0
          total   767059.0

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
#cast if sales are always integers
df1.sales = df1.sales.astype(int)
print (df1)
                   sales
office_id state         
1         CA      323107
          CO      658336
          total   981443
2         AZ      273728
          WA      942249
          total  1215977
3         CA      773390
          CO      692275
          total  1465665
4         AZ      669435
          WA      735141
          total  1404576
5         CA      530182
          CO      232104
          total   762286
6         AZ      532248
          WA      951481
          total  1483729

Timings:

def jez(df):
    df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
    df1['total'] = df1.sum(axis=1)
    df1 = df1.stack().to_frame('sales')
    df1.sales = df1.sales
    return (df1)

print (jez(df))

In [339]: %timeit (df.pivot_table(index='office_id', columns='state', margins=True, margins_name='total', aggfunc='sum').stack())
100 loops, best of 3: 14.6 ms per loop

In [340]: %timeit (jez(df))
100 loops, best of 3: 2.78 ms per loop

Upvotes: 2

lmo
lmo

Reputation: 38510

you can also use concat to append the aggregated totals as follows.

pd.concat([df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}),
           df.groupby(["state"]).aggregate({"sales": "sum"})
            .set_index([['Total', 'Total', 'Total', 'Total']], append=True).swaplevel(0, 1)])

which returns

                   sales
office_id state         
1         CA      914776
          CO      902173
2         AZ      605783
          WA      865189
3         CA      280203
          CO      556867
4         AZ      958747
          WA      643333
5         CA      703606
          CO      644399
6         AZ      768268
          WA      834051
Total     AZ     2332798
          CA     1898585
          CO     2103439
          WA     2342573

Here, the Data.frame is aggregated at the office-state and state levels. These are concatenated with .concat. The DataFrame aggregated to the state level must be given an additional index prior to concatnating. This is done with set_index. In addition, the indices must be swapped to conform with the office-state level DataFrame.

Upvotes: 0

Ted Petrou
Ted Petrou

Reputation: 61967

Pandas has built-in functionality to do this with pivot_table by setting the margins parameter to True. And it only sorts correctly because 'total' is lowercase and uppercase comes first.

df.pivot_table(index='office_id', columns='state', margins=True,
               margins_name='total', aggfunc='sum').stack()

                     sales
office_id state           
1         CA      415727.0
          CO      240142.0
          total   655869.0
2         AZ      126350.0
          WA      385698.0
          total   512048.0
3         CA      387320.0
          CO      487075.0
          total   874395.0
4         AZ      978018.0
          WA      878368.0
          total  1856386.0
5         CA      105057.0
          CO      852025.0
          total   957082.0
6         AZ      130853.0
          WA      435940.0
          total   566793.0
total     AZ     1235221.0
          CA      908104.0
          CO     1579242.0
          WA     1700006.0
          total  5422573.0

Upvotes: 3

Related Questions