Reputation: 9756
Suppose I have the following DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        'sales': [np.random.randint(100000, 999999) for _ in range(12)]
    }
)
Here it is:
office_id sales state
0 1 903325 CA
1 2 364594 WA
2 3 737728 CO
3 4 239378 AZ
4 5 833003 CA
5 6 501536 WA
6 1 920821 CO
7 2 879602 AZ
8 3 661818 CA
9 4 548888 WA
10 5 842459 CO
11 6 906791 AZ
Now I do a groupby operation on office_id and state:
df.groupby(["office_id", "state"]).aggregate({"sales": "sum"})
This leads to:
sales
office_id state
1 CA 903325
CO 920821
2 AZ 879602
WA 364594
3 CA 661818
CO 737728
4 AZ 239378
WA 548888
5 CA 833003
CO 842459
6 AZ 906791
WA 501536
Is it possible to add a row for each office_id, with a new index such as total, that holds the sum of the sales column over that office's states?
I can compute it by grouping by "office_id" and summing, but that gives me a new DataFrame and I have not managed to merge it back.
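A minimal sketch of that partial attempt, assuming the per-office grouping described above (the variable name office_totals is just for illustration):
office_totals = df.groupby("office_id")["sales"].sum()
# office_totals is a standalone Series indexed by office_id only,
# so it does not line up with the (office_id, state) MultiIndex above
print(office_totals)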
Upvotes: 3
Views: 2318
Reputation: 862791
You can reshape with Series.unstack, add a new total column, and then reshape back with DataFrame.stack; if you need a MultiIndex DataFrame rather than a Series, use Series.to_frame:
df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
print (df1)
sales
office_id state
1 CA 505047.0
CO 724412.0
total 1229459.0
2 AZ 402775.0
WA 339803.0
total 742578.0
3 CA 343655.0
CO 833474.0
total 1177129.0
4 AZ 574130.0
WA 656577.0
total 1230707.0
5 CA 122260.0
CO 207717.0
total 329977.0
6 AZ 262568.0
WA 504491.0
total 767059.0
df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
#cast if sales are always integers
df1.sales = df1.sales.astype(int)
print (df1)
sales
office_id state
1 CA 323107
CO 658336
total 981443
2 AZ 273728
WA 942249
total 1215977
3 CA 773390
CO 692275
total 1465665
4 AZ 669435
WA 735141
total 1404576
5 CA 530182
CO 232104
total 762286
6 AZ 532248
WA 951481
total 1483729
Timings:
def jez(df):
    df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
    df1['total'] = df1.sum(axis=1)
    df1 = df1.stack().to_frame('sales')
    return df1
print (jez(df))
In [339]: %timeit (df.pivot_table(index='office_id', columns='state', margins=True, margins_name='total', aggfunc='sum').stack())
100 loops, best of 3: 14.6 ms per loop
In [340]: %timeit (jez(df))
100 loops, best of 3: 2.78 ms per loop
Upvotes: 2
Reputation: 38510
You can also use concat to append the aggregated totals, as follows.
pd.concat([df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}),
           df.groupby(["state"]).aggregate({"sales": "sum"})
             .set_index([['Total', 'Total', 'Total', 'Total']], append=True).swaplevel(0, 1)])
which returns
sales
office_id state
1 CA 914776
CO 902173
2 AZ 605783
WA 865189
3 CA 280203
CO 556867
4 AZ 958747
WA 643333
5 CA 703606
CO 644399
6 AZ 768268
WA 834051
Total AZ 2332798
CA 1898585
CO 2103439
WA 2342573
Here, the DataFrame is aggregated at the office-state and state levels, and the two results are concatenated with pd.concat. The DataFrame aggregated to the state level must be given an additional index level prior to concatenating; this is done with set_index. In addition, the index levels must be swapped to conform to the office-state level DataFrame.
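An alternative sketch of the same idea (not the code above): the extra 'Total' level can also be added with pd.concat and its keys argument, which avoids hard-coding the list of 'Total' labels; office_state and state_totals are illustrative names.
office_state = df.groupby(["office_id", "state"]).aggregate({"sales": "sum"})
state_totals = df.groupby(["state"]).aggregate({"sales": "sum"})
# keys=["Total"] prepends a 'Total' level above the existing state index,
# giving the same (Total, state) shape as set_index + swaplevel
pd.concat([office_state, pd.concat([state_totals], keys=["Total"])])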
Upvotes: 0
Reputation: 61967
Pandas has built-in functionality to do this with pivot_table by setting the margins parameter to True. Note that the result only sorts correctly because 'total' is lowercase and uppercase state codes sort first.
df.pivot_table(index='office_id', columns='state', margins=True,
margins_name='total', aggfunc='sum').stack()
sales
office_id state
1 CA 415727.0
CO 240142.0
total 655869.0
2 AZ 126350.0
WA 385698.0
total 512048.0
3 CA 387320.0
CO 487075.0
total 874395.0
4 AZ 978018.0
WA 878368.0
total 1856386.0
5 CA 105057.0
CO 852025.0
total 957082.0
6 AZ 130853.0
WA 435940.0
total 566793.0
total AZ 1235221.0
CA 908104.0
CO 1579242.0
WA 1700006.0
total 5422573.0
Upvotes: 3