Reputation: 1732
I would like to compute the total sum on each multi-index sublevel. And then, save it in the dataframe.
My current dataframe looks like:
values
first second
bar one 0.106521
two 1.964873
baz one 1.289683
two -0.696361
foo one -0.309505
two 2.890406
qux one -0.758369
two 1.302628
And the needed result is:
values
first second
bar one 0.106521
two 1.964873
total 2.071394
baz one 1.289683
two -0.696361
total 0.593322
foo one -0.309505
two 2.890406
total 2.580901
qux one -0.758369
two 1.302628
total 0.544259
total one 0.328331
two 5.461546
total 5.789877
Currently I found the folowing implementation that works. But I would like to know if there are better options. I need the fastest solution possible, because in some cases when my dataframes become huge, the computation time seems to take ages.
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: index = MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [4]: s = Series(randn(8), index=index)
In [5]: d = {'values': s}
In [6]: df = DataFrame(d)
In [7]: for col in df.index.names:
.....: df = df.unstack(col)
.....: df[('values', 'total')] = df.sum(axis=1)
.....: df = df.stack()
.....:
Upvotes: 7
Views: 2985
Reputation: 13387
I know it's an old topic, but - I couldn't find any satisfactory solution for roll up in pandas, whereas I can actually see some value in it.
#to retain original index:
index_cols=df.index.names
df2=pd.DataFrame()
#we iterate over each sub index, except the last one - to get sub-sums
for i in range(-1,len(df.index[0])-1):
if i>=0:
df2=df2.append(df.sum(level=list(range(i+1))).reset_index(), ignore_index=True)
else: #-1 will handle the total sum
df2=df2.append(df.sum(), ignore_index=True)
#to mask the last index, for which the sub-sum was not calculated:
df2[index_cols[-1]]=np.nan
#might be done better- you can keep it as "nan" (you would comment out the below line then), which will force it to the last position in index, after sorting, or put some special character in front
df2[index_cols]=df2[index_cols].fillna("_total")
df=df.reset_index().append(df2, sort=True).set_index(index_cols).sort_values(index_cols, ascending=False)
For my sample data:
values
first second
qux two -4.0
one 2.0
_total -2.0
foo two -3.0
one 4.0
_total 1.0
baz two 5.0
one -1.0
_total 4.0
bar two -1.0
one 2.0
_total 1.0
_total _total 4.0
Upvotes: 0
Reputation: 1267
Not sure if you are still looking for an answer to this - you could try something like this, assuming your current dataframe is assigned to df
:
temp = df.pivot(index='first', columns='second', values='values')
temp['total'] = temp['one'] + temp['two']
temp.stack()
Upvotes: 1
Reputation: 54340
Quite ugly code:
In [162]:
print df
values
first second
bar one 0.370291
two 0.750565
baz one 0.148405
two 0.919973
foo one 0.121964
two 0.394017
qux one 0.883136
two 0.871792
In [163]:
print pd.concat((df.reset_index(),
df.reset_index().groupby('first').aggregate('sum').reset_index())).\
sort(['first','second']).\
fillna('total').\
set_index(['first','second'])
values
first second
bar one 0.370291
two 0.750565
total 1.120856
baz one 0.148405
two 0.919973
total 1.068378
foo one 0.121964
two 0.394017
total 0.515981
qux one 0.883136
two 0.871792
total 1.754927
Basically, since the additional rows, 'total', need to be calculated and inserted into the original dataframe, it is not going to be a one-to-one relationship between the original and resultant, neither the relationship is a many-to-one type. So, I think you have to generate the 'total' dataframe separately and concat
it with the original dataframe.
Upvotes: 0