Iulian Stana
Iulian Stana

Reputation: 1732

Pandas, Computing total sum on each MultiIndex sublevel

I would like to compute the total sum on each multi-index sublevel. And then, save it in the dataframe.

My current dataframe looks like:

                    values
    first second
    bar   one     0.106521
          two     1.964873
    baz   one     1.289683
          two    -0.696361
    foo   one    -0.309505
          two     2.890406
    qux   one    -0.758369
          two     1.302628

And the needed result is:

                    values
    first second
    bar   one     0.106521
          two     1.964873
          total   2.071394
    baz   one     1.289683
          two    -0.696361
          total   0.593322
    foo   one    -0.309505
          two     2.890406
          total   2.580901
    qux   one    -0.758369
          two     1.302628
          total   0.544259
    total one     0.328331
          two     5.461546
          total   5.789877

Currently I found the folowing implementation that works. But I would like to know if there are better options. I need the fastest solution possible, because in some cases when my dataframes become huge, the computation time seems to take ages.

In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: index = MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [4]: s = Series(randn(8), index=index)

In [5]: d = {'values': s}

In [6]: df = DataFrame(d)

In [7]: for col in df.index.names:
   .....:     df = df.unstack(col)
   .....:     df[('values', 'total')] = df.sum(axis=1)
   .....:     df = df.stack()
   .....:

Upvotes: 7

Views: 2985

Answers (3)

Georgina Skibinski
Georgina Skibinski

Reputation: 13387

I know it's an old topic, but - I couldn't find any satisfactory solution for roll up in pandas, whereas I can actually see some value in it.

#to retain original index:
index_cols=df.index.names

df2=pd.DataFrame()
#we iterate over each sub index, except the last one - to get sub-sums
for i in range(-1,len(df.index[0])-1):
    if i>=0:
        df2=df2.append(df.sum(level=list(range(i+1))).reset_index(), ignore_index=True)
    else: #-1 will handle the total sum
        df2=df2.append(df.sum(), ignore_index=True)
#to mask the last index, for which the sub-sum was not calculated:
df2[index_cols[-1]]=np.nan

#might be done better- you can keep it as "nan" (you would comment out the below line then), which will force it to the last position in index, after sorting, or put some special character in front
df2[index_cols]=df2[index_cols].fillna("_total")

df=df.reset_index().append(df2, sort=True).set_index(index_cols).sort_values(index_cols, ascending=False)

For my sample data:

               values
first  second
qux    two       -4.0
       one        2.0
       _total    -2.0
foo    two       -3.0
       one        4.0
       _total     1.0
baz    two        5.0
       one       -1.0
       _total     4.0
bar    two       -1.0
       one        2.0
       _total     1.0
_total _total     4.0

Upvotes: 0

Sajan
Sajan

Reputation: 1267

Not sure if you are still looking for an answer to this - you could try something like this, assuming your current dataframe is assigned to df :

temp = df.pivot(index='first', columns='second', values='values')
temp['total'] = temp['one'] + temp['two']
temp.stack()

Upvotes: 1

CT Zhu
CT Zhu

Reputation: 54340

Quite ugly code:

In [162]:

print df
                values
first second          
bar   one     0.370291
      two     0.750565
baz   one     0.148405
      two     0.919973
foo   one     0.121964
      two     0.394017
qux   one     0.883136
      two     0.871792
In [163]:

print pd.concat((df.reset_index(),
                 df.reset_index().groupby('first').aggregate('sum').reset_index())).\
                      sort(['first','second']).\
                      fillna('total').\
                      set_index(['first','second'])
                values
first second          
bar   one     0.370291
      two     0.750565
      total   1.120856
baz   one     0.148405
      two     0.919973
      total   1.068378
foo   one     0.121964
      two     0.394017
      total   0.515981
qux   one     0.883136
      two     0.871792
      total   1.754927

Basically, since the additional rows, 'total', need to be calculated and inserted into the original dataframe, it is not going to be a one-to-one relationship between the original and resultant, neither the relationship is a many-to-one type. So, I think you have to generate the 'total' dataframe separately and concat it with the original dataframe.

Upvotes: 0

Related Questions