Blue Moon
Blue Moon

Reputation: 4651

Keep all indexes in multilevel pandas groupby

I have the following data frame:

df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], list('AAABBBBABCBDDD'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], ['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3','col4']

When I group it I get:

df.groupby(['col4','col2']).sum()

           col1  col3
col4 col2            
1    A      1.1   1.1
     D      5.5  14.2
2    B      6.0   6.4
3    A      4.8   8.4
     C      3.4   4.5
4    B      2.5   3.3
5    B      5.2   8.6
6    B      3.4   4.3

However what I would like to have is the same second level index for each first level index. This is not possible because of missing data. What I would like to achieve is:

           col1  col3
col4 col2            
1    A      1.1   1.1
     B      0     0
     C      0     0
     D      5.5  14.2
2    A      0     0
     B      6.0   6.4
     C      0     0
     D      0     0
3    A      4.8   8.4
     B      0     0
     C      3.4   4.5
     D      0     0
4    A      0     0
     B      2.5   3.3
     C      0     0
     D      0     0
5    A      0     0
     B      5.2   8.6
     C      0     0
     D      0     0
6    A      0     0
     B      3.4   4.3
     C      0     0
     D      0     0

Upvotes: 0

Views: 1210

Answers (2)

Alex Riley
Alex Riley

Reputation: 176820

To account for the missing indices, you'll probably need to reindex the DataFrame returned by df.groupby(['col4','col2']).sum():

>>> m = pd.MultiIndex.from_product([df.col4.unique(), df.col2.unique()])
>>> df.groupby(['col4','col2']).sum().reindex(m, fill_value=0)

     col1  col3
1 A   1.1   1.1
  B   0.0   0.0
  C   0.0   0.0
  D   5.5  14.2
3 A   4.8   8.4
  B   0.0   0.0
  C   3.4   4.5
  D   0.0   0.0
2 A   0.0   0.0
  B   6.0   6.4
  C   0.0   0.0
  D   0.0   0.0
4 A   0.0   0.0
  B   2.5   3.3
  C   0.0   0.0
  D   0.0   0.0
5 A   0.0   0.0
  B   5.2   8.6
  C   0.0   0.0
  D   0.0   0.0
6 A   0.0   0.0
  B   3.4   4.3
  C   0.0   0.0
  D   0.0   0.0

Note that unique() returns the values in the order they appear in the column (e.g. '3' appears before '2' here). You could correct this by using .sortlevel(0) on the reindexed DataFrame.

Upvotes: 2

chrisb
chrisb

Reputation: 52246

You could create a new MultiIndex from your data, and reindex, like this.

In [6]: idx = pd.MultiIndex.from_product([df.col4.unique(), df.col2.unique()])

In [9]: (df.groupby(['col4','col2']).sum()
           .reindex(idx).fillna(0))
Out[9]: 
     col1  col3
1 A   1.1   1.1
  B   0.0   0.0
  C   0.0   0.0
  D   5.5  14.2
3 A   4.8   8.4
  B   0.0   0.0
  C   3.4   4.5
  D   0.0   0.0
2 A   0.0   0.0
  B   6.0   6.4
  C   0.0   0.0
  D   0.0   0.0
4 A   0.0   0.0
  B   2.5   3.3
  C   0.0   0.0
  D   0.0   0.0
5 A   0.0   0.0
  B   5.2   8.6
  C   0.0   0.0
  D   0.0   0.0
6 A   0.0   0.0
  B   3.4   4.3
  C   0.0   0.0
  D   0.0   0.0

Upvotes: 3

Related Questions