Filling missing rows in groups after groupby

Question

I've got some SQL data that I'm grouping and performing some aggregation on. It works nicely:

grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
    c: [numpy.sum, numpy.mean, numpy.size],
    d: [numpy.sum, numpy.mean, numpy.size]
})

and

         c                         d 
         sum      mean   size      sum          mean size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 23  126.0  0.792453  159.0  8795000  55314.465409  159

but I want to fill all of the rows that are in a=25 but not in a=26 with zeros. In other words, something like:

         c                         d 
         sum      mean   size      sum          mean size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 20      0         0      0        0             0    0
   21      0         0      0        0             0    0
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24      0         0      0        0             0    0
   25      0         0      0        0             0    0

How can I do this?

piRSquared · Accepted Answer

Consider the dataframe df

df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
        names=['a', 'b']
    ),
    pd.MultiIndex.from_product(
        [['c', 'd'], ['sum', 'mean', 'size']]
    )
)

        c             d          
      sum mean size sum mean size
a  b                             
25 20   8    3    5   5    0    2
   21   3    7    8   9    2    7
   23   2    1    3   2    5    4
   24   9    0    1   7    1    6
   25   1    9    3   5    8    8
26 23   8    8    4   8    0    5

You can quickly recover all missing rows from the cartesian product with unstack(fill_value=0) followed by stack

df.unstack(fill_value=0).stack()

         c             d         
      mean size sum mean size sum
a  b                             
25 20    3    5   8    0    2   5
   21    7    8   3    2    7   9
   23    1    3   2    5    4   2
   24    0    1   9    1    6   7
   25    9    3   1    8    8   5
26 20    0    0   0    0    0   0
   21    0    0   0    0    0   0
   23    8    4   8    0    5   8
   24    0    0   0    0    0   0
   25    0    0   0    0    0   0

Note: Using fill_value=0 preserves the dtype int. Without it, when unstacked, the gaps get filled with NaN and dtypes get converted to float

Filling missing rows in groups after groupby

Answers (2)

Related Questions