Timmy O'Mahony
Timmy O'Mahony

Reputation: 53998

Filling missing rows in groups after groupby

I've got some SQL data that I'm grouping and performing some aggregation on. It works nicely:

grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
    c: [numpy.sum, numpy.mean, numpy.size],
    d: [numpy.sum, numpy.mean, numpy.size]
})

and

         c                         d 
         sum      mean   size      sum          mean size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 23  126.0  0.792453  159.0  8795000  55314.465409  159

but I want to fill all of the rows that are in a=25 but not in a=26 with zeros. In other words, something like:

         c                         d 
         sum      mean   size      sum          mean size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 20      0         0      0        0             0    0
   21      0         0      0        0             0    0
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24      0         0      0        0             0    0
   25      0         0      0        0             0    0

How can I do this?

Upvotes: 2

Views: 951

Answers (2)

piRSquared
piRSquared

Reputation: 294516

Consider the dataframe df

df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
        names=['a', 'b']
    ),
    pd.MultiIndex.from_product(
        [['c', 'd'], ['sum', 'mean', 'size']]
    )
)

        c             d          
      sum mean size sum mean size
a  b                             
25 20   8    3    5   5    0    2
   21   3    7    8   9    2    7
   23   2    1    3   2    5    4
   24   9    0    1   7    1    6
   25   1    9    3   5    8    8
26 23   8    8    4   8    0    5

You can quickly recover all missing rows from the cartesian product with unstack(fill_value=0) followed by stack

df.unstack(fill_value=0).stack()

         c             d         
      mean size sum mean size sum
a  b                             
25 20    3    5   8    0    2   5
   21    7    8   3    2    7   9
   23    1    3   2    5    4   2
   24    0    1   9    1    6   7
   25    9    3   1    8    8   5
26 20    0    0   0    0    0   0
   21    0    0   0    0    0   0
   23    8    4   8    0    5   8
   24    0    0   0    0    0   0
   25    0    0   0    0    0   0

Note: Using fill_value=0 preserves the dtype int. Without it, when unstacked, the gaps get filled with NaN and dtypes get converted to float

Upvotes: 3

Scott Boston
Scott Boston

Reputation: 153510

print(df)

           c                         d                   
         sum      mean   size      sum          mean size
a  b                                                     
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 23  126.0  0.792453  159.0  8795000  55314.465409  159

I like:

 df =  df.unstack().replace(np.nan,0).stack(-1)
 print(df)
                  c                           d                  
               mean   size    sum          mean   size        sum
    a  b                                                         
    25 20  0.804511  133.0  107.0  40060.150376  133.0  5328000.0
       21  0.774648  142.0  110.0  42471.830986  142.0  6031000.0
       23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
       24  0.947368   76.0   72.0  38421.052632   76.0  2920000.0
       25  0.818182   66.0   54.0  38939.393939   66.0  2570000.0
    26 20  0.000000    0.0    0.0      0.000000    0.0        0.0
       21  0.000000    0.0    0.0      0.000000    0.0        0.0
       23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
       24  0.000000    0.0    0.0      0.000000    0.0        0.0
       25  0.000000    0.0    0.0      0.000000    0.0        0.0

Upvotes: 1

Related Questions