when using groupby with multiple columns, how to NOT drop an unobserved combination of columns

Question

Suppose I have a pandas data frame like the one below:

df = pandas.DataFrame([[1,'A', 'First'],
                       [3,'A', 'Second'], 
                       [2,'A', 'Third'],
                       [5,'B', 'Second'], 
                       [4,'B', 'Third']])

Then, if I do the following:

df.groupby([1,2]).count()

the combination of B and First will be dropped because it is not in the original data frame.

I wonder if there is a way to NOT drop this combination and instead give it a count of 0

DSM · Accepted Answer

One way would be to create a MultiIndex which contains them all and then use that to index into your counts:

>>> g = df.groupby([1,2]).count()
>>> g.loc[pd.MultiIndex.from_product(g.index.levels)]
           0
A First    1
  Second   1
  Third    1
B First  NaN
  Second   1
  Third    1
>>> g.loc[pd.MultiIndex.from_product(g.index.levels)].fillna(0)
          0
A First   1
  Second  1
  Third   1
B First   0
  Second  1
  Third   1

This works because the new MultiIndex contains the cartesian product of the levels:

>>> g.index
MultiIndex(levels=[['A', 'B'], ['First', 'Second', 'Third']],
           labels=[[0, 0, 0, 1, 1], [0, 1, 2, 1, 2]],
           names=[1, 2])
>>> pd.MultiIndex.from_product(g.index.levels)
MultiIndex(levels=[['A', 'B'], ['First', 'Second', 'Third']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

when using groupby with multiple columns, how to NOT drop an unobserved combination of columns

Answers (1)

Related Questions