Reputation: 8458
Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
Example pandas DataFrame has three columns, User
, Code
, and Subtotal
:
import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal'])
I'd like to group on User
and Code
and get a subtotal for each combination of User
and Code
.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
How can I include the missing combination User=='c'
and Code==2
in the table, even though it doesn't exist in example_df
?
Preferred output
Below is the preferred output, with a zero line for the User=='c'
and Code==2
combination.
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Upvotes: 10
Views: 4005
Reputation: 635
One can achieve this in a simple way by using pandas's categorical type, which will trigger the groupby
to include all possible values of the categorical column.
import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal']
# make `User` col into category type
example_df["User"] = example_df["User"].astype("category")
# now groupby will include all `User` value by default
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
gives
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
as desired.
Reference to behaviour in docs:
Groupby will also show “unused” categories:
Note that this behaviour can be disabled by passing observed=True
to DataFrame.groupby
ref.
Upvotes: 2
Reputation: 863301
You can use unstack
with stack
:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Another solution with reindex
by MultiIndex
created from_product
:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print (mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['User', 'Code'])
print (df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Upvotes: 14