Reputation: 53998
I've got some SQL data that I'm grouping and performing some aggregation on. It works nicely:
grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
c: [numpy.sum, numpy.mean, numpy.size],
d: [numpy.sum, numpy.mean, numpy.size]
})
and
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
but I want to fill all of the rows that are in a=25
but not in a=26
with zeros. In other words, something like:
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 0 0 0 0 0 0
25 0 0 0 0 0 0
How can I do this?
Upvotes: 2
Views: 951
Reputation: 294516
Consider the dataframe df
df = pd.DataFrame(
np.random.randint(10, size=(6, 6)),
pd.MultiIndex.from_tuples(
[(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
names=['a', 'b']
),
pd.MultiIndex.from_product(
[['c', 'd'], ['sum', 'mean', 'size']]
)
)
c d
sum mean size sum mean size
a b
25 20 8 3 5 5 0 2
21 3 7 8 9 2 7
23 2 1 3 2 5 4
24 9 0 1 7 1 6
25 1 9 3 5 8 8
26 23 8 8 4 8 0 5
You can quickly recover all missing rows from the cartesian product with unstack(fill_value=0)
followed by stack
df.unstack(fill_value=0).stack()
c d
mean size sum mean size sum
a b
25 20 3 5 8 0 2 5
21 7 8 3 2 7 9
23 1 3 2 5 4 2
24 0 1 9 1 6 7
25 9 3 1 8 8 5
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 8 4 8 0 5 8
24 0 0 0 0 0 0
25 0 0 0 0 0 0
Note: Using fill_value=0
preserves the dtype
int
. Without it, when unstacked, the gaps get filled with NaN
and dtypes
get converted to float
Upvotes: 3
Reputation: 153510
print(df)
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
I like:
df = df.unstack().replace(np.nan,0).stack(-1)
print(df)
c d
mean size sum mean size sum
a b
25 20 0.804511 133.0 107.0 40060.150376 133.0 5328000.0
21 0.774648 142.0 110.0 42471.830986 142.0 6031000.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.947368 76.0 72.0 38421.052632 76.0 2920000.0
25 0.818182 66.0 54.0 38939.393939 66.0 2570000.0
26 20 0.000000 0.0 0.0 0.000000 0.0 0.0
21 0.000000 0.0 0.0 0.000000 0.0 0.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.000000 0.0 0.0 0.000000 0.0 0.0
25 0.000000 0.0 0.0 0.000000 0.0 0.0
Upvotes: 1