Reputation: 13051
I'm starting with this dataframe:
df = pd.DataFrame(
[
["a", "aa", "2020-12-20", 10],
["a", "ab", "2020-12-26", 11],
["a", "aa", "2020-12-22", 10],
["b", "bb", "2020-12-25", 111],
["c", "bb", "2020-12-20", 20],
["d", "dd", "2020-12-05", 1111]
],
columns=["cat", "user", "date", "value"]
)
df["date"] = pd.to_datetime(df.date)
cat | user | date | value | |
---|---|---|---|---|
0 | a | aa | 2020-12-20 00:00:00 | 10 |
1 | a | ab | 2020-12-26 00:00:00 | 11 |
2 | a | aa | 2020-12-22 00:00:00 | 10 |
3 | b | bb | 2020-12-25 00:00:00 | 111 |
4 | c | bb | 2020-12-20 00:00:00 | 20 |
5 | d | dd | 2020-12-05 00:00:00 | 1111 |
Next, I'm running the following aggregation:
gb = (
df.set_index("date")
.groupby("cat")
.resample("W")
.agg(
{"value": "sum", "user": ["nunique", lambda x: x.unique()]}
)
.rename({"<lambda>": "unqiue_users"}, axis=1)
)
This yields a table with multiindex in the columns:
value user
sum nunique unqiue_users
cat date
a 2020-12-20 10 1 aa
2020-12-27 21 2 [aa, ab]
b 2020-12-27 111 1 bb
c 2020-12-20 20 1 bb
d 2020-12-06 1111 1 dd
Lastly, I'm trying to run aggregations on the last result like:
gb.groupby(level=0)[["value", "sum"]].mean()
I don't know how to "access" the columns that have multiindex. Any idea?
Upvotes: 5
Views: 60
Reputation: 862921
For select MultiIndex and used tuples, here is used one element list:
print (gb.groupby(level=0)[[("value", "sum")]].mean())
value
sum
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Or you can use simplify solution with mean
per level:
print (gb[[("value", "sum")]].mean(level=0))
value
sum
cat
a 15.5
b 111.0
c 20.0
d 1111.0
For Series
select omit nested list:
print (gb[("value", "sum")].mean(level=0))
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Name: (value, sum), dtype: float64
Your solution should be changed for avoid MultiIndex
in columns:
gb = (
df.set_index("date")
.groupby(["cat", pd.Grouper(freq='W')])
.agg(val = ("value", "sum"),
nuniq = ("user", "nunique"),
unqiue_users = ("user", lambda x: x.unique()))
)
print (gb)
val nuniq unqiue_users
cat date
a 2020-12-20 10 1 aa
2020-12-27 21 2 [ab, aa]
b 2020-12-27 111 1 bb
c 2020-12-20 20 1 bb
d 2020-12-06 1111 1 dd
print (gb['val'].mean(level=0))
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Name: val, dtype: float64
Upvotes: 6