Dror
Dror

Reputation: 13051

Preform aggregation(s) on multiindex columns

I'm starting with this dataframe:

df = pd.DataFrame(
    [
        ["a", "aa", "2020-12-20", 10],
        ["a", "ab", "2020-12-26", 11],
        ["a", "aa", "2020-12-22", 10],
        ["b", "bb", "2020-12-25", 111],
        ["c", "bb", "2020-12-20", 20],
        ["d", "dd", "2020-12-05", 1111]
    ],
    columns=["cat", "user", "date", "value"]
)
df["date"] = pd.to_datetime(df.date)
cat user date value
0 a aa 2020-12-20 00:00:00 10
1 a ab 2020-12-26 00:00:00 11
2 a aa 2020-12-22 00:00:00 10
3 b bb 2020-12-25 00:00:00 111
4 c bb 2020-12-20 00:00:00 20
5 d dd 2020-12-05 00:00:00 1111

Next, I'm running the following aggregation:

gb = (
    df.set_index("date")
    .groupby("cat")
    .resample("W")
    .agg(
        {"value": "sum", "user": ["nunique", lambda x: x.unique()]}
    )
    .rename({"<lambda>": "unqiue_users"}, axis=1)
)

This yields a table with multiindex in the columns:

               value    user             
                 sum nunique unqiue_users
cat date                                 
a   2020-12-20    10       1           aa
    2020-12-27    21       2     [aa, ab]
b   2020-12-27   111       1           bb
c   2020-12-20    20       1           bb
d   2020-12-06  1111       1           dd

Lastly, I'm trying to run aggregations on the last result like:

gb.groupby(level=0)[["value", "sum"]].mean()

I don't know how to "access" the columns that have multiindex. Any idea?

Upvotes: 5

Views: 60

Answers (1)

jezrael
jezrael

Reputation: 862921

For select MultiIndex and used tuples, here is used one element list:

print (gb.groupby(level=0)[[("value", "sum")]].mean())
      value
        sum
cat        
a      15.5
b     111.0
c      20.0
d    1111.0

Or you can use simplify solution with mean per level:

print (gb[[("value", "sum")]].mean(level=0))
      value
        sum
cat        
a      15.5
b     111.0
c      20.0
d    1111.0

For Series select omit nested list:

print (gb[("value", "sum")].mean(level=0))
cat
a      15.5
b     111.0
c      20.0
d    1111.0
Name: (value, sum), dtype: float64

Your solution should be changed for avoid MultiIndex in columns:

gb = (
    df.set_index("date")
    .groupby(["cat", pd.Grouper(freq='W')])
    .agg(val = ("value",  "sum"),
         nuniq = ("user", "nunique"),
         unqiue_users = ("user", lambda x: x.unique()))
    )
    
print (gb)
                 val  nuniq unqiue_users
cat date                                
a   2020-12-20    10      1           aa
    2020-12-27    21      2     [ab, aa]
b   2020-12-27   111      1           bb
c   2020-12-20    20      1           bb
d   2020-12-06  1111      1           dd


print (gb['val'].mean(level=0))
cat
a      15.5
b     111.0
c      20.0
d    1111.0
Name: val, dtype: float64

Upvotes: 6

Related Questions