Allende
Allende

Reputation: 1482

Set MultiIndex when aggregating using pandas GroupBy

Some excercise is asking to "add a secondary name" to a df column when applying an aggregation function.

Given following DF:

stype = ['s1','s1','s1','s1',
    's2','s2','s2','s2']
dtype = ['d1','d1','d2','d2',
        'd1','d1','d2','d2']
qty = [2, 1.4, 7, 3, 
        4, 1, 2.0, 3]
df = pd.DataFrame({'s_type':stype,
                    'd_type':dtype,
                    'qty':qty})

When grouping by the first 2 columns and applying an agg function like this

new_df=df.groupby(['s_type','d_type'],sort =  False).agg({'qty':'median'})

And do print(new_df)

I get this output, notice the qty is moved up a line (I assume this is normal behavior) because is the one column where agg function was applied

               qty
s_type d_type     
s1     d1      1.7
       d2      5.0
s2     d1      2.5
       d2      2.5

Question(s ) would it be

Is it possible to setc secondary column name under qty label/name ?, to get an output like this:

               qty
s_type d_type  Median   
s1     d1      1.7
       d2      5.0
s2     d1      2.5
       d2      2.5

Also Why qty moves up 1 line ? (I'm using Spyder and I've seen this in the Anaconda/python console as well, not sure if its like a feature or pandas normal behavior).

One Last update: in case you need to use sort_values you will have to use the tuple for the column you add a secondary index/name, pretty much in the same way when applying the aggregate function (see the answer)

new_df=new_df.sort_values([('qty','MEDIAN')])

So you get the DF sorted by the column where agg function was applied and where we add an index

                 qty
              MEDIAN
s_type d_type       
s1     d1        1.7
s2     d1        2.5
       d2        2.5
s1     d2        5.0

python v3.6

pandas v0.23

Thank you.

Upvotes: 2

Views: 63

Answers (1)

cs95
cs95

Reputation: 402253

Super simple. Pass "median" as a list inside agg:

result = df.groupby(['s_type','d_type'], sort=False).agg({'qty': ['median']})
result
                 qty
              median
s_type d_type       
s1     d1        1.7
       d2        5.0
s2     d1        2.5
       d2        2.5

As for "why qty moves up 1 line", that means the first two columns are actually the index in the output. Querying result.index will make it obvious:

result.index

MultiIndex(levels=[['s1', 's2'], ['d1', 'd2']],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['s_type', 'd_type'])

You can instead specify as_index=False when calling groupby,

df.groupby(['s_type','d_type'], sort=False, as_index=False).agg({'qty': ['median']})

  s_type d_type    qty
                median
0     s1     d1    1.7
1     s1     d2    5.0
2     s2     d1    2.5
3     s2     d2    2.5

So the groupers remain columns in the output.


You can change the name in the output by passing a list of tuples to agg:

df.groupby(['s_type','d_type'], sort=False).agg(
    {'qty': [('MEDIAN', 'median')]}
)

                 qty
              MEDIAN
s_type d_type       
s1     d1        1.7
       d2        5.0
s2     d1        2.5
       d2        2.5

Upvotes: 2

Related Questions