How to translate a nested groupby apply aggregation from pandas in to polars?

Question

Im transitioning from pandas, so please excuse my non-parallelized brain. Suppose we have following pandas code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    val: np.random.randint(1,5,100) for val in ['a','b','c','d','x','y','z']
})    

df.groupby('a').apply(lambda df:
    df.sort_values('c')
      .groupby('d')
      [['x','y','z']]
      .agg(['max','mean','median'])
)

Output (with smooshed multiindex to paste it here):

a	d	0	1	2	3	4	5	6	7	8
		x	x	x	y	y	y	z	z	z
		sum	mean	median	sum	mean	median	sum	mean	median
1	1	15.0	3.75	4.0	12.0	3.0	3.5	12.0	3.0	3.0
1	2	9.0	3.0	3.0	5.0	1.666667	1.0	9.0	3.0	4.0
1	3	33.0	3.0	3.0	30.0	2.727273	3.0	27.0	2.454545	2.0
1	4	23.0	2.8750	3.0	16.0	2.0	2.0	15.0	1.8750	1.0
2	1	18.0	2.571429	2.0	13.0	1.857143	2.0	18.0	2.571429	3.0
2	2	18.0	2.0	1.0	23.0	2.555556	2.0	25.0	2.777778	3.0
2	3	11.0	3.666667	4.0	9.0	3.0	3.0	9.0	3.0	4.0
2	4	3.0	1.50	1.50	6.0	3.0	3.0	4.0	2.0	2.0
3	1	28.0	2.80	3.0	21.0	2.10	2.0	29.0	2.90	3.0
3	2	13.0	2.166667	2.0	19.0	3.166667	3.0	18.0	3.0	3.0
3	3	16.0	1.777778	2.0	22.0	2.444444	3.0	32.0	3.555556	4.0
3	4	20.0	2.222222	2.0	23.0	2.555556	2.0	23.0	2.555556	3.0
4	1	9.0	2.250	2.0	10.0	2.50	2.50	5.0	1.250	1.0
4	2	19.0	3.166667	3.0	8.0	1.333333	1.0	22.0	3.666667	4.0
4	3	10.0	2.0	1.0	14.0	2.80	3.0	15.0	3.0	3.0
4	4	9.0	2.250	2.0	12.0	3.0	3.0	10.0	2.50	2.50

How to rewrite it in polars?

The core idea of the exercise is that in apply i can do something with the whole dataframe group, e.g. sort it and then aggregate (which doesnt make sense, i know, but the idea is freedom to do whatever). Do i lose this freedom if i want my code to be parallelizable or is there a way to catch the whole group? I tried pl.all() but couldnt figure out the trick to at least sort each sub-df

Pluriscient · Accepted Answer

It's unclear why you have a groupby inside apply but the same output is generated from:

df.groupby(['a', 'd'])[['x', 'y', 'z']].agg(['max', 'mean', 'median'])

In which case it is a simple translation.

Polars does not have a MultiIndex (or Index) so we add a suffix to the column names instead.

df = pl.from_pandas(df) # convert your example to Polars

df.group_by("a", "d").agg(
    pl.col("x", "y", "z").max().name.suffix("_max"),
    pl.col("x", "y", "z").mean().name.suffix("_mean"),
    pl.col("x", "y", "z").median().name.suffix("_median")
)

shape: (16, 11)
┌─────┬─────┬───────┬───────┬───┬──────────┬──────────┬──────────┬──────────┐
│ a   ┆ b   ┆ x_max ┆ y_max ┆ … ┆ z_mean   ┆ x_median ┆ y_median ┆ z_median │
│ --- ┆ --- ┆ ---   ┆ ---   ┆   ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i64 ┆ i64 ┆ i64   ┆ i64   ┆   ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
╞═════╪═════╪═══════╪═══════╪═══╪══════════╪══════════╪══════════╪══════════╡
│ 3   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 2.571429 ┆ 3.0      ┆ 2.0      ┆ 3.0      │
│ 4   ┆ 1   ┆ 4     ┆ 4     ┆ … ┆ 2.625    ┆ 3.5      ┆ 3.0      ┆ 3.0      │
│ 2   ┆ 4   ┆ 4     ┆ 4     ┆ … ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 4.0      │
│ 1   ┆ 4   ┆ 3     ┆ 3     ┆ … ┆ 1.6      ┆ 2.0      ┆ 3.0      ┆ 2.0      │
│ 3   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.6      ┆ 3.0      ┆ 2.0      ┆ 3.0      │
│ …   ┆ …   ┆ …     ┆ …     ┆ … ┆ …        ┆ …        ┆ …        ┆ …        │
│ 1   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 2.875    ┆ 2.0      ┆ 2.0      ┆ 3.0      │
│ 1   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.0      ┆ 4.0      ┆ 2.0      ┆ 2.0      │
│ 1   ┆ 1   ┆ 4     ┆ 4     ┆ … ┆ 2.0      ┆ 3.0      ┆ 3.0      ┆ 1.0      │
│ 4   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.2      ┆ 1.0      ┆ 3.0      ┆ 2.0      │
│ 4   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 3.0      ┆ 1.5      ┆ 2.0      ┆ 3.0      │
└─────┴─────┴───────┴───────┴───┴──────────┴──────────┴──────────┴──────────┘

How to translate a nested groupby apply aggregation from pandas in to polars?

Answers (1)

Related Questions