iliya malecki
iliya malecki

Reputation: 145

How to translate a nested groupby apply aggregation from pandas in to polars?

Im transitioning from pandas, so please excuse my non-parallelized brain. Suppose we have following pandas code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    val: np.random.randint(1,5,100) for val in ['a','b','c','d','x','y','z']
})    

df.groupby('a').apply(lambda df:
    df.sort_values('c')
      .groupby('d')
      [['x','y','z']]
      .agg(['max','mean','median'])
)

Output (with smooshed multiindex to paste it here):

a d 0 1 2 3 4 5 6 7 8
x x x y y y z z z
sum mean median sum mean median sum mean median
1 1 15.0 3.75 4.0 12.0 3.0 3.5 12.0 3.0 3.0
1 2 9.0 3.0 3.0 5.0 1.666667 1.0 9.0 3.0 4.0
1 3 33.0 3.0 3.0 30.0 2.727273 3.0 27.0 2.454545 2.0
1 4 23.0 2.8750 3.0 16.0 2.0 2.0 15.0 1.8750 1.0
2 1 18.0 2.571429 2.0 13.0 1.857143 2.0 18.0 2.571429 3.0
2 2 18.0 2.0 1.0 23.0 2.555556 2.0 25.0 2.777778 3.0
2 3 11.0 3.666667 4.0 9.0 3.0 3.0 9.0 3.0 4.0
2 4 3.0 1.50 1.50 6.0 3.0 3.0 4.0 2.0 2.0
3 1 28.0 2.80 3.0 21.0 2.10 2.0 29.0 2.90 3.0
3 2 13.0 2.166667 2.0 19.0 3.166667 3.0 18.0 3.0 3.0
3 3 16.0 1.777778 2.0 22.0 2.444444 3.0 32.0 3.555556 4.0
3 4 20.0 2.222222 2.0 23.0 2.555556 2.0 23.0 2.555556 3.0
4 1 9.0 2.250 2.0 10.0 2.50 2.50 5.0 1.250 1.0
4 2 19.0 3.166667 3.0 8.0 1.333333 1.0 22.0 3.666667 4.0
4 3 10.0 2.0 1.0 14.0 2.80 3.0 15.0 3.0 3.0
4 4 9.0 2.250 2.0 12.0 3.0 3.0 10.0 2.50 2.50

How to rewrite it in polars?

The core idea of the exercise is that in apply i can do something with the whole dataframe group, e.g. sort it and then aggregate (which doesnt make sense, i know, but the idea is freedom to do whatever). Do i lose this freedom if i want my code to be parallelizable or is there a way to catch the whole group? I tried pl.all() but couldnt figure out the trick to at least sort each sub-df

Upvotes: 2

Views: 2785

Answers (1)

Pluriscient
Pluriscient

Reputation: 125

It's unclear why you have a groupby inside apply but the same output is generated from:

df.groupby(['a', 'd'])[['x', 'y', 'z']].agg(['max', 'mean', 'median'])

In which case it is a simple translation.

Polars does not have a MultiIndex (or Index) so we add a suffix to the column names instead.

df = pl.from_pandas(df) # convert your example to Polars
df.group_by("a", "d").agg(
    pl.col("x", "y", "z").max().name.suffix("_max"),
    pl.col("x", "y", "z").mean().name.suffix("_mean"),
    pl.col("x", "y", "z").median().name.suffix("_median")
)
shape: (16, 11)
┌─────┬─────┬───────┬───────┬───┬──────────┬──────────┬──────────┬──────────┐
│ a   ┆ b   ┆ x_max ┆ y_max ┆ … ┆ z_mean   ┆ x_median ┆ y_median ┆ z_median │
│ --- ┆ --- ┆ ---   ┆ ---   ┆   ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i64 ┆ i64 ┆ i64   ┆ i64   ┆   ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
╞═════╪═════╪═══════╪═══════╪═══╪══════════╪══════════╪══════════╪══════════╡
│ 3   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 2.571429 ┆ 3.0      ┆ 2.0      ┆ 3.0      │
│ 4   ┆ 1   ┆ 4     ┆ 4     ┆ … ┆ 2.625    ┆ 3.5      ┆ 3.0      ┆ 3.0      │
│ 2   ┆ 4   ┆ 4     ┆ 4     ┆ … ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 4.0      │
│ 1   ┆ 4   ┆ 3     ┆ 3     ┆ … ┆ 1.6      ┆ 2.0      ┆ 3.0      ┆ 2.0      │
│ 3   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.6      ┆ 3.0      ┆ 2.0      ┆ 3.0      │
│ …   ┆ …   ┆ …     ┆ …     ┆ … ┆ …        ┆ …        ┆ …        ┆ …        │
│ 1   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 2.875    ┆ 2.0      ┆ 2.0      ┆ 3.0      │
│ 1   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.0      ┆ 4.0      ┆ 2.0      ┆ 2.0      │
│ 1   ┆ 1   ┆ 4     ┆ 4     ┆ … ┆ 2.0      ┆ 3.0      ┆ 3.0      ┆ 1.0      │
│ 4   ┆ 3   ┆ 4     ┆ 4     ┆ … ┆ 2.2      ┆ 1.0      ┆ 3.0      ┆ 2.0      │
│ 4   ┆ 2   ┆ 4     ┆ 4     ┆ … ┆ 3.0      ┆ 1.5      ┆ 2.0      ┆ 3.0      │
└─────┴─────┴───────┴───────┴───┴──────────┴──────────┴──────────┴──────────┘

Upvotes: 1

Related Questions