Reputation: 145
Im transitioning from pandas, so please excuse my non-parallelized brain. Suppose we have following pandas code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
val: np.random.randint(1,5,100) for val in ['a','b','c','d','x','y','z']
})
df.groupby('a').apply(lambda df:
df.sort_values('c')
.groupby('d')
[['x','y','z']]
.agg(['max','mean','median'])
)
Output (with smooshed multiindex to paste it here):
a | d | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|---|
x | x | x | y | y | y | z | z | z | ||
sum | mean | median | sum | mean | median | sum | mean | median | ||
1 | 1 | 15.0 | 3.75 | 4.0 | 12.0 | 3.0 | 3.5 | 12.0 | 3.0 | 3.0 |
1 | 2 | 9.0 | 3.0 | 3.0 | 5.0 | 1.666667 | 1.0 | 9.0 | 3.0 | 4.0 |
1 | 3 | 33.0 | 3.0 | 3.0 | 30.0 | 2.727273 | 3.0 | 27.0 | 2.454545 | 2.0 |
1 | 4 | 23.0 | 2.8750 | 3.0 | 16.0 | 2.0 | 2.0 | 15.0 | 1.8750 | 1.0 |
2 | 1 | 18.0 | 2.571429 | 2.0 | 13.0 | 1.857143 | 2.0 | 18.0 | 2.571429 | 3.0 |
2 | 2 | 18.0 | 2.0 | 1.0 | 23.0 | 2.555556 | 2.0 | 25.0 | 2.777778 | 3.0 |
2 | 3 | 11.0 | 3.666667 | 4.0 | 9.0 | 3.0 | 3.0 | 9.0 | 3.0 | 4.0 |
2 | 4 | 3.0 | 1.50 | 1.50 | 6.0 | 3.0 | 3.0 | 4.0 | 2.0 | 2.0 |
3 | 1 | 28.0 | 2.80 | 3.0 | 21.0 | 2.10 | 2.0 | 29.0 | 2.90 | 3.0 |
3 | 2 | 13.0 | 2.166667 | 2.0 | 19.0 | 3.166667 | 3.0 | 18.0 | 3.0 | 3.0 |
3 | 3 | 16.0 | 1.777778 | 2.0 | 22.0 | 2.444444 | 3.0 | 32.0 | 3.555556 | 4.0 |
3 | 4 | 20.0 | 2.222222 | 2.0 | 23.0 | 2.555556 | 2.0 | 23.0 | 2.555556 | 3.0 |
4 | 1 | 9.0 | 2.250 | 2.0 | 10.0 | 2.50 | 2.50 | 5.0 | 1.250 | 1.0 |
4 | 2 | 19.0 | 3.166667 | 3.0 | 8.0 | 1.333333 | 1.0 | 22.0 | 3.666667 | 4.0 |
4 | 3 | 10.0 | 2.0 | 1.0 | 14.0 | 2.80 | 3.0 | 15.0 | 3.0 | 3.0 |
4 | 4 | 9.0 | 2.250 | 2.0 | 12.0 | 3.0 | 3.0 | 10.0 | 2.50 | 2.50 |
How to rewrite it in polars?
The core idea of the exercise is that in apply
i can do something with the whole dataframe group, e.g. sort it and then aggregate (which doesnt make sense, i know, but the idea is freedom to do whatever). Do i lose this freedom if i want my code to be parallelizable or is there a way to catch the whole group? I tried pl.all()
but couldnt figure out the trick to at least sort each sub-df
Upvotes: 2
Views: 2785
Reputation: 125
It's unclear why you have a groupby
inside apply
but the same output is generated from:
df.groupby(['a', 'd'])[['x', 'y', 'z']].agg(['max', 'mean', 'median'])
In which case it is a simple translation.
Polars does not have a MultiIndex (or Index) so we add a suffix to the column names instead.
df = pl.from_pandas(df) # convert your example to Polars
df.group_by("a", "d").agg(
pl.col("x", "y", "z").max().name.suffix("_max"),
pl.col("x", "y", "z").mean().name.suffix("_mean"),
pl.col("x", "y", "z").median().name.suffix("_median")
)
shape: (16, 11)
┌─────┬─────┬───────┬───────┬───┬──────────┬──────────┬──────────┬──────────┐
│ a ┆ b ┆ x_max ┆ y_max ┆ … ┆ z_mean ┆ x_median ┆ y_median ┆ z_median │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═══════╪═══════╪═══╪══════════╪══════════╪══════════╪══════════╡
│ 3 ┆ 2 ┆ 4 ┆ 4 ┆ … ┆ 2.571429 ┆ 3.0 ┆ 2.0 ┆ 3.0 │
│ 4 ┆ 1 ┆ 4 ┆ 4 ┆ … ┆ 2.625 ┆ 3.5 ┆ 3.0 ┆ 3.0 │
│ 2 ┆ 4 ┆ 4 ┆ 4 ┆ … ┆ 3.0 ┆ 2.0 ┆ 3.0 ┆ 4.0 │
│ 1 ┆ 4 ┆ 3 ┆ 3 ┆ … ┆ 1.6 ┆ 2.0 ┆ 3.0 ┆ 2.0 │
│ 3 ┆ 3 ┆ 4 ┆ 4 ┆ … ┆ 2.6 ┆ 3.0 ┆ 2.0 ┆ 3.0 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 1 ┆ 2 ┆ 4 ┆ 4 ┆ … ┆ 2.875 ┆ 2.0 ┆ 2.0 ┆ 3.0 │
│ 1 ┆ 3 ┆ 4 ┆ 4 ┆ … ┆ 2.0 ┆ 4.0 ┆ 2.0 ┆ 2.0 │
│ 1 ┆ 1 ┆ 4 ┆ 4 ┆ … ┆ 2.0 ┆ 3.0 ┆ 3.0 ┆ 1.0 │
│ 4 ┆ 3 ┆ 4 ┆ 4 ┆ … ┆ 2.2 ┆ 1.0 ┆ 3.0 ┆ 2.0 │
│ 4 ┆ 2 ┆ 4 ┆ 4 ┆ … ┆ 3.0 ┆ 1.5 ┆ 2.0 ┆ 3.0 │
└─────┴─────┴───────┴───────┴───┴──────────┴──────────┴──────────┴──────────┘
Upvotes: 1