Reputation: 87
Many times I find myself in a situation where I have a DataFrame and one column has the type List[int].
For example, I have the following DF:
import polars as pl

df = pl.DataFrame(
    {
        "group": ["A", "A", "B", "B", "B", "B"],
        "value": [[3, 2, 5], [2, 2, 2], [2, 5, 9, 4], [5, 4, 7, 5, 1], [9, 4, 5], [2, 2]],
    }
)
Typically, I'm using the explode and group_by methods in such situations.
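For the example above, that pattern would look something like this:

# flatten the list column into one row per element, then aggregate per group
(
    df.explode('value')
      .group_by('group')
      .agg(pl.col('value').median())
)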
However, when dealing with numerous columns, the code can become somewhat messy.
To address this, I tried using the map_elements method:
(
    df
    .group_by('group')
    .agg(
        # concatenate each group's lists into a single flat Series
        pl.col('value').map_elements(lambda s: pl.concat(s))
    )
    .with_columns(
        # compute the median of each flattened Series
        pl.col('value').map_elements(lambda s: s.median())
    )
)
Unfortunately, this approach sacrifices the parallelization benefits that Polars offers, and its execution is quite resource-costly: with millions of rows, execution time can stretch from seconds to minutes.
Is there a better way to work with List[int]? Is there a good way to optimize my code?
Upvotes: 1
Views: 304
Reputation: 21580
There is an explode expression, which is also available via the .flatten() alias.
(df.group_by('group')
.agg(pl.col('value').flatten().median())
)
shape: (2, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════╪═══════╡
│ B ┆ 4.5 │
│ A ┆ 2.0 │
└───────┴───────┘
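If you have several such columns, the same expression can target them all at once. A minimal sketch, assuming a hypothetical second List column named other:

(
    df.group_by('group')
      .agg(
          # 'other' is a hypothetical second List column in your frame
          pl.col('value', 'other').flatten().median()
      )
)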
Upvotes: 0