Alk90

Reputation: 87

Dealing with columns of type List[int] in python-polars

Many times I find myself in a situation where I have a DataFrame and one column has the type List[int].

For example, I have the following DF:

import polars as pl

df = pl.DataFrame(
    {"group": ["A", "A", "B", "B", "B", "B"],
     "value": [[3, 2, 5], [2, 2, 2], [2, 5, 9, 4], [5, 4, 7, 5, 1], [9, 4, 5], [2, 2]]}
)

Typically, I use the explode and group_by methods in such situations (sketched below).
However, when dealing with numerous columns, the code can become somewhat 'dirty'.
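
For the frame above, that pattern looks like this:

(
    df
    .explode('value')                   # one row per list element
    .group_by('group')
    .agg(pl.col('value').median())     # median of all elements per group
)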

To address this, I tried using the map_elements method instead:

(
    df
    .group_by('group')
    # concatenate each group's lists into a single Series
    .agg(pl.col('value').map_elements(lambda l: pl.concat(l)))
    # then take the median of each concatenated Series
    .with_columns(pl.col('value').map_elements(lambda l: pl.Series.median(l)))
)

Unfortunately, this approach sacrifices the parallelization benefits that Polars offers. Also, its execution is quite resource-costly: with millions of rows, execution time stretches from seconds to minutes.

Is there a better way to work with List[int] columns? Is there a good way to optimize my code?

Upvotes: 1

Views: 304

Answers (1)

jqurious

Reputation: 21580

There is an explode expression, which is also available via the .flatten() alias.

(df.group_by('group')
   .agg(pl.col('value').flatten().median())
)
shape: (2, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ f64   │
╞═══════╪═══════╡
│ B     ┆ 4.5   │
│ A     ┆ 2.0   │
└───────┴───────┘
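
Since the question mentions numerous columns, note that a single expression can be broadcast across several of them at once. A minimal sketch, assuming every column other than "group" is a hypothetical extra List column:

(df.group_by('group')
   .agg(pl.exclude('group').flatten().median())
)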

Upvotes: 0
