Conditional aggregation of rows in polars

Question

I have some data with 2 columns in the following format:

df = pl.from_repr("""
┌──────┬─────────┐
│ nums ┆ letters │
│ ---  ┆ ---     │
│ i64  ┆ str     │
╞══════╪═════════╡
│ 160  ┆ B       │
│ 540  ┆ I       │
│ 550  ┆ I       │
│ 590  ┆ B       │
│ 370  ┆ I       │
└──────┴─────────┘
""")

And I want to merge the cells that come in the form of B in addition to consecutive I's (i.e. BI*):

shape: (2, 1)
┌─────────────────┐
│ nums            │
│ ---             │
│ list[i64]       │
╞═════════════════╡
│ [160, 540, 550] │
│ [590, 370]      │
└─────────────────┘

How can I achieve something like this with polars?

ritchie46 · Accepted Answer

You can search for the consecutive "B" -> "I" pairs, toggle those to true and cum_sum between them.

df = pl.DataFrame({
    "nums": [160, 540, 550, 590, 370],
    "letters": ["B", "I", "I", "B", "I"]
})

(df.group_by(
    ((pl.col("letters") + pl.col("letters").shift(-1)).eq_missing("BI"))
    .cum_sum()
    .alias("cum_sum"),
    maintain_order=True,
).agg("nums"))

This outputs:

shape: (2, 2)
┌─────────┬─────────────────┐
│ cum_sum ┆ nums            │
│ ---     ┆ ---             │
│ u32     ┆ list[i64]       │
╞═════════╪═════════════════╡
│ 1       ┆ [160, 540, 550] │
│ 2       ┆ [590, 370]      │
└─────────┴─────────────────┘

Conditional aggregation of rows in polars

Answers (2)

Related Questions