zareami10
zareami10

Reputation: 301

Conditional aggregation of rows in polars

I have some data with 2 columns in the following format:

df = pl.from_repr("""
┌──────┬─────────┐
│ nums ┆ letters │
│ ---  ┆ ---     │
│ i64  ┆ str     │
╞══════╪═════════╡
│ 160  ┆ B       │
│ 540  ┆ I       │
│ 550  ┆ I       │
│ 590  ┆ B       │
│ 370  ┆ I       │
└──────┴─────────┘
""")

And I want to merge the cells that come in the form of B in addition to consecutive I's (i.e. BI*):

shape: (2, 1)
┌─────────────────┐
│ nums            │
│ ---             │
│ list[i64]       │
╞═════════════════╡
│ [160, 540, 550] │
│ [590, 370]      │
└─────────────────┘

How can I achieve something like this with polars?

Upvotes: 1

Views: 1778

Answers (2)

ritchie46
ritchie46

Reputation: 14710

You can search for the consecutive "B" -> "I" pairs, toggle those to true and cum_sum between them.

df = pl.DataFrame({
    "nums": [160, 540, 550, 590, 370],
    "letters": ["B", "I", "I", "B", "I"]
})

(df.group_by(
    ((pl.col("letters") + pl.col("letters").shift(-1)).eq_missing("BI"))
    .cum_sum()
    .alias("cum_sum"),
    maintain_order=True,
).agg("nums"))

This outputs:

shape: (2, 2)
┌─────────┬─────────────────┐
│ cum_sum ┆ nums            │
│ ---     ┆ ---             │
│ u32     ┆ list[i64]       │
╞═════════╪═════════════════╡
│ 1       ┆ [160, 540, 550] │
│ 2       ┆ [590, 370]      │
└─────────┴─────────────────┘

Upvotes: 2

user11718531
user11718531

Reputation:

If you are happy with the for loop, that's it. I tried also with polars' data manipulation functions but I didn't manage to do it this way.

import polars as pl

df = pl.DataFrame({
    "nums": [160, 540, 550, 590, 370],
    "letters": ["B", "I", "I", "B", "I"]
})

out = []
for d, l in zip(df["nums"], df["letters"]):
    if l == "B":
        out.append([d])
    elif l == "I":
        out[-1].append(d)
    else:
        raise Exception("Unknown letter: " + l)

print(out)

Upvotes: 0

Related Questions