Reputation: 301
I have some data with 2 columns in the following format:
df = pl.from_repr("""
┌──────┬─────────┐
│ nums ┆ letters │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════╪═════════╡
│ 160 ┆ B │
│ 540 ┆ I │
│ 550 ┆ I │
│ 590 ┆ B │
│ 370 ┆ I │
└──────┴─────────┘
""")
And I want to merge the cells that come in the form of B
in addition to consecutive I
's (i.e. BI*
):
shape: (2, 1)
┌─────────────────┐
│ nums │
│ --- │
│ list[i64] │
╞═════════════════╡
│ [160, 540, 550] │
│ [590, 370] │
└─────────────────┘
How can I achieve something like this with polars?
Upvotes: 1
Views: 1778
Reputation: 14710
You can search for the consecutive "B" -> "I"
pairs, toggle those to true
and cum_sum
between them.
df = pl.DataFrame({
"nums": [160, 540, 550, 590, 370],
"letters": ["B", "I", "I", "B", "I"]
})
(df.group_by(
((pl.col("letters") + pl.col("letters").shift(-1)).eq_missing("BI"))
.cum_sum()
.alias("cum_sum"),
maintain_order=True,
).agg("nums"))
This outputs:
shape: (2, 2)
┌─────────┬─────────────────┐
│ cum_sum ┆ nums │
│ --- ┆ --- │
│ u32 ┆ list[i64] │
╞═════════╪═════════════════╡
│ 1 ┆ [160, 540, 550] │
│ 2 ┆ [590, 370] │
└─────────┴─────────────────┘
Upvotes: 2
Reputation:
If you are happy with the for loop, that's it. I tried also with polars' data manipulation functions but I didn't manage to do it this way.
import polars as pl
df = pl.DataFrame({
"nums": [160, 540, 550, 590, 370],
"letters": ["B", "I", "I", "B", "I"]
})
out = []
for d, l in zip(df["nums"], df["letters"]):
if l == "B":
out.append([d])
elif l == "I":
out[-1].append(d)
else:
raise Exception("Unknown letter: " + l)
print(out)
Upvotes: 0