Imagine transactions, each identified by an amount, arriving throughout the day. You want to calculate the running total of amount at given points in time (9 am, 10 am, etc.).
With pandas, I would use apply to perform such an operation. With Polars, I tried using map_elements. I have also considered group_by_dynamic, but I am not sure it gives me control of the time grid's start / end / increment. Is there a better way?
import datetime

import polars as pl

# Transactions: three events inside the 11:01-11:03 window.
df = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 1),
        datetime.datetime(2025, 2, 2, 11, 2),
        datetime.datetime(2025, 2, 2, 11, 3),
    ],
    "amount": [5.0, -1, 10],
})

# The time grid at which the running total should be evaluated.
dg = pl.DataFrame(
    pl.datetime_range(
        datetime.datetime(2025, 2, 2, 11, 0),
        datetime.datetime(2025, 2, 2, 11, 5),
        "1m",
        eager=True,
    ),
    schema=["time"],
)

# Current approach: a Python callback per grid point, summing all
# transactions up to that time.
def _cumsum(dt):
    return df.filter(pl.col("time") <= dt).select(pl.col("amount")).sum().item()

dg.with_columns(
    cum_amount=pl.col("time").map_elements(_cumsum, return_dtype=pl.Float64)
)
This can be achieved relying purely on Polars' native expression API.
As a first step, we can associate each row in df with the earliest timestamp in dg that is equal to or later than the corresponding timestamp in df. For this, pl.DataFrame.join_asof with strategy="forward" can be used.
df.join_asof(dg, on="time", strategy="forward", coalesce=False)
shape: (3, 3)
┌─────────────────────┬────────┬─────────────────────┐
│ time ┆ amount ┆ time_right │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ datetime[μs] │
╞═════════════════════╪════════╪═════════════════════╡
│ 2025-02-02 11:01:00 ┆ 5.0 ┆ 2025-02-02 11:01:00 │
│ 2025-02-02 11:02:00 ┆ -1.0 ┆ 2025-02-02 11:02:00 │
│ 2025-02-02 11:03:00 ┆ 10.0 ┆ 2025-02-02 11:03:00 │
└─────────────────────┴────────┴─────────────────────┘
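Note that pl.DataFrame.join_asof expects both frames to be sorted by the join key, which already holds here since both df and dg are in chronological order. If that ordering is not guaranteed upstream, sorting first is a cheap safeguard (a defensive sketch, not needed for this example):
# df and dg are already sorted by "time"; sorting is purely defensive here.
df.sort("time").join_asof(dg.sort("time"), on="time", strategy="forward", coalesce=False)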
Next, we can use these timestamps to join the amount values to dg.
dg.join(
    (
        df
        .join_asof(dg, on="time", strategy="forward", coalesce=False)
        .select("amount", pl.col("time_right").alias("time"))
    ),
    on="time",
    how="left",
)
shape: (6, 2)
┌─────────────────────┬────────┐
│ time ┆ amount │
│ --- ┆ --- │
│ datetime[μs] ┆ f64 │
╞═════════════════════╪════════╡
│ 2025-02-02 11:00:00 ┆ null │
│ 2025-02-02 11:01:00 ┆ 5.0 │
│ 2025-02-02 11:02:00 ┆ -1.0 │
│ 2025-02-02 11:03:00 ┆ 10.0 │
│ 2025-02-02 11:04:00 ┆ null │
│ 2025-02-02 11:05:00 ┆ null │
└─────────────────────┴────────┘
Note that we rename the time_right column in the dataframe returned by pl.DataFrame.join_asof back to time before merging into dg.
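If you prefer, the same renaming can be expressed by dropping the original event time and renaming the grid timestamp instead; a small equivalent sketch:
(
    df
    .join_asof(dg, on="time", strategy="forward", coalesce=False)
    .drop("time")  # drop the original event time
    .rename({"time_right": "time"})  # keep the grid timestamp under "time"
)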
While not shown in this example, there might now be duplicate rows for a given timestamp, as multiple values in df can be associated with the same timestamp in dg (a small sketch of this case follows the final output below). Hence, we first aggregate the amount values for each timestamp. Then, we can perform a regular cumulative sum.
(
    dg
    .join(
        (
            df
            .join_asof(dg, on="time", strategy="forward", coalesce=False)
            .select("amount", pl.col("time_right").alias("time"))
        ),
        on="time",
        how="left",
    )
    # maintain_order=True keeps the grid's chronological order, which the
    # cumulative sum below relies on (group_by does not guarantee row
    # order otherwise; alternatively, sort by "time" before cum_sum).
    .group_by("time", maintain_order=True).agg(pl.col("amount").sum())
    .with_columns(
        pl.col("amount").cum_sum().name.prefix("cum_")
    )
)
shape: (6, 3)
┌─────────────────────┬────────┬────────────┐
│ time ┆ amount ┆ cum_amount │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 │
╞═════════════════════╪════════╪════════════╡
│ 2025-02-02 11:00:00 ┆ 0.0 ┆ 0.0 │
│ 2025-02-02 11:01:00 ┆ 5.0 ┆ 5.0 │
│ 2025-02-02 11:02:00 ┆ -1.0 ┆ 4.0 │
│ 2025-02-02 11:03:00 ┆ 10.0 ┆ 14.0 │
│ 2025-02-02 11:04:00 ┆ 0.0 ┆ 14.0 │
│ 2025-02-02 11:05:00 ┆ 0.0 ┆ 14.0 │
└─────────────────────┴────────┴────────────┘
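This matches the cum_amount column produced by the map_elements approach in the question. To illustrate the duplicate-timestamp case mentioned above, here is a minimal sketch; the two sub-minute transactions are hypothetical and not part of the original data:
# Both hypothetical rows map forward to the 11:02:00 grid point, so the
# left join to dg would yield two rows for that timestamp, which the
# group_by then collapses into a single summed amount before cum_sum.
df2 = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 1, 30),
        datetime.datetime(2025, 2, 2, 11, 1, 45),
    ],
    "amount": [2.0, 3.0],
})
df2.join_asof(dg, on="time", strategy="forward", coalesce=False)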