Imagine transactions, each identified by an amount, arriving throughout the day. You want to calculate the running total of amount at given points in time (9 am, 10 am, etc.).
With pandas, I would use apply to perform such an operation. With Polars, I tried using map_elements. I have also considered group_by_dynamic, but I am not sure it gives me control of the time grid's start / end / increment. Is there a better way?
import datetime

import polars as pl

# Transactions: three events inside the 11:01-11:03 window.
df = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 1),
        datetime.datetime(2025, 2, 2, 11, 2),
        datetime.datetime(2025, 2, 2, 11, 3),
    ],
    "amount": [5.0, -1, 10],
})

# The time grid at which the running total should be evaluated.
dg = pl.DataFrame(
    pl.datetime_range(
        datetime.datetime(2025, 2, 2, 11, 0),
        datetime.datetime(2025, 2, 2, 11, 5),
        "1m",
        eager=True,
    ),
    schema=["time"],
)

# Current approach: a Python callback per grid point, summing all
# transactions up to that time.
def _cumsum(dt):
    return df.filter(pl.col("time") <= dt).select(pl.col("amount")).sum().item()

dg.with_columns(
    cum_amount=pl.col("time").map_elements(_cumsum, return_dtype=pl.Float64)
)
This can be achieved relying purely on Polars' native expression API.
As a first step, we can associate each row in df with the earliest timestamp in dg that is equal to or later than the corresponding timestamp in df. For this, pl.DataFrame.join_asof with strategy="forward" can be used.
df.join_asof(dg, on="time", strategy="forward", coalesce=False)
shape: (3, 3)
┌─────────────────────┬────────┬─────────────────────┐
│ time ┆ amount ┆ time_right │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ datetime[μs] │
╞═════════════════════╪════════╪═════════════════════╡
│ 2025-02-02 11:01:00 ┆ 5.0 ┆ 2025-02-02 11:01:00 │
│ 2025-02-02 11:02:00 ┆ -1.0 ┆ 2025-02-02 11:02:00 │
│ 2025-02-02 11:03:00 ┆ 10.0 ┆ 2025-02-02 11:03:00 │
└─────────────────────┴────────┴─────────────────────┘
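Note that pl.DataFrame.join_asof expects both frames to be sorted by the join key, which already holds here since both df and dg are in chronological order. If that ordering is not guaranteed upstream, sorting first is a cheap safeguard (a defensive sketch, not needed for this example):
# df and dg are already sorted by "time"; sorting is purely defensive here.
df.sort("time").join_asof(dg.sort("time"), on="time", strategy="forward", coalesce=False)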
Next, we can use these timestamps to join the amount values to dg.
dg.join(
    (
        df
        .join_asof(dg, on="time", strategy="forward", coalesce=False)
        .select("amount", pl.col("time_right").alias("time"))
    ),
    on="time",
    how="left",
)
shape: (6, 2)
┌─────────────────────┬────────┐
│ time ┆ amount │
│ --- ┆ --- │
│ datetime[μs] ┆ f64 │
╞═════════════════════╪════════╡
│ 2025-02-02 11:00:00 ┆ null │
│ 2025-02-02 11:01:00 ┆ 5.0 │
│ 2025-02-02 11:02:00 ┆ -1.0 │
│ 2025-02-02 11:03:00 ┆ 10.0 │
│ 2025-02-02 11:04:00 ┆ null │
│ 2025-02-02 11:05:00 ┆ null │
└─────────────────────┴────────┘
Note that we rename the time_right column in the dataframe returned by pl.DataFrame.join_asof back to time before merging into dg.
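If you prefer, the same renaming can be expressed by dropping the original event time and renaming the grid timestamp instead; a small equivalent sketch:
(
    df
    .join_asof(dg, on="time", strategy="forward", coalesce=False)
    .drop("time")  # drop the original event time
    .rename({"time_right": "time"})  # keep the grid timestamp under "time"
)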
While not shown in this example, there might now be duplicate rows for a given timestamp, as multiple values in df can be associated with the same timestamp in dg (a small sketch of this case follows the final output below). Hence, we first aggregate the amount values for each timestamp. Then, we can perform a regular cumulative sum.
(
    dg
    .join(
        (
            df
            .join_asof(dg, on="time", strategy="forward", coalesce=False)
            .select("amount", pl.col("time_right").alias("time"))
        ),
        on="time",
        how="left",
    )
    # maintain_order=True keeps the grid's chronological order, which the
    # cumulative sum below relies on (group_by does not guarantee row
    # order otherwise; alternatively, sort by "time" before cum_sum).
    .group_by("time", maintain_order=True).agg(pl.col("amount").sum())
    .with_columns(
        pl.col("amount").cum_sum().name.prefix("cum_")
    )
)
shape: (6, 3)
┌─────────────────────┬────────┬────────────┐
│ time ┆ amount ┆ cum_amount │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 │
╞═════════════════════╪════════╪════════════╡
│ 2025-02-02 11:00:00 ┆ 0.0 ┆ 0.0 │
│ 2025-02-02 11:01:00 ┆ 5.0 ┆ 5.0 │
│ 2025-02-02 11:02:00 ┆ -1.0 ┆ 4.0 │
│ 2025-02-02 11:03:00 ┆ 10.0 ┆ 14.0 │
│ 2025-02-02 11:04:00 ┆ 0.0 ┆ 14.0 │
│ 2025-02-02 11:05:00 ┆ 0.0 ┆ 14.0 │
└─────────────────────┴────────┴────────────┘
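This matches the cum_amount column produced by the map_elements approach in the question. To illustrate the duplicate-timestamp case mentioned above, here is a minimal sketch; the two sub-minute transactions are hypothetical and not part of the original data:
# Both hypothetical rows map forward to the 11:02:00 grid point, so the
# left join to dg would yield two rows for that timestamp, which the
# group_by then collapses into a single summed amount before cum_sum.
df2 = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 1, 30),
        datetime.datetime(2025, 2, 2, 11, 1, 45),
    ],
    "amount": [2.0, 3.0],
})
df2.join_asof(dg, on="time", strategy="forward", coalesce=False)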