flT

Reputation: 131

Pandas: how to perform groupby rolling deduplicate sum in an efficient manner?

We have a dataframe indexed by time, with an identifier column id_col, a group column group_col, and a value column value_col="value". value_col is updated at random intervals, with a different frequency per id_col. The question is how to sum, within each group, the most recent update of each id_col. My initial idea was to do something like:

df["duplicates_rolling_sum"] = (
    df.groupby([id_col, group_col])[value_col]
    .rolling("30d")
    .apply(lambda x: x.iloc[:-1].sum())  # window sum excluding the current row
    .droplevel([0, 1])  # drop the group keys so the result aligns with df's index
    .fillna(0)
)

df["rolling_deduplicate_sum"] = (
    df.groupby(group_col)[value_col].rolling("30d").sum().droplevel(0)
    - df.groupby(group_col)["duplicates_rolling_sum"].rolling("30d").sum().droplevel(0)
)

However, this is extremely slow on a large dataset (over 7 million rows) with many groups. The slowdown seems to come from the per-window `.iloc[:-1].sum()` apply. Do you have an idea for a more efficient way to perform this?
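One observation, as a sketch rather than a full answer: within each window, `x.iloc[:-1].sum()` equals `x.sum() - x.iloc[-1]`, so the per-window Python `apply` can be replaced by a single vectorized rolling sum minus the current value. A minimal example, with hypothetical column names and toy data standing in for the real frame:

```python
import pandas as pd

id_col, group_col, value_col = "id", "group", "value"

# Toy frame (assumed shape): time-indexed updates, one row per update.
df = pd.DataFrame(
    {
        id_col: ["a", "b", "a", "b"],
        group_col: ["g", "g", "g", "g"],
        value_col: [1.0, 10.0, 2.0, 20.0],
    },
    index=pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
)

# Vectorized equivalent of apply(lambda x: x.iloc[:-1].sum()):
# rolling sum over the window, minus the current row's value.
rolled = (
    df.groupby([id_col, group_col])[value_col]
    .rolling("30d")
    .sum()
    .droplevel([0, 1])  # drop group keys so the index aligns back with df
)
df["duplicates_rolling_sum"] = rolled - df[value_col]
```

This avoids calling a Python lambda once per window, which is usually where `rolling().apply()` loses time on millions of rows.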

Upvotes: 0

Views: 60

Answers (0)
