We have a dataframe indexed by time, with an identifier column id_col, a group column group_col, and a value column value_col="value". The value_col is updated at random intervals, at a different frequency for each id_col. The question is how to sum the most recent update of each id_col within a group.
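For concreteness, a minimal frame of this shape might look like the following (the column names id and group, and the sample timestamps, are made up for illustration):

    import pandas as pd

    # Hypothetical stand-ins for the real column names.
    id_col, group_col, value_col = "id", "group", "value"

    df = pd.DataFrame(
        {
            id_col: ["a", "b", "a", "b", "a"],
            group_col: ["g1", "g1", "g1", "g2", "g1"],
            value_col: [1.0, 2.0, 3.0, 4.0, 5.0],
        },
        index=pd.to_datetime(
            ["2024-01-01", "2024-01-03", "2024-01-10", "2024-01-12", "2024-01-20"]
        ),
    )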
My initial idea was to perform something like:
df["duplicates_rolling_sum"] = (
df.groupby([id_col, group_col])[value_col]
.rolling("30d")
.apply(lambda x: x.iloc[:-1].sum())
.fillna(0))
df["rolling_deduplicate_sum"] = (
df.groupby(group_col)["value_col"].rolling("30d").sum() -
df.groupby(group_col)["duplicated_rolling_sum"].rolling("30d").sum()
)
However, this is extremely slow for a large dataset (>7 million rows) with many groups. The slowdown seems to come from the Python-level .apply(lambda x: x.iloc[:-1].sum()). Is there a more efficient way to do this?
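One direction that might help (a sketch under assumptions, not a tested drop-in): because the current row is always the last element of its own 30-day window, x.iloc[:-1].sum() equals x.sum() - x.iloc[-1]. That lets the Python-level lambda be replaced by pandas' built-in rolling sum, with the current value subtracted afterwards:

    # Sketch of a vectorized replacement for .apply(lambda x: x.iloc[:-1].sum()).
    # Assumes df is sorted by its time index and that the index is unique,
    # so the group-rolled result can be aligned back onto df.
    rolled = (
        df.groupby([id_col, group_col])[value_col]
        .rolling("30d")
        .sum()                # built-in rolling sum, computed in compiled code
        .droplevel([0, 1])    # drop the id/group index levels added by groupby
    )
    df["duplicates_rolling_sum"] = (rolled - df[value_col]).fillna(0)

If you do need to inspect the raw window, Rolling.apply also accepts raw=True and engine="numba", which can avoid much of the per-window Python overhead.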