Rahul
Rahul

Reputation: 147

How to use Pandas rolling with groupby and removing duplicates

I have a table like this:

date    ID    Value
1       aa     5.5
1       aa     5.5
1       bb     66
1       bb     66
2       cc     2.03
2       aa     0.1
2       aa     0.1
3       bb     7
4       dd     7
5       aa     4
5       aa     4

Some information about rows:

Date - Same date can appear in more than one row
ID - Same ID can appear in more than one row
If ID and Date are same, then value will also be same.

I want to calculate rolling().mean() of Value column. But I want to groupby date and ID, and calculate the rolling mean that, it does not take the mean of row with same date.

So this do not work, because in this, it takes rolling mean of same date twice.

df.groupby(["ID","date"])["Value"].rolling(3).mean()

I have implemented a for-loop solution but it is way slower, and I am working on millions of rows. This is my current solution.

uniqueID = df["ID"]
for idname in uniqueID:
    temp = df.loc[df["ID"] == idname].drop_duplicates(inplace=False,keep="first",subset="date").set_index('date', inplace=False)["Value"].rolling(3).mean()
    df.loc[df["ID"] == idname,"rollingmeanCol"] = pd.merge(df.loc[df["ID"]==idname,["date"]],temp,on=["date"],how="left")["Value"].values

Any faster solution without loops?

In the loop also, I am doing this query 3 times, any way to do this query 1 time only? df.loc[df["ID"] == idname]

Expected output (can be verified using the above loop code)

date   ID   Value   rollingmeanCol
1   aa     5.5        NaN
1   aa     5.5        NaN
1   bb     66.0       NaN
1   bb     66.0       NaN
2   cc     2.03       NaN
2   aa     0.1        NaN
2   aa     0.1        NaN
3   bb     7.0        NaN
4   dd     7.0        NaN
5   aa     4.0        3.1999999999999997
5   aa     4.0        3.1999999999999997

Upvotes: 0

Views: 162

Answers (1)

Toukenize
Toukenize

Reputation: 1420

You can try this:

# Calculate rolling mean

rollingmeanCol = (
    df
    .drop_duplicates()
    .sort_values('date')
    .set_index('date')
    .groupby("ID")["Value"].rolling(3).mean()
    .rename('rollingmeanCol')
)

# Merge it with your df

df = df.merge(rollingmeanCol, on=['date','ID'])

Output is the same as your expected.

Upvotes: 1

Related Questions