Reputation: 157
I have time series data with an id column and some continues value column. I want to calculate the moving weekly averages of this value per each person, in a new column. Code generate a sample dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(freq=f'{60}T',start='2020-01-01',periods=(1)*24*14))
df['col'] = np.random.random_integers(0, 250, size= df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq=f'{60}T',start='2020-01-01',periods=(1)*24*14))
df2['col'] = np.random.random_integers(0, 150, size= df2.shape[0])
df2['uid'] = 2
df3=pd.concat([df, df2]).reset_index()
df3
This sample has 2 weeks of data per person, so there should be 2 average values per person. The first one will be the average of the first week, the second average will be the average of the two weeks (week1 average + week2 average)/2. Then, fill all the rows in the column with the average value of that week.
Real dataset is large, so I am looking for a solution that can be scaled. How can this be achieved?
desired outcome should look like this:
index uid col week_average
2020-01-01 00:00:00 1 104 week1_uid1_mean
2020-01-01 01:00:00 1 150 week1_uid1_mean
2020-01-01 02:00:00 1 243 week1_uid1_mean
....
2020-01-08 00:00:00 1 174 (week1_uid1_mean+week2_uid1_mean)/2
2020-01-08 01:00:00 1 24 (week1_uid1_mean+week2_uid1_mean)/2
...
Upvotes: 1
Views: 150
Reputation: 98
First, compute the week index for each row
df3["week"] = (
df3["index"] - df3.groupby("uid")["index"].transform("min")
) // pd.Timedelta(7, unit="day")
Or if the values in the index
column are identical for all persons (uid
), directly
df3["week"] = (df3["index"] - df3["index"].min()) // pd.Timedelta(
7, unit="day"
)
Then, compute the week average for each distinct couple of (uid
, week
)
week_averages = (
df3.groupby(["uid", "week"])["col"]
.mean()
.groupby("uid")
.apply(lambda x: x.rolling(len(x), min_periods=1).mean())
)
Finally, fill each row of your dataframe with the corresponding week average
df3["week_average"] = df3.apply(
lambda x: week_averages.loc[(x["uid"], x["week"])], axis=1
)
On your data, when using %timeit
, I get
39.5 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1