Reputation: 73
I have a dataframe of the following format:
import pandas as pd
df = pd.DataFrame(
{
"date":["2020/01/01","2020/01/02","2020/01/03","2020/01/04","2020/01/05"],
"days_before": [["2019/12/31","2019/12/30"], ["2020/01/01", "2019/12/31"], ["2020/01/02","2020/01/01"], ["2020/01/03","2020/01/02"], ["2020/01/04","2020/01/03"]],
"count":[6,3,2,5,7]
})
print(df)
date days_before count
0 2020/01/01 [2019/12/31, 2019/12/30] 6
1 2020/01/02 [2020/01/01, 2019/12/31] 3
2 2020/01/03 [2020/01/02, 2020/01/01] 2
3 2020/01/04 [2020/01/03, 2020/01/02] 5
4 2020/01/05 [2020/01/04, 2020/01/03] 7
I am wanted to essentially add a new column "before_avg" that is the average of count for all the days in the respective days_before column.
I know I can using a .apply function to calculate this column, but I wanted to know how to vectorize this function in order to speed up execution time.
Here is my .apply code:
def calculateAvg(row):
localDf = df[df["date"].isin(row["days_before"])]["count"]
return localDf.mean()
df["before_avg"] = df.apply(lambda row: calculateAvg(row), axis=1)
print(df)
date days_before count avg
0 2020/01/01 [2019/12/31, 2019/12/30] 6 NaN
1 2020/01/02 [2020/01/01, 2019/12/31] 3 6.0
2 2020/01/03 [2020/01/02, 2020/01/01] 2 4.5
3 2020/01/04 [2020/01/03, 2020/01/02] 5 2.5
4 2020/01/05 [2020/01/04, 2020/01/03] 7 3.5
Upvotes: 1
Views: 68
Reputation: 13518
With the dataframe you provided, here is one way to it using pandas rolling:
df["avg"] = (
pd.concat([pd.Series([6]), df["count"]])[:-1]
.rolling(2)
.mean()
.reset_index(drop=True)
)
print(df)
# Output
date days_before count avg
0 2020/01/01 [2019/12/31, 2019/12/30] 6 NaN
1 2020/01/02 [2020/01/01, 2019/12/31] 3 6.0
2 2020/01/03 [2020/01/02, 2020/01/01] 2 4.5
3 2020/01/04 [2020/01/03, 2020/01/02] 5 2.5
4 2020/01/05 [2020/01/04, 2020/01/03] 7 3.5
Upvotes: 1