Reputation: 249
This question has helped me get started: Calculate 2-month rolling average for group in Pandas
However my data has the difference that I have 2 id columns.
(Note: Dates are dd/mm/yyyy format)
I want to calculate the average score of each specific ID, for the past month.
For example, on a row where ID 1000 appears in either the ID 1 or the ID 2 column, I want the average of all of ID 1000's scores (drawn from both columns) over the previous 30 days. In other words, every row of the dataset should carry the past-30-day average score of the ID it refers to.
My attempt:
dataset['Date'] = pandas.to_datetime(dataset['Date'])
dataset = dataset.sort_values(by=['ID 1','Date'])
dataset['ID 1 last month average rating'] = dataset.set_index('Date').groupby(
    'ID 1', sort=False)['ID 1 Score'].rolling('30D', closed='left').mean().round(2).to_numpy()
dataset['ID 2 last month average rating'] = dataset.set_index('Date').groupby(
    'ID 2', sort=False)['ID 2 Score'].rolling('30D', closed='left').mean().round(2).to_numpy()
It seems to be getting wrong results. Any idea what I'm doing wrong?
UPDATE:
Updated the input data.
Desired output:
Date | ID 1 | ID 2 | ID 1 Score | ID 2 Score | ID1 30D avg rtg | ID2 30D avg rtg
---|---|---|---|---|---|---
16/7/2022 | 1001 | 1000 | 0.89 | 5.35 | | 4.045
10/7/2022 | 1000 | 901 | 3.42 | 2.67 | 4.67 |
8/7/2022 | 530 | 1000 | 1.52 | 4.67 | |
That is, I add two columns to each row: the ID 1 average rating and the ID 2 average rating. If you look at ID 1000, you can see that it picks up the scores from the previous month (from both the ID 1 and ID 2 columns) and averages them up to that date.
Is it possible to get this output for all IDs?
Upvotes: 2
Views: 380
Reputation: 976
From what I understand, you would like to calculate, for each id, the average of all the scores of matching ids from previous dates (up to 30 days).
import pandas as pd
data = {
'date': ['16/7/2022', '10/7/2022', '8/7/2022'],
'id1': [1001, 1000, 530],
'id2': [1000, 901, 1000],
'id1_score': [0.89, 3.42, 1.52],
'id2_score': [5.35, 2.67, 4.67]
}
df = pd.DataFrame(data)
# Convert the date strings to datetime objects
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
print(df)
date id1 id2 id1_score id2_score
0 2022-07-16 1001 1000 0.89 5.35
1 2022-07-10 1000 901 3.42 2.67
2 2022-07-08 530 1000 1.52 4.67
This might not be the most efficient way, but one possible way to generate the `id1_30d_avg_rtg` (or `id2_30d_avg_rtg`) column is to use the `.apply()` method to iterate over each row. For each row:

1. Filter the DataFrame for rows where `id1` matches the applied row's id and whose `date` lies within the 30 days before the applied row's date, then extract the matching `id1_score` values.
2. Do the same for `id2`, extracting the `id2_score` values.
3. Use `pd.concat()` to concatenate these two results, then calculate the mean (and round the result).
I have created a function to accomplish this for both of a given row's ids at the same time:
def avg_scores(date, id1, id2, df, d=30):
"""
Calculates the average of the scores corresponding to id1 and id2 in the last d days (30 by default)
from the given date.
"""
# Get for each date its distance from the date being considered.
date_deltas = date - df['date']
# Check whether each of these dates falls within the past d days (excluding the given date itself).
deltas_check_30d = (date_deltas <= pd.Timedelta(days=d)) & (date_deltas > pd.Timedelta(days=0))
# A function to get the id scores corresponding to the id1 and id2 that match a given id value,
# and are in the time range, concatenating them together to get them as a single Series.
gen_scores = lambda idn: pd.concat([
df.loc[(df['id1'] == idn) & deltas_check_30d, 'id1_score'],
df.loc[(df['id2'] == idn) & deltas_check_30d, 'id2_score']
])
# Get the scores corresponding to id1 and id2 (as separate columns), merging them into a single DataFrame.
id_scores = pd.concat([gen_scores(id1), gen_scores(id2)], axis=1)
# Output the average of each scores column, rounding the result to 2 decimal places.
return round(id_scores.mean(), 2)
Usage:
# For each row, calculate the average id score in the last 30 days for each of the two ids,
# forming a separate column corresponding to each id.
df[['id1_30d_avg_rtg', 'id2_30d_avg_rtg']] = df.apply(lambda r: avg_scores(r['date'], r['id1'], r['id2'], df), axis=1)
print(df)
Output:
date id1 id2 id1_score id2_score id1_30d_avg_rtg id2_30d_avg_rtg
0 2022-07-16 1001 1000 0.89 5.35 NaN 4.04
1 2022-07-10 1000 901 3.42 2.67 4.67 NaN
2 2022-07-08 530 1000 1.52 4.67 NaN NaN
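As a follow-up note: because `.apply()` re-filters the entire frame for every row, this approach scales quadratically. A vectorized sketch of the same trailing 30-day average (my own alternative, not part of the original answer; it assumes each (id, date) pair occurs at most once) reshapes the data into one row per observation and uses a time-based rolling window:

```python
import pandas as pd

data = {
    'date': ['16/7/2022', '10/7/2022', '8/7/2022'],
    'id1': [1001, 1000, 530],
    'id2': [1000, 901, 1000],
    'id1_score': [0.89, 3.42, 1.52],
    'id2_score': [5.35, 2.67, 4.67],
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Reshape to one row per (date, id, score) observation, from either id column.
long = pd.concat([
    df[['date', 'id1', 'id1_score']].rename(columns={'id1': 'id', 'id1_score': 'score'}),
    df[['date', 'id2', 'id2_score']].rename(columns={'id2': 'id', 'id2_score': 'score'}),
]).sort_values('date')

# Trailing 30-day mean per id; closed='left' excludes each row's own score.
rolled = (long.set_index('date')
              .groupby('id')['score']
              .rolling('30D', closed='left')
              .mean()
              .round(2))

# rolled is indexed by (id, date); map it back onto the wide frame.
# NOTE: this assumes each (id, date) pair is unique, otherwise reindex fails.
for col in ('id1', 'id2'):
    idx = pd.MultiIndex.from_arrays([df[col], df['date']], names=['id', 'date'])
    df[f'{col}_30d_avg_rtg'] = rolled.reindex(idx).to_numpy()
```

On the example data this reproduces the output above: 4.04 for id 1000 on 16/7, 4.67 on 10/7, NaN elsewhere.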
Upvotes: 1
Reputation: 1302
If I understand correctly, the first dataframe should be `melt`ed so that there is one line per score, with the associated ID and Date.
import pandas as pd

# Example dataframe
df = pd.DataFrame([
["10/8/2022", 1000, 1001, 0.89, 5.35],
["10/10/2022", 1000, 673, 0.25, 1.45],
["9/8/2022", 1045, 832, 1.45, 2.89],
["10/7/2022", 1000, 901, 3.42, 2.67],
["8/7/2022", 530, 1000, 1.52, 4.67],
], columns=["Date", "ID 1", "ID 2", "ID 1 Score", "ID 2 Score"])
df["Date"] = pd.to_datetime(df["Date"])
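A caveat about the `pd.to_datetime` call above: without an explicit `format`, ambiguous strings such as "10/8/2022" are parsed month-first, which is how the tables below come out (October 8, not 10 August). Since the question states the dates are dd/mm/yyyy, passing `dayfirst=True` or `format='%d/%m/%Y'` may be what is actually intended:

```python
import pandas as pd

# Default parsing treats the first field of an ambiguous date as the month:
print(pd.to_datetime("10/8/2022"))                 # October 8
# dayfirst=True treats the first field as the day instead:
print(pd.to_datetime("10/8/2022", dayfirst=True))  # 10 August
```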
| | Date | ID 1 | ID 2 | ID 1 Score | ID 2 Score |
|---|---|---|---|---|---|
| 0 | 2022-10-08 | 1000 | 1001 | 0.89 | 5.35 |
| 1 | 2022-10-10 | 1000 | 673 | 0.25 | 1.45 |
| 2 | 2022-09-08 | 1045 | 832 | 1.45 | 2.89 |
| 3 | 2022-10-07 | 1000 | 901 | 3.42 | 2.67 |
| 4 | 2022-08-07 | 530 | 1000 | 1.52 | 4.67 |
# Melt both [ID 1, ID 2] , and [ID 1 Score, ID 2 Score]
mdf1 = pd.melt(df, id_vars=["Date"], value_vars=['ID 1', 'ID 2'], var_name="ID type", value_name="ID")
mdf2 = pd.melt(df, value_vars=["ID 1 Score", "ID 2 Score"], var_name="Score type", value_name="Score")
melt_df = pd.concat((mdf1, mdf2), axis=1)
melt_df = melt_df.sort_values(["ID", "Date"])
| | Date | ID type | ID | Score type | Score |
|---|---|---|---|---|---|
| 4 | 2022-08-07 | ID 1 | 530 | ID 1 Score | 1.52 |
| 7 | 2022-09-08 | ID 2 | 832 | ID 2 Score | 2.89 |
| 8 | 2022-10-07 | ID 2 | 901 | ID 2 Score | 2.67 |
| 9 | 2022-08-07 | ID 2 | 1000 | ID 2 Score | 4.67 |
| 3 | 2022-10-07 | ID 1 | 1000 | ID 1 Score | 3.42 |
| 0 | 2022-10-08 | ID 1 | 1000 | ID 1 Score | 0.89 |
| 1 | 2022-10-10 | ID 1 | 1000 | ID 1 Score | 0.25 |
| 5 | 2022-10-08 | ID 2 | 1001 | ID 2 Score | 5.35 |
| 6 | 2022-10-10 | ID 2 | 1001 | ID 2 Score | 1.45 |
| 2 | 2022-09-08 | ID 1 | 1045 | ID 1 Score | 1.45 |
ID 1000 and 1001 are the interesting ones here, since they appear on multiple dates.
The columns of interest are now Date, ID, and Score. You can use `pd.Grouper` to group by ID plus a 30-day time window and compute the mean score.
result = melt_df.groupby(["ID", pd.Grouper(key='Date', freq='30D')]).agg({"Score": "mean"}).reset_index()
| | ID | Date | Score |
|---|---|---|---|
| 0 | 530 | 2022-08-07 | 1.52 |
| 1 | 832 | 2022-09-06 | 2.89 |
| 2 | 901 | 2022-10-06 | 2.67 |
| 3 | 1000 | 2022-08-07 | 4.67 |
| 4 | 1000 | 2022-10-06 | 1.52 |
| 5 | 1001 | 2022-10-06 | 3.4 |
| 6 | 1045 | 2022-09-06 | 1.45 |
One date is left for ID 1001, with an average score of 3.4, and two dates are left for ID 1000, one per 30-day window.
Upvotes: 1