Reputation: 249
This question has helped me get started: Calculate 2-month rolling average for group in Pandas
However my data has the difference that I have 2 id columns.
(Note: Dates are dd/mm/yyyy format)
I want to calculate the average score of each specific ID, for the past month.
For example, on a row where ID 1000 appears in either the ID 1 or the ID 2 column, I want the average of all of ID 1000's scores (drawn from both columns) over the previous 30 days. In other words, every row of the dataset should carry the past-30-day average score of the ID it refers to.
My attempt:
dataset['Date'] = pandas.to_datetime(dataset['Date'])
dataset = dataset.sort_values(by=['ID 1','Date'])
dataset['ID 1 last month average rating'] = dataset.set_index('Date').groupby(
    'ID 1', sort=False)['ID 1 Score'].rolling('30D', closed='left').mean().round(2).to_numpy()
dataset['ID 2 last month average rating'] = dataset.set_index('Date').groupby(
    'ID 2', sort=False)['ID 2 Score'].rolling('30D', closed='left').mean().round(2).to_numpy()
It seems to be getting wrong results. Any idea what I'm doing wrong?
UPDATE:
Updated the input data.
Desired output:
Date | ID 1 | ID 2 | ID 1 Score | ID 2 Score | ID1 30D avg rtg | ID2 30D avg rtg
---|---|---|---|---|---|---
16/7/2022 | 1001 | 1000 | 0.89 | 5.35 | | 4.045
10/7/2022 | 1000 | 901 | 3.42 | 2.67 | 4.67 |
8/7/2022 | 530 | 1000 | 1.52 | 4.67 | |
That is, I add two columns to each row: the ID 1 average rating and the ID 2 average rating. If you look at ID 1000, you can see that it picks up the scores from the previous month (from both the ID 1 and ID 2 columns) and averages them up to that date.
Is it possible to get this output for all IDs?
Upvotes: 2
Views: 380
Reputation: 976
From what I understand, you would like to calculate, for each id, the average of all the scores of matching ids from previous dates (up to 30 days).
import pandas as pd
data = {
'date': ['16/7/2022', '10/7/2022', '8/7/2022'],
'id1': [1001, 1000, 530],
'id2': [1000, 901, 1000],
'id1_score': [0.89, 3.42, 1.52],
'id2_score': [5.35, 2.67, 4.67]
}
df = pd.DataFrame(data)
# Convert the date strings to datetime objects
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
print(df)
date id1 id2 id1_score id2_score
0 2022-07-16 1001 1000 0.89 5.35
1 2022-07-10 1000 901 3.42 2.67
2 2022-07-08 530 1000 1.52 4.67
This might not be the most efficient way, but one possible way to generate the `id1_30d_avg_rtg` (or `id2_30d_avg_rtg`) column is to use the `.apply()` method to iterate over each row. For each row:

1. Filter the DataFrame for rows where `id1` matches the applied row's id and whose `date` lies within the 30 days before the applied row's date, then extract the matching `id1_score` values.
2. Do the same for `id2`, extracting the `id2_score` values.
3. Use `pd.concat()` to concatenate these two results, then calculate the mean (and round the result).
I have created a function to accomplish this for both of a given row's ids at the same time:
def avg_scores(date, id1, id2, df, d=30):
"""
Calculates the average of the scores corresponding to id1 and id2 in the last d days (30 by default)
from the given date.
"""
# Get for each date its distance from the date being considered.
date_deltas = date - df['date']
# Check whether each of these dates falls within the past d days (excluding the given date itself).
deltas_check_30d = (date_deltas <= pd.Timedelta(days=d)) & (date_deltas > pd.Timedelta(days=0))
# A function to get the id scores corresponding to the id1 and id2 that match a given id value,
# and are in the time range, concatenating them together to get them as a single Series.
gen_scores = lambda idn: pd.concat([
df.loc[(df['id1'] == idn) & deltas_check_30d, 'id1_score'],
df.loc[(df['id2'] == idn) & deltas_check_30d, 'id2_score']
])
# Get the scores corresponding to id1 and id2 (as separate columns), merging them into a single DataFrame.
id_scores = pd.concat([gen_scores(id1), gen_scores(id2)], axis=1)
# Output the average of each scores column, rounding the result to 2 decimal places.
return round(id_scores.mean(), 2)
Usage:
# For each row, calculate the average id score in the last 30 days for each of the two ids,
# forming a separate column corresponding to each id.
df[['id1_30d_avg_rtg', 'id2_30d_avg_rtg']] = df.apply(lambda r: avg_scores(r['date'], r['id1'], r['id2'], df), axis=1)
print(df)
Output:
date id1 id2 id1_score id2_score id1_30d_avg_rtg id2_30d_avg_rtg
0 2022-07-16 1001 1000 0.89 5.35 NaN 4.04
1 2022-07-10 1000 901 3.42 2.67 4.67 NaN
2 2022-07-08 530 1000 1.52 4.67 NaN NaN
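As a follow-up note: because `.apply()` re-filters the entire frame for every row, this approach scales quadratically. A vectorized sketch of the same trailing 30-day average (my own alternative, not part of the original answer; it assumes each (id, date) pair occurs at most once) reshapes the data into one row per observation and uses a time-based rolling window:

```python
import pandas as pd

data = {
    'date': ['16/7/2022', '10/7/2022', '8/7/2022'],
    'id1': [1001, 1000, 530],
    'id2': [1000, 901, 1000],
    'id1_score': [0.89, 3.42, 1.52],
    'id2_score': [5.35, 2.67, 4.67],
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Reshape to one row per (date, id, score) observation, from either id column.
long = pd.concat([
    df[['date', 'id1', 'id1_score']].rename(columns={'id1': 'id', 'id1_score': 'score'}),
    df[['date', 'id2', 'id2_score']].rename(columns={'id2': 'id', 'id2_score': 'score'}),
]).sort_values('date')

# Trailing 30-day mean per id; closed='left' excludes each row's own score.
rolled = (long.set_index('date')
              .groupby('id')['score']
              .rolling('30D', closed='left')
              .mean()
              .round(2))

# rolled is indexed by (id, date); map it back onto the wide frame.
# NOTE: this assumes each (id, date) pair is unique, otherwise reindex fails.
for col in ('id1', 'id2'):
    idx = pd.MultiIndex.from_arrays([df[col], df['date']], names=['id', 'date'])
    df[f'{col}_30d_avg_rtg'] = rolled.reindex(idx).to_numpy()
```

On the example data this reproduces the output above: 4.04 for id 1000 on 16/7, 4.67 on 10/7, NaN elsewhere.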
Upvotes: 1
Reputation: 1302
If I understand correctly, the first dataframe should be `melt`ed so that there is one line per score, with the associated ID and Date.
import pandas as pd

# Example dataframe
df = pd.DataFrame([
["10/8/2022", 1000, 1001, 0.89, 5.35],
["10/10/2022", 1000, 673, 0.25, 1.45],
["9/8/2022", 1045, 832, 1.45, 2.89],
["10/7/2022", 1000, 901, 3.42, 2.67],
["8/7/2022", 530, 1000, 1.52, 4.67],
], columns=["Date", "ID 1", "ID 2", "ID 1 Score", "ID 2 Score"])
df["Date"] = pd.to_datetime(df["Date"])
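A caveat about the `pd.to_datetime` call above: without an explicit `format`, ambiguous strings such as "10/8/2022" are parsed month-first, which is how the tables below come out (October 8, not 10 August). Since the question states the dates are dd/mm/yyyy, passing `dayfirst=True` or `format='%d/%m/%Y'` may be what is actually intended:

```python
import pandas as pd

# Default parsing treats the first field of an ambiguous date as the month:
print(pd.to_datetime("10/8/2022"))                 # October 8
# dayfirst=True treats the first field as the day instead:
print(pd.to_datetime("10/8/2022", dayfirst=True))  # 10 August
```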
| | Date | ID 1 | ID 2 | ID 1 Score | ID 2 Score |
|---|---|---|---|---|---|
| 0 | 2022-10-08 | 1000 | 1001 | 0.89 | 5.35 |
| 1 | 2022-10-10 | 1000 | 673 | 0.25 | 1.45 |
| 2 | 2022-09-08 | 1045 | 832 | 1.45 | 2.89 |
| 3 | 2022-10-07 | 1000 | 901 | 3.42 | 2.67 |
| 4 | 2022-08-07 | 530 | 1000 | 1.52 | 4.67 |
# Melt both [ID 1, ID 2] , and [ID 1 Score, ID 2 Score]
mdf1 = pd.melt(df, id_vars=["Date"], value_vars=['ID 1', 'ID 2'], var_name="ID type", value_name="ID")
mdf2 = pd.melt(df, value_vars=["ID 1 Score", "ID 2 Score"], var_name="Score type", value_name="Score")
melt_df = pd.concat((mdf1, mdf2), axis=1)
melt_df = melt_df.sort_values(["ID", "Date"])
| | Date | ID type | ID | Score type | Score |
|---|---|---|---|---|---|
| 4 | 2022-08-07 | ID 1 | 530 | ID 1 Score | 1.52 |
| 7 | 2022-09-08 | ID 2 | 832 | ID 2 Score | 2.89 |
| 8 | 2022-10-07 | ID 2 | 901 | ID 2 Score | 2.67 |
| 9 | 2022-08-07 | ID 2 | 1000 | ID 2 Score | 4.67 |
| 3 | 2022-10-07 | ID 1 | 1000 | ID 1 Score | 3.42 |
| 0 | 2022-10-08 | ID 1 | 1000 | ID 1 Score | 0.89 |
| 1 | 2022-10-10 | ID 1 | 1000 | ID 1 Score | 0.25 |
| 5 | 2022-10-08 | ID 2 | 1001 | ID 2 Score | 5.35 |
| 6 | 2022-10-10 | ID 2 | 1001 | ID 2 Score | 1.45 |
| 2 | 2022-09-08 | ID 1 | 1045 | ID 1 Score | 1.45 |
ID 1000 and 1001 are the interesting ones here, since they appear on multiple dates.
The columns of interest are now Date, ID, and Score. You can use `pd.Grouper` to group by ID plus a 30-day time window and compute the mean score.
result = melt_df.groupby(["ID", pd.Grouper(key='Date', freq='30D')]).agg({"Score": "mean"}).reset_index()
| | ID | Date | Score |
|---|---|---|---|
| 0 | 530 | 2022-08-07 | 1.52 |
| 1 | 832 | 2022-09-06 | 2.89 |
| 2 | 901 | 2022-10-06 | 2.67 |
| 3 | 1000 | 2022-08-07 | 4.67 |
| 4 | 1000 | 2022-10-06 | 1.52 |
| 5 | 1001 | 2022-10-06 | 3.4 |
| 6 | 1045 | 2022-09-06 | 1.45 |
One date is left for ID 1001, with an average score of 3.4, and two dates are left for ID 1000, one per 30-day window.
Upvotes: 1