Iason

Reputation: 249

Calculate last month rolling average for multiple id columns in dataframe

This question has helped me get started: Calculate 2-month rolling average for group in Pandas

However, my data differs in that I have two ID columns.

(Note: Dates are dd/mm/yyyy format)


I want to calculate the average score of each specific ID, for the past month.

For example, the scores of ID 1000 may appear under either the ID 1 or the ID 2 column, and both must be taken into account when calculating the last-month average rating of ID 1000. So on every row of the dataset, I want the average score of the relevant ID for the past 30 days.

My attempt:

dataset['Date'] = pandas.to_datetime(dataset['Date'])
dataset = dataset.sort_values(by=['ID 1', 'Date'])
dataset['ID 1 last month average rating'] = dataset.set_index('Date').groupby('ID 1', sort=False)['ID 1 Score'].rolling('30D', closed='left').mean().round(2).to_numpy()
dataset['ID 2 last month average rating'] = dataset.set_index('Date').groupby('ID 2', sort=False)['ID 2 Score'].rolling('30D', closed='left').mean().round(2).to_numpy()

It seems to be getting wrong results. Any idea what I'm doing wrong?

UPDATE:

Updated the input data.

Desired output:

Date       ID 1  ID 2  ID 1 Score  ID 2 Score  ID1 30D avg rtg  ID2 30D avg rtg
16/7/2022  1001  1000        0.89        5.35                             4.045
10/7/2022  1000   901        3.42        2.67             4.67
8/7/2022    530  1000        1.52        4.67

Which means that I add 2 columns to each row, ID 1 average rating and ID 2 average rating. If you take a look at ID 1000, you see that it gets the scores from the previous month (from both the ID 1 and ID 2 columns) and calculates the average up until that date.

Is it possible to get this output for all IDs?

Upvotes: 2

Views: 380

Answers (2)

user21283023

Reputation: 976

From what I understand, you would like to calculate, for each id, the average of all the scores of matching ids from previous dates (up to 30 days).


Input data:

import pandas as pd

data = {
    'date': ['16/7/2022', '10/7/2022', '8/7/2022'],
    'id1': [1001, 1000, 530],
    'id2': [1000, 901, 1000],
    'id1_score': [0.89, 3.42, 1.52],
    'id2_score': [5.35, 2.67, 4.67]
}


df = pd.DataFrame(data)

# Convert the results to datetime objects
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

print(df)
        date   id1   id2  id1_score  id2_score
0 2022-07-16  1001  1000       0.89       5.35
1 2022-07-10  1000   901       3.42       2.67
2 2022-07-08   530  1000       1.52       4.67

Generating average score for each specific ID in the last 30 days

This might not be the most efficient way, but a possible solution to generating the id1_30d_avg_rtg (or id2_30d_avg_rtg) column is to use the .apply() method to iterate over each row.

For each row:

  1. Filter the DataFrame for rows where id1 matches the value of the applied row's id, and whose date falls within the 30 days before the applied row's date. Extract the values of id1_score from the result.

  2. Do the same for id2, extracting the values of id2_score.

  3. Use pd.concat() to concatenate these two results together, then calculate the mean (and round the result).

I have created a function to accomplish this for both of a given row's ids at the same time:

def avg_scores(date, id1, id2, df, d=30):
    """
    Calculates the average of the scores corresponding to id1 and id2 in the last d days (30 by default)
    from the given date.
    """
    
    # Get for each date its distance from the date being considered.
    date_deltas = date - df['date']
    
    # Check whether each of these dates is in a range of 1-d days.
    deltas_check_30d = (date_deltas <= pd.Timedelta(days=d)) & (date_deltas > pd.Timedelta(days=0))
    
    # A function to get the id scores corresponding to the id1 and id2 that match a given id value, 
    #  and are in the time range, concatenating them together to get them as a single Series.
    gen_scores = lambda idn: pd.concat([
        df.loc[(df['id1'] == idn) & deltas_check_30d, 'id1_score'],
        df.loc[(df['id2'] == idn) & deltas_check_30d, 'id2_score']
    ])
    
    # Get the scores corresponding to id1 and id2 (as separate columns), merging them together to form a single DataFrame.
    id_scores = pd.concat([gen_scores(id1), gen_scores(id2)], axis=1)
    
    # Output the average of each scores column, rounding the result to 2 decimal places.
    return round(id_scores.mean(), 2)

Output

Usage:

# For each row, calculate the average id score in the last 30 days for each of the two ids,
#  forming a separate column corresponding to each id.
df[['id1_30d_avg_rtg', 'id2_30d_avg_rtg']] = df.apply(lambda r: avg_scores(r['date'], r['id1'], r['id2'], df), axis=1)

print(df)

Output:

        date   id1   id2  id1_score  id2_score  id1_30d_avg_rtg  id2_30d_avg_rtg
0 2022-07-16  1001  1000       0.89       5.35              NaN             4.04
1 2022-07-10  1000   901       3.42       2.67             4.67              NaN
2 2022-07-08   530  1000       1.52       4.67              NaN              NaN

Upvotes: 1

thmslmr

Reputation: 1302

If I understand correctly, the first dataframe should be melted to get one line per Score, with its associated ID and Date.

import pandas as pd

# Example dataframe
df = pd.DataFrame([
    ["10/8/2022", 1000, 1001, 0.89, 5.35],
    ["10/10/2022", 1000, 673, 0.25, 1.45],
    ["9/8/2022", 1045, 832, 1.45, 2.89],
    ["10/7/2022", 1000, 901, 3.42, 2.67],
    ["8/7/2022", 530, 1000, 1.52, 4.67],
], columns=["Date", "ID 1", "ID 2", "ID 1 Score", "ID 2 Score"])
df["Date"] = pd.to_datetime(df["Date"])
        Date  ID 1  ID 2  ID 1 Score  ID 2 Score
0 2022-10-08  1000  1001        0.89        5.35
1 2022-10-10  1000   673        0.25        1.45
2 2022-09-08  1045   832        1.45        2.89
3 2022-10-07  1000   901        3.42        2.67
4 2022-08-07   530  1000        1.52        4.67
# Melt both [ID 1, ID 2] , and [ID 1 Score, ID 2 Score]
mdf1 = pd.melt(df, id_vars=["Date"], value_vars=['ID 1', 'ID 2'], var_name="ID type", value_name="ID")
mdf2 = pd.melt(df, value_vars=["ID 1 Score", "ID 2 Score"], var_name="Score type", value_name="Score")

melt_df = pd.concat((mdf1, mdf2), axis=1)
melt_df = melt_df.sort_values(["ID", "Date"])
        Date ID type    ID  Score type  Score
4 2022-08-07    ID 1   530  ID 1 Score   1.52
7 2022-09-08    ID 2   832  ID 2 Score   2.89
8 2022-10-07    ID 2   901  ID 2 Score   2.67
9 2022-08-07    ID 2  1000  ID 2 Score   4.67
3 2022-10-07    ID 1  1000  ID 1 Score   3.42
0 2022-10-08    ID 1  1000  ID 1 Score   0.89
1 2022-10-10    ID 1  1000  ID 1 Score   0.25
5 2022-10-08    ID 2  1001  ID 2 Score   5.35
6 2022-10-10    ID 2  1001  ID 2 Score   1.45
2 2022-09-08    ID 1  1045  ID 1 Score   1.45

ID 1000 and 1001 are the interesting ones here (with multiple dates)

The columns of interest here are Date, ID and Score. You can now use a special grouper to group by ID + time window of 30D and compute the mean score.

result = melt_df.groupby(["ID", pd.Grouper(key='Date', freq='30D')]).agg({"Score": "mean"}).reset_index()
     ID       Date  Score
0   530 2022-08-07   1.52
1   832 2022-09-06   2.89
2   901 2022-10-06   2.67
3  1000 2022-08-07   4.67
4  1000 2022-10-06   1.52
5  1001 2022-10-06   3.40
6  1045 2022-09-06   1.45

One row is left for ID 1001, with an average score of 3.4, and two rows are left for ID 1000, one per 30-day window.
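One caveat: `pd.Grouper(freq='30D')` puts rows into fixed, back-to-back 30-day bins, not a trailing window ending at each row's date, which is what the question asks for. If a per-row trailing average is wanted, a possible sketch (reusing the same example data; the `rolling_avg` column name is mine, and the asker's `closed='left'` could be added to exclude the current row) applies `rolling('30D')` per ID on the melted frame:

```python
import pandas as pd

df = pd.DataFrame([
    ["10/8/2022", 1000, 1001, 0.89, 5.35],
    ["10/10/2022", 1000, 673, 0.25, 1.45],
    ["9/8/2022", 1045, 832, 1.45, 2.89],
    ["10/7/2022", 1000, 901, 3.42, 2.67],
    ["8/7/2022", 530, 1000, 1.52, 4.67],
], columns=["Date", "ID 1", "ID 2", "ID 1 Score", "ID 2 Score"])
df["Date"] = pd.to_datetime(df["Date"])

# Melt IDs and scores into one long frame, as above.
mdf1 = pd.melt(df, id_vars=["Date"], value_vars=["ID 1", "ID 2"],
               var_name="ID type", value_name="ID")
mdf2 = pd.melt(df, value_vars=["ID 1 Score", "ID 2 Score"],
               var_name="Score type", value_name="Score")
melt_df = pd.concat((mdf1, mdf2), axis=1).sort_values(["ID", "Date"])

# Trailing 30-day mean per ID: the window ends at each row's own date
# instead of falling into a fixed bin. Positional assignment via to_numpy()
# is safe here because melt_df is already sorted by ID then Date.
melt_df["rolling_avg"] = (
    melt_df.set_index("Date")
           .groupby("ID")["Score"]
           .rolling("30D")
           .mean()
           .round(2)
           .to_numpy()
)

print(melt_df)
```

With this, each row of ID 1000 gets the mean of that ID's scores over the preceding 30 days up to that row, which can then be merged back onto the original wide frame if needed.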

Upvotes: 1
