Reputation: 11
I am trying to compute the rolling sum of a column for each date based on each unique combination of two features. Here is an example of the code:
import pandas as pd
from tqdm import tqdm
# Example DataFrame
data = {'DATE_ADDED': ['2024-03-01', '2024-03-02', '2024-03-03', '2024-03-04', '2024-03-05', '2024-03-06'],
'MAKE': ['Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota'],
'MODEL': ['Camry', 'Camry', 'Camry', 'Camry', 'Camry', 'Camry'],
'MM_1': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)
# Convert 'DATE_ADDED' column to datetime
df['DATE_ADDED'] = pd.to_datetime(df['DATE_ADDED'])
# Sort DataFrame by 'DATE_ADDED'
df = df.sort_values(by='DATE_ADDED')
# Get unique combinations of MAKE and MODEL
unique_make_model_combinations = df.groupby(['MAKE', 'MODEL']).size().reset_index()
unique_combinations_list = unique_make_model_combinations[['MAKE', 'MODEL']].values.tolist()
# Create an empty DataFrame to store results
result_df = pd.DataFrame(columns=['DATE_ADDED', 'MAKE', 'MODEL', 'MM_1_sum_30d'])
# Iterate over unique combinations
for combination in tqdm(unique_combinations_list, desc='Processing'):
make, model = combination[0], combination[1]
# Filter DataFrame for the specific make and model
filtered_df = df[(df['MAKE'] == make) & (df['MODEL'] == model)]
# Reset index to avoid duplicate index labels
filtered_df.reset_index(drop=True, inplace=True)
# Calculate rolling sum for the past 30 days
filtered_df['MM_1_sum_30d'] = filtered_df['MM_1'].rolling('30D').sum()
# Append results to the result DataFrame
result_df = pd.concat([result_df, filtered_df], ignore_index=True)
print(result_df)
I can iterate over the df to perform the calculation, but it is a large df. Each date may have multiple entries, but I only want one to take one entry from each date. So if there are multiple TOYOTA CAMRY for a given date, I only need to sum one of those MM_1 values for each period in the rolling time frame that TOYOTA CAMRY exists.
I believe it has something to do with reindexing and potentially having a non-unique index.
Any advice would be greatly appreciated!
Upvotes: 0
Views: 115
Reputation: 1200
import pandas as pd
# Example DataFrame
data = {
'DATE_ADDED': ['2024-03-01', '2024-03-02', '2024-03-03', '2024-03-04', '2024-03-05', '2024-03-06'],
'MAKE': ['Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota'],
'MODEL': ['Camry', 'Camry', 'Camry', 'Camry', 'Camry', 'Camry'],
'MM_1': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)
# Convert 'DATE_ADDED' column to datetime and ensure it's the index
df['DATE_ADDED'] = pd.to_datetime(df['DATE_ADDED'])
df.set_index('DATE_ADDED', inplace=True)
# Assuming 'MM_1' needs to be summed per day for each MAKE and MODEL combination before the rolling sum
df_daily_sum = df.groupby([pd.Grouper(freq='D'), 'MAKE', 'MODEL']).sum().reset_index()
# Calculating the rolling sum within each group requires 'DATE_ADDED' as index again
df_daily_sum.set_index('DATE_ADDED', inplace=True)
# Use transform to calculate the rolling sum and maintain alignment with the original DataFrame
df_daily_sum['MM_1_sum_30d'] = df_daily_sum.groupby(['MAKE', 'MODEL'])['MM_1'].transform(lambda x: x.rolling('30D').sum())
print(df_daily_sum.reset_index())
Upvotes: 0