David Andrews
David Andrews

Reputation: 11

ValueError: window must be an integer 0 or greater when using rolling window on reset_index

I am trying to compute the rolling sum of a column for each date based on each unique combination of two features. Here is an example of the code:

import pandas as pd
from tqdm import tqdm

# Example DataFrame
data = {'DATE_ADDED': ['2024-03-01', '2024-03-02', '2024-03-03', '2024-03-04', '2024-03-05', '2024-03-06'],
        'MAKE': ['Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota'],
        'MODEL': ['Camry', 'Camry', 'Camry', 'Camry', 'Camry', 'Camry'],
        'MM_1': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)

# Convert 'DATE_ADDED' column to datetime
df['DATE_ADDED'] = pd.to_datetime(df['DATE_ADDED'])

# Sort DataFrame by 'DATE_ADDED'
df = df.sort_values(by='DATE_ADDED')

# Get unique combinations of MAKE and MODEL
unique_make_model_combinations = df.groupby(['MAKE', 'MODEL']).size().reset_index()
unique_combinations_list = unique_make_model_combinations[['MAKE', 'MODEL']].values.tolist()

# Create an empty DataFrame to store results
result_df = pd.DataFrame(columns=['DATE_ADDED', 'MAKE', 'MODEL', 'MM_1_sum_30d'])

# Iterate over unique combinations
for combination in tqdm(unique_combinations_list, desc='Processing'):
    make, model = combination[0], combination[1]
    
    # Filter DataFrame for the specific make and model
    filtered_df = df[(df['MAKE'] == make) & (df['MODEL'] == model)]
    
    # Reset index to avoid duplicate index labels
    filtered_df.reset_index(drop=True, inplace=True)
    
    # Calculate rolling sum for the past 30 days
    filtered_df['MM_1_sum_30d'] = filtered_df['MM_1'].rolling('30D').sum()
    
    # Append results to the result DataFrame
    result_df = pd.concat([result_df, filtered_df], ignore_index=True)

print(result_df)

I can iterate over the df to perform the calculation, but it is a large df. Each date may have multiple entries, but I only want one to take one entry from each date. So if there are multiple TOYOTA CAMRY for a given date, I only need to sum one of those MM_1 values for each period in the rolling time frame that TOYOTA CAMRY exists.

I believe it has something to do with reindexing and potentially having a non-unique index.

Any advice would be greatly appreciated!

Upvotes: 0

Views: 115

Answers (1)

DataSciRookie
DataSciRookie

Reputation: 1200

 import pandas as pd

# Example DataFrame
data = {
    'DATE_ADDED': ['2024-03-01', '2024-03-02', '2024-03-03', '2024-03-04', '2024-03-05', '2024-03-06'],
    'MAKE': ['Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota', 'Toyota'],
    'MODEL': ['Camry', 'Camry', 'Camry', 'Camry', 'Camry', 'Camry'],
    'MM_1': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)

# Convert 'DATE_ADDED' column to datetime and ensure it's the index
df['DATE_ADDED'] = pd.to_datetime(df['DATE_ADDED'])
df.set_index('DATE_ADDED', inplace=True)

# Assuming 'MM_1' needs to be summed per day for each MAKE and MODEL combination before the rolling sum
df_daily_sum = df.groupby([pd.Grouper(freq='D'), 'MAKE', 'MODEL']).sum().reset_index()

# Calculating the rolling sum within each group requires 'DATE_ADDED' as index again
df_daily_sum.set_index('DATE_ADDED', inplace=True)

# Use transform to calculate the rolling sum and maintain alignment with the original DataFrame
df_daily_sum['MM_1_sum_30d'] = df_daily_sum.groupby(['MAKE', 'MODEL'])['MM_1'].transform(lambda x: x.rolling('30D').sum())

print(df_daily_sum.reset_index())

Upvotes: 0

Related Questions