fluffy
fluffy

Reputation: 3

Rolling average for forward values keeps breaking

dataset sample: https://pastebin.com/DeLhAsKL

if df['Actual_v_Predicted'] == "Predicted", it is meant to do a rolling mean average of the last 7 values and forward fill the ratio column, however, this keeps breaking after reaching 01/07/2025.

the way I've done this atm is

df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan,df['ratio'] )
df['ratio'] = df['ratio'].fillna(df['ratio'].rolling(window=7, min_periods=1).mean())

I'm not sure why its breaking, but any help would be appreciated.

Upvotes: -1

Views: 108

Answers (2)

Mario
Mario

Reputation: 1976

I tried the following using parsing Date column while you read data as well as casting the columns\features properly otherwise you need to sort records and probably transform the date format:

import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') # setting ignore as a parameter

# Replace 'your_file.txt' with the actual file name
df = pd.read_csv('dataset.txt', sep='\t', names=['true/false', 'Date', 'ratio', 'Actual_v_Predicted'], parse_dates=['Date'])[1:]  # Skip the first row however skiprows=1 ruin the plot!!

# cast the features
df['true/false'] = df['true/false'].astype(bool)
df['ratio'] = pd.to_numeric(df['ratio'], errors='coerce')
df['Actual_v_Predicted'] = df['Actual_v_Predicted'].astype(str)
#df

# Assuming 'Actual_v_Predicted' column contains values like "Actual" and "Predicted"
predicted_data = df[df['Actual_v_Predicted'] == 'Predicted']
predicted_data['ratio'] = predicted_data['ratio'].rolling(window=7, min_periods=1).mean().ffill()

# Update the original DataFrame with the modified 'ratio' values
#df.update(predicted_data[['ratio']])


plt.figure(figsize=(12, 6))  # Adjust the figure size as needed

# Plot the original data
plt.plot(df[df['Actual_v_Predicted'] == 'Actual']['Date'], df[df['Actual_v_Predicted'] == 'Actual']['ratio'], label='Actual', marker='o', linestyle='-', color='blue')
plt.plot(df[df['Actual_v_Predicted'] == 'Predicted']['Date'], df[df['Actual_v_Predicted'] == 'Predicted']['ratio'], label='Predicted (Original)', marker='x', linestyle='--', color='red')

# Plot the predicted_data with rolling average
plt.plot(predicted_data['Date'], predicted_data['ratio'], label='Predicted (Rolling Avg)', marker='o', markerfacecolor='none', markersize=10, linestyle='--', color='green')

# Filter rows where 'ratio' is NaN
nan_rows = df[df['ratio'].isnull()]

# Fill NaN values with the mean of the 'ratio' column
nan_rows['ratio'].fillna(df['ratio'].mean(), inplace=True)

# Plot the NaN rows with the filled values just for show case
plt.plot(nan_rows['Date'], nan_rows['ratio']-20000, color='black', marker='v', label='Missing Values (NaN)', linestyle='dotted')


plt.xlabel('Date')
plt.ylabel('Ratio')
plt.title('Actual vs Predicted Ratio with Rolling Average of the last 7 values')
plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees
plt.legend()
plt.show()

img

Note: The disconnection between the red and blue parts is that I tried to differentiate ratio values with respect to the last column values (Actual or Predicted) with 2 colors otherwise it'd be continuous.

Upvotes: 0

mrconcerned
mrconcerned

Reputation: 1965

There are several things you might need to tackle:

  1. Sorting date values
df = df.sort_values(by='Date')
  1. Rolling Window Forward Fill

The rolling mean is computed on previous values within a defined window. If all previous values are NaN, the rolling mean will also produce NaN. You may need to forward-fill after applying the rolling mean.

So, instead of directly computing the rolling mean and assigning it back, you can try forward-filling explicitly for missing values after rolling mean:

df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])

# Compute rolling mean only on 'Actual' rows (already handled by np.where)
rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()

# Fill in 'Predicted' rows using rolling mean and then forward fill if needed
df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()
  1. Might need to date format
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
  1. Check the edge cases

After 01/07/2025, if there are no "Actual" values to base the rolling mean on, you'll get NaN. In such cases, you may need to handle long gaps or future dates more carefully. One option is to always fill missing values forward:

df['ratio'] = df['ratio'].ffill()

I have tried the following code with your dataset and it executed with no issues:

import pandas as pd
import numpy as np

df = pd.read_csv('dataset.txt', sep='\t')
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

df = df.sort_values(by='Date')
df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])

rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()
df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()

df.to_csv('updated_file.txt', sep='\t', index=False)

print(df)

Upvotes: 0

Related Questions