Reputation: 3
dataset sample: https://pastebin.com/DeLhAsKL
if df['Actual_v_Predicted'] == "Predicted"
, it is meant to do a rolling mean average of the last 7 values and forward fill the ratio column, however, this keeps breaking after reaching 01/07/2025.
the way I've done this atm
is
df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan,df['ratio'] )
df['ratio'] = df['ratio'].fillna(df['ratio'].rolling(window=7, min_periods=1).mean())
I'm not sure why its breaking, but any help would be appreciated.
Upvotes: -1
Views: 108
Reputation: 1976
I tried the following using parsing Date
column while you read data as well as casting the columns\features properly otherwise you need to sort records and probably transform the date format:
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') # setting ignore as a parameter
# Replace 'your_file.txt' with the actual file name
df = pd.read_csv('dataset.txt', sep='\t', names=['true/false', 'Date', 'ratio', 'Actual_v_Predicted'], parse_dates=['Date'])[1:] # Skip the first row however skiprows=1 ruin the plot!!
# cast the features
df['true/false'] = df['true/false'].astype(bool)
df['ratio'] = pd.to_numeric(df['ratio'], errors='coerce')
df['Actual_v_Predicted'] = df['Actual_v_Predicted'].astype(str)
#df
# Assuming 'Actual_v_Predicted' column contains values like "Actual" and "Predicted"
predicted_data = df[df['Actual_v_Predicted'] == 'Predicted']
predicted_data['ratio'] = predicted_data['ratio'].rolling(window=7, min_periods=1).mean().ffill()
# Update the original DataFrame with the modified 'ratio' values
#df.update(predicted_data[['ratio']])
plt.figure(figsize=(12, 6)) # Adjust the figure size as needed
# Plot the original data
plt.plot(df[df['Actual_v_Predicted'] == 'Actual']['Date'], df[df['Actual_v_Predicted'] == 'Actual']['ratio'], label='Actual', marker='o', linestyle='-', color='blue')
plt.plot(df[df['Actual_v_Predicted'] == 'Predicted']['Date'], df[df['Actual_v_Predicted'] == 'Predicted']['ratio'], label='Predicted (Original)', marker='x', linestyle='--', color='red')
# Plot the predicted_data with rolling average
plt.plot(predicted_data['Date'], predicted_data['ratio'], label='Predicted (Rolling Avg)', marker='o', markerfacecolor='none', markersize=10, linestyle='--', color='green')
# Filter rows where 'ratio' is NaN
nan_rows = df[df['ratio'].isnull()]
# Fill NaN values with the mean of the 'ratio' column
nan_rows['ratio'].fillna(df['ratio'].mean(), inplace=True)
# Plot the NaN rows with the filled values just for show case
plt.plot(nan_rows['Date'], nan_rows['ratio']-20000, color='black', marker='v', label='Missing Values (NaN)', linestyle='dotted')
plt.xlabel('Date')
plt.ylabel('Ratio')
plt.title('Actual vs Predicted Ratio with Rolling Average of the last 7 values')
plt.xticks(rotation=45) # Rotate x-axis labels by 45 degrees
plt.legend()
plt.show()
Note: The disconnection between the red and blue parts is that I tried to differentiate ratio
values with respect to the last column values (Actual
or Predicted
) with 2 colors otherwise it'd be continuous.
Upvotes: 0
Reputation: 1965
There are several things you might need to tackle:
df = df.sort_values(by='Date')
The rolling mean is computed on previous values within a defined window. If all previous values are NaN
, the rolling mean will also produce NaN
. You may need to forward-fill after applying the rolling mean.
So, instead of directly computing the rolling mean and assigning it back, you can try forward-filling explicitly for missing values after rolling mean:
df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])
# Compute rolling mean only on 'Actual' rows (already handled by np.where)
rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()
# Fill in 'Predicted' rows using rolling mean and then forward fill if needed
df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
After 01/07/2025, if there are no "Actual" values to base the rolling mean on, you'll get NaN
. In such cases, you may need to handle long gaps or future dates more carefully. One option is to always fill missing values forward:
df['ratio'] = df['ratio'].ffill()
I have tried the following code with your dataset and it executed with no issues:
import pandas as pd
import numpy as np
df = pd.read_csv('dataset.txt', sep='\t')
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df = df.sort_values(by='Date')
df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])
rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()
df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()
df.to_csv('updated_file.txt', sep='\t', index=False)
print(df)
Upvotes: 0