Reputation: 6025
So my data looks like:
year, y, x1, x2, x3
2009, 0.5, 0.4, 0.4, 0.9
2013, nan, 0.4, 0.5, 0.8
2020, 0.8, 0.39, 0.51, 0.7
The data is year-wise, but the interval between years is not consistent. The value of y depends both on time and on the features. In some cases y is missing, and those are the values I need most. Other features can be missing too, but mostly they are all present. I have tried imputing with the df.interpolate()
function, but the interpolated values do not fit well within the interval in most cases. I have tried ARIMA, LSTM and others, but they do not take the input features into account. I have also considered regression techniques, but they do not incorporate the time-series nature of the data.
So what is the best approach for this case? i.e.
How do I impute time-series values based on input features?
Upvotes: -1
Views: 1848
Reputation: 579
You can use any of the following:
import pandas as pd
import matplotlib.pyplot as plt

# Forward fill: carry the last observed value forward
y_imputed = df['col'].ffill()

# Rolling-statistics imputation preserves the temporal dependencies
# in the data, which is beneficial for modelling time series
y_imputed = df['col'].fillna(df['col'].rolling(window=4, min_periods=1).mean().shift(1))

# Linear interpolation
y_imputed = df['col'].interpolate(method='linear', limit_direction='forward')

# Spline interpolation (order 2)
y_imputed = df['col'].interpolate(method='spline', limit_direction='forward', order=2)

plt.plot(df['ts'], y_imputed, 'bo-')
plt.xlabel('ts')
plt.ylabel('res')
plt.show()
see description here
Upvotes: 0
Reputation: 1561
You can turn this into a regression problem: create a target using a lead function, and add a variable [periods_ahead] that records how many periods ahead the target is. Then use periods_ahead
as an input feature in the regression model. The downside is that you also need to add lagged features or create time differences (target transformations) to make your target stationary manually, instead of relying on a time-series algorithm.
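A minimal sketch of the lead-target construction described above, using only pandas and the question's sample data (extra feature columns omitted for brevity; since the yearly spacing is irregular, periods_ahead is measured in years rather than rows):

```python
import pandas as pd

# Sample data from the question (x2, x3 omitted for brevity).
df = pd.DataFrame({
    "year": [2009, 2013, 2020],
    "y":    [0.5, None, 0.8],
    "x1":   [0.4, 0.4, 0.39],
})

# One training row per (origin, horizon) pair: features taken at the
# origin year, target taken `periods_ahead` years later.
rows = []
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if pd.notna(df.loc[j, "y"]):
            rows.append({
                "x1": df.loc[i, "x1"],
                "periods_ahead": df.loc[j, "year"] - df.loc[i, "year"],
                "target": df.loc[j, "y"],
            })
train = pd.DataFrame(rows)
# `train` can now be fed to any ordinary regressor, with
# `periods_ahead` acting as the explicit time feature.
```

Rows whose target is missing are simply never generated, so the missing-y problem disappears from the training set.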
Upvotes: 1
Reputation: 815
Interesting question; there is no single rule or definitive answer to your problem.
It seems you would like to predict points t+1 through t+n, where t is your last known point.
If so, you need to remove the unknown target values (rows where y is NaN), but doing so you will lose some important information. One way around this is to create two models: one for data imputation, to fill the unknown values of y, and a second for forecasting future values of y.
The first model may be an autoencoder, where the features represent the current time. In other words: given n features, predict y, where the features and y come from the same time t (the same row).
The second model predicts the future (forecasting): after imputing the missing y values, predict t+n for n in {1, ..., +inf}.
Another good approach to dealing with missing values is to create three models instead of two.
The first is the imputation model described above.
After filling the missing target values, feed the new matrix into a second autoencoder.
Use the hidden state of the second AE as input to a third model. This way the data may still contain missing values, but the AE provides a compressed representation of them that is well suited to predicting the future.
The best architecture varies from problem to problem; in your case you might get a good final model simply by dropping the missing target values.
One adjustment that may be necessary is imputing the missing feature values, but I would try leaving them missing before adding noise. If needed, you can fill them with the mean, median, min or max of a rolling window (pandas' rolling method).
Upvotes: 2
Reputation: 98
Did you think about blending feature-based and time-based approaches? You can, for example, train a linear regression on the non-missing values to get feature coefficients for predicting the missing value, and use a simple/weighted moving average, ARIMA, LSTM, etc. for the time component. Then assign weights to the results of both to come up with a prediction that draws on both the features and the time series.
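A minimal sketch of this blending on the question's sample data, assuming index-based interpolation as the time component and NumPy least squares as the feature component (the 0.5 weight is an arbitrary placeholder you would tune on held-out data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2009, 2013, 2020],
    "y":    [0.5, np.nan, 0.8],
    "x1":   [0.4, 0.4, 0.39],
})

# Time-based estimate: interpolate y against the (irregular) year axis.
time_est = (df.set_index("year")["y"]
              .interpolate(method="index")
              .to_numpy())

# Feature-based estimate: least-squares fit of y on x1 over known rows.
known = df["y"].notna()
A = np.c_[np.ones(known.sum()), df.loc[known, "x1"].to_numpy()]
w, *_ = np.linalg.lstsq(A, df.loc[known, "y"].to_numpy(), rcond=None)
feat_est = np.c_[np.ones(len(df)), df["x1"].to_numpy()] @ w

# Weighted blend of the two estimates; fill only where y is missing.
alpha = 0.5  # blending weight, to be tuned on held-out data
blend = alpha * feat_est + (1 - alpha) * time_est
df["y_filled"] = df["y"].fillna(pd.Series(blend, index=df.index))
```

`method="index"` matters here: it interpolates proportionally to the year gaps (4 vs. 7 years) rather than treating the rows as equally spaced.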
Upvotes: 1