Reputation: 6025
So my data looks like:
year, y, x1, x2, x3
2009, 0.5, 0.4, 0.4, 0.9
2013, nan, 0.4, 0.5, 0.8
2020, 0.8, 0.39, 0.51, 0.7
The data is year-wise, but the interval between years is not consistent. The value of y depends both on time and on the features. In some cases y is missing, and those are the values I need most. Other features can be missing too, but mostly they are all present. I have tried imputing with the df.interpolate()
function, but the interpolated values do not fit well within the interval in most cases. I have tried ARIMA, LSTM and others, but they do not take the input features into account. I have also considered regression techniques, but they do not incorporate the time-series nature of the data.
So what is the best approach for this case? i.e.
How do I impute time-series values based on input features?
Upvotes: -1
Views: 1848
Reputation: 579
You can use any of the following:
import pandas as pd
import matplotlib.pyplot as plt

# Forward fill: carry the last observed value forward
y_imputed = df['col'].ffill()

# Rolling-statistics imputation preserves the temporal dependencies
# in the data, which is beneficial for modelling time series
y_imputed = df['col'].fillna(df['col'].rolling(window=4, min_periods=1).mean().shift(1))

# Linear interpolation
y_imputed = df['col'].interpolate(method='linear', limit_direction='forward')

# Spline interpolation (order 2)
y_imputed = df['col'].interpolate(method='spline', limit_direction='forward', order=2)

plt.plot(df['ts'], y_imputed, 'bo-')
plt.xlabel('ts')
plt.ylabel('res')
plt.show()
see description here
Upvotes: 0
Reputation: 1561
You can turn this into a regression problem: create a target using a lead function, and add a variable [periods_ahead] that records how many periods ahead the target is. Then use periods_ahead
as an input feature in the regression model. The downside is that you also need to add lagged features or create time differences (target transformations) to make your target stationary manually, instead of relying on a time-series algorithm.
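A minimal sketch of the lead-target construction described above, using only pandas and the question's sample data (extra feature columns omitted for brevity; since the yearly spacing is irregular, periods_ahead is measured in years rather than rows):

```python
import pandas as pd

# Sample data from the question (x2, x3 omitted for brevity).
df = pd.DataFrame({
    "year": [2009, 2013, 2020],
    "y":    [0.5, None, 0.8],
    "x1":   [0.4, 0.4, 0.39],
})

# One training row per (origin, horizon) pair: features taken at the
# origin year, target taken `periods_ahead` years later.
rows = []
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if pd.notna(df.loc[j, "y"]):
            rows.append({
                "x1": df.loc[i, "x1"],
                "periods_ahead": df.loc[j, "year"] - df.loc[i, "year"],
                "target": df.loc[j, "y"],
            })
train = pd.DataFrame(rows)
# `train` can now be fed to any ordinary regressor, with
# `periods_ahead` acting as the explicit time feature.
```

Rows whose target is missing are simply never generated, so the missing-y problem disappears from the training set.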
Upvotes: 1
Reputation: 815
Interesting question; there is no single rule or definitive answer to your problem.
It seems you would like to predict points t+1 through t+n, where t is your last known point.
If so, you need to remove the unknown target values (rows where y is NaN), but doing so you will lose some important information. One way around this is to create two models: one for data imputation, to fill the unknown values of y, and a second for forecasting future values of y.
The first model may be an autoencoder, where the features represent the current time. In other words: given n features, predict y, where the features and y come from the same time t (the same row).
The second model predicts the future (forecasting): after imputing the missing y values, predict t+n for n in {1, ..., +inf}.
Another good approach to dealing with missing values is to create three models instead of two.
The first is the imputation model described above.
After filling the missing target values, feed the new matrix into a second autoencoder.
Use the hidden state of the second AE as input to a third model. This way the data may still contain missing values, but the AE provides a compressed representation of them that is well suited to predicting the future.
The best architecture varies from problem to problem; in your case you might get a good final model simply by dropping the missing target values.
One adjustment that may be necessary is imputing the missing feature values, but I would try leaving them missing before adding noise. If needed, you can fill them with the mean, median, min or max of a rolling window (pandas' rolling method).
Upvotes: 2
Reputation: 98
Did you think about blending feature-based and time-based approaches? You can, for example, train a linear regression on the non-missing values to get feature coefficients for predicting the missing value, and use a simple/weighted moving average, ARIMA, LSTM, etc. for the time component. Then assign weights to the results of both to come up with a prediction that draws on both the features and the time series.
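A minimal sketch of this blending on the question's sample data, assuming index-based interpolation as the time component and NumPy least squares as the feature component (the 0.5 weight is an arbitrary placeholder you would tune on held-out data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2009, 2013, 2020],
    "y":    [0.5, np.nan, 0.8],
    "x1":   [0.4, 0.4, 0.39],
})

# Time-based estimate: interpolate y against the (irregular) year axis.
time_est = (df.set_index("year")["y"]
              .interpolate(method="index")
              .to_numpy())

# Feature-based estimate: least-squares fit of y on x1 over known rows.
known = df["y"].notna()
A = np.c_[np.ones(known.sum()), df.loc[known, "x1"].to_numpy()]
w, *_ = np.linalg.lstsq(A, df.loc[known, "y"].to_numpy(), rcond=None)
feat_est = np.c_[np.ones(len(df)), df["x1"].to_numpy()] @ w

# Weighted blend of the two estimates; fill only where y is missing.
alpha = 0.5  # blending weight, to be tuned on held-out data
blend = alpha * feat_est + (1 - alpha) * time_est
df["y_filled"] = df["y"].fillna(pd.Series(blend, index=df.index))
```

`method="index"` matters here: it interpolates proportionally to the year gaps (4 vs. 7 years) rather than treating the rows as equally spaced.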
Upvotes: 1