user3646742

Reputation: 309

How to predict missing values in Python using linear regression with 3 years' worth of data

Hey guys, I have three years' worth of data, from 2012 to 2014, but the 2014 data has missing values (100 rows). I'm really not sure how to deal with this. Here is my attempt:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# 2012 values are the single feature; 2014 values (still containing the missing ones) are the target
X = red2012Mob.values.reshape(-1, 1)
y = red2014Mob.values.reshape(-1, 1)
# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

I haven't changed any of the 2014 data where values are missing; I just pass it directly to the model.

Upvotes: 0

Views: 2248

Answers (2)

Marcus V.

Reputation: 6859

There are two ways:

  • Drop the instances with missing data (e.g. using red2012Mob.dropna(), or if it is a time series, leave out complete blocks of missing data, e.g. start later in 2014).
  • Impute the missing data. Here, however, you won't get a one-size-fits-all answer, as it really depends on your data and your problem. Since you seem to have time series data, the simplest strategies for "small" holes are linear or constant interpolation. If time dependency is not so important, the mean of the column may be a good strategy. For larger holes you may need to find a suitable model to fill the data. Sometimes a "naive" strategy like reusing the value from one seasonal period earlier (e.g. last Monday's data for the current Monday) may work, or you can use a KNN imputer (either check out this sklearn PR or the package discussed here). For the simple strategies, there is also a module in the upcoming sklearn release.
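
To make the simple strategies above concrete, here is a minimal pandas sketch. The series, dates, and values are made up for illustration and are not the asker's data; which strategy is appropriate still depends on your problem.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a small two-day hole (values are made up).
idx = pd.date_range("2014-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0], index=idx)

dropped = s.dropna()                     # option 1: drop the missing rows
linear = s.interpolate(method="linear")  # option 2: linear interpolation
constant = s.ffill()                     # option 2: constant fill (carry last value forward)
mean_filled = s.fillna(s.mean())         # option 2: fill with the column mean

# "Naive" seasonal fill: reuse the value from 7 days earlier (same weekday).
weekly = pd.Series(
    np.arange(14, dtype=float),
    index=pd.date_range("2014-01-06", periods=14, freq="D"),
)
weekly.iloc[9] = np.nan
seasonal = weekly.fillna(weekly.shift(7))
```

Note that the seasonal fill can only fill a hole if the value one period earlier is itself present, so it usually needs a fallback for the first period.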

In practice I usually combine methods. For instance, up to some point I will try the strategies from the second point, but if the data is too bad, it is usually better to have less "good" data than a lot of imputed data.

Upvotes: 3

Rajat Mittal

Reputation: 36

I don't know if you have the 2013 data available. If you do, my first recommendation would be to use that as well. For training, you should take only the rows where the 2014 value is not missing and fit your model on those. Once you get a decent cross-validation score for the model, you can take the subset of rows where the 2014 value is missing and use the model to predict those values.

For better understanding, here is a small piece of sample code that keeps only the non-NaN values of a list/column a:

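import numpy as np
a = [1.0, np.nan, 3.0]  # example input; replace with your own list/column
a1 = [v for v in a if not np.isnan(v)]  # keep only the non-NaN values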
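
The whole workflow described above (fit on rows where 2014 is known, check with cross-validation, then predict the rows where 2014 is missing) could look roughly like the sketch below. The DataFrame, the column names y2012, y2013, y2014, and the synthetic data are all assumptions for illustration; substitute your own columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (column names are hypothetical).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"y2012": rng.normal(50, 10, n)})
df["y2013"] = 1.1 * df["y2012"] + rng.normal(0, 1, n)
df["y2014"] = 0.5 * df["y2012"] + 0.6 * df["y2013"] + rng.normal(0, 1, n)
df.loc[df.index[:20], "y2014"] = np.nan  # simulate the missing 2014 values

features = ["y2012", "y2013"]
known = df[df["y2014"].notna()]    # rows usable for training
missing = df[df["y2014"].isna()]   # rows whose 2014 value we want to predict

model = LinearRegression()
# 5-fold cross-validated R^2 on the rows with a known 2014 value
scores = cross_val_score(model, known[features], known["y2014"], cv=5, scoring="r2")

# Fit on all known rows, then fill in the missing 2014 values with predictions
model.fit(known[features], known["y2014"])
df.loc[missing.index, "y2014"] = model.predict(missing[features])
```

Only the target (2014) has gaps in this sketch; if the 2012 or 2013 columns also contain NaN, those rows would need to be dropped or imputed first, since LinearRegression does not accept NaN inputs.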

Upvotes: 1
