How to predict missing values in Python using linear regression with 3 years' worth of data

Question

Hey guys, I have three years' worth of data, from 2012 to 2014, but the 2014 data has missing values (100 rows). I'm really not sure how to deal with them. This is my attempt:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 2012 values as the feature, 2014 values as the target
X = red2012Mob.values.reshape(-1, 1)
y = red2014Mob.values.reshape(-1, 1)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

I'm not changing any of the 2014 data where values are missing; I just pass it directly to the model.

Rajat Mittal · Accepted Answer

I don't know whether you have the 2013 data available. If you do, my first recommendation would be to use it as an additional feature. As for training, you should take only the rows where the 2014 value is not missing and fit your model on those. Once you get a decent cross-validation score on the model (for example R² or mean squared error, since accuracy applies to classification rather than regression), you can take the rows where the 2014 value is missing and use the fitted model to predict those values.

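For better understanding, here is a small piece of sample code that keeps only the non-NaN values of a list/column:

import numpy as np
a = [1.0, np.nan, 3.0]  # example column with one missing value
a1 = [v for v in a if not np.isnan(v)]  # keep only the non-NaN values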
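
Putting the steps from the answer together, a minimal sketch might look like the following. It runs on synthetic stand-in data, since the real data is not shown: the names x2012, x2013 and y2014 are hypothetical, and it assumes the 2013 values are available and complete. It fits on the rows where 2014 is known, checks the fit with 5-fold cross-validation, and then predicts only the missing rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 500 rows, last 100 values of 2014 missing
rng = np.random.default_rng(0)
x2012 = rng.uniform(0, 100, size=500)
x2013 = x2012 + rng.normal(0, 5, size=500)
y2014 = 0.5 * x2012 + 0.6 * x2013 + rng.normal(0, 5, size=500)
y2014[-100:] = np.nan

# Use both earlier years as features
X = np.column_stack([x2012, x2013])

# Boolean mask marking the rows where the 2014 value is missing
missing = np.isnan(y2014)

# Cross-validate on the rows with a known 2014 value only
model = LinearRegression()
scores = cross_val_score(model, X[~missing], y2014[~missing], cv=5, scoring="r2")
print("mean cross-validated R^2:", scores.mean())

# Fit on all known rows, then fill in only the missing ones
model.fit(X[~missing], y2014[~missing])
y_filled = y2014.copy()
y_filled[missing] = model.predict(X[missing])
```

With real data, the three synthetic arrays would be replaced by the actual columns (for example red2012Mob.values and red2014Mob.values from the question); the known 2014 values are left untouched and only the NaN entries are overwritten.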
