Reputation: 309
Hey guys, I have three years' worth of data, from 2012 to 2014, but the 2014 data has missing values (100 rows). I'm really not sure how to deal with them. This is my attempt:
X = red2012Mob.values
y = red2014Mob.values
X = X.reshape(-1,1)
y = y.reshape(-1,1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I'm not changing any of the 2014 data where values are missing; I just pass it directly to the model.
Upvotes: 0
Views: 2248
Reputation: 6859
There are two ways:
Drop the missing data (e.g. red2012Mob.dropna(), or if it is a time series, leave out complete blocks of missing data, e.g. start later in 2014).
Impute the missing values.
In practice I usually combine methods. For instance, up to some point I will try strategies from the second point, but if the data is too bad it is usually better to have less "good" data than a lot of imputed data.
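As a rough sketch of dropping versus imputing, assuming the data sits in a pandas DataFrame: the column names mob2012 and mob2014 and the values are made up for illustration, and mean imputation is only one simple choice among many.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the 2012 and 2014 columns from the question.
df = pd.DataFrame({
    "mob2012": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mob2014": [1.5, np.nan, 3.5, np.nan, 5.5],
})

# Option 1: drop every row where the 2014 value is missing.
dropped = df.dropna(subset=["mob2014"])

# Option 2: impute the missing 2014 values, here with the column mean.
imputed = df.fillna({"mob2014": df["mob2014"].mean()})

print(len(df), len(dropped))               # rows before and after dropping
print(imputed["mob2014"].isna().sum())     # missing values left after imputing
```

Dropping keeps only observed values but shrinks the dataset; imputing keeps every row but adds values that were never measured, which is why less "good" data can beat a lot of imputed data.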
Upvotes: 3
Reputation: 36
I don't know if you have the 2013 data available. If you do, my first recommendation would be to use it as well. As far as training data goes, you should take only the 2014 rows with non-missing values and fit your model on those. Once you get a decent cross-validation score on the model, you can take the subset of rows where the 2014 value is missing and use the model to predict those 2014 values.
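For better understanding, here is a small piece of sample code that keeps only the non-NaN values of a list/column:
import numpy as np
a = [1.0, np.nan, 3.0]  # example input; replace with your own list/column
a1 = [v for v in a if not np.isnan(v)]  # keep only the non-NaN values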
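The full workflow described above (fit on the rows where 2014 is known, check with cross-validation, then predict the missing rows) could look roughly like this. The DataFrame, the column names mob2012 and mob2014, and the synthetic values are assumptions for illustration, and only 2012 is used as a predictor to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 2014 is exactly 2 * 2012 + 1, with two 2014 values missing.
df = pd.DataFrame({"mob2012": np.arange(10, dtype=float)})
df["mob2014"] = 2.0 * df["mob2012"] + 1.0
df.loc[[2, 7], "mob2014"] = np.nan

# Split into rows where the 2014 target is known and rows where it is missing.
known = df[df["mob2014"].notna()]
missing = df[df["mob2014"].isna()]

# Cross-validate on the known rows only (negative MAE: closer to 0 is better).
model = LinearRegression()
scores = cross_val_score(
    model, known[["mob2012"]], known["mob2014"],
    cv=4, scoring="neg_mean_absolute_error",
)
print(scores.mean())

# Fit on all known rows, then predict the missing 2014 values.
model.fit(known[["mob2012"]], known["mob2014"])
predicted = model.predict(missing[["mob2012"]])

# Write the predictions into a copy so the original NaNs stay visible in df.
filled = df.copy()
filled.loc[missing.index, "mob2014"] = predicted
```

If 2013 is available, it can be added as a second feature column next to mob2012 in the same way.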
Upvotes: 1