user3646742

Reputation: 309

python - using linear regression to predict missing values

I have data from 2012-2014 with some missing months in 2014. I would like to predict those months using a linear regression model trained on the 2012/2013 data.

The 2014 data is missing June-August, where the value is ' ', so I clean it up using the following code. I also trim 2012 and 2013 to the same length as the cleaned 2014 data by cutting 20 rows:

data2014NaN=data2014['mob'].replace(' ', np.nan)
data2014CleanNaN = data2014NaN[data2014NaN.notnull()]
data2012[0:300]
data2013[0:300]

Then I train a linear regression model using both years as a training set.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = pd.concat([data2012[0:300], data2013[0:300]], axis=1, join='inner')
y = data2014CleanNaN.values
y = y.reshape(-1, 1)

# Split into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    random_state=4)
lm = LinearRegression()
lm.fit(X_train, y_train)
score = lm.score(X_test, y_test)
print("The prediction score on the test data is {:.2f}%".format(score*100))

However, the score I get is an abysmal 4.65%, and I'm not sure how to approach this problem. I assume I did something wrong when I cut down the data for 2012 and 2013.

Here I attached the data (this is just dummy data):

2014:
date       value
29/01/2014 10
30/01/2014 20
31/01/2014 15
1/02/2014  ' '


2012:
date       value
29/01/2014 15
30/01/2014 18
31/01/2014 19
1/02/2014  50

I'm only using the value column, and I'm not sure whether I'm heading in the right direction.

Best Regards

Upvotes: 0

Views: 1216

Answers (1)

yzq

Reputation: 1

It seems that your R^2 is not so good.
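For context, `lm.score` on a scikit-learn regressor returns the coefficient of determination R^2, so the 4.65% in the question corresponds to an R^2 of about 0.0465. The sketch below computes R^2 by hand with NumPy; the `r_squared` helper and the sample values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 15.0, 50.0])  # made-up values

# Predicting every point exactly gives R^2 = 1
perfect = r_squared(y_true, y_true)
# Always predicting the mean gives R^2 = 0
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))
```

An R^2 near 0 means the model does little better than always predicting the mean of `y`.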

Cubic spline interpolation might perform better than linear regression in this case.
In Python, it is available through this API:

import scipy.interpolate as st

source
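As a rough sketch of how the spline idea could be applied to a gap, the example below fits `scipy.interpolate.CubicSpline` to the known points and evaluates it at the missing positions. The series here is synthetic (a sine curve with a hole in it), not the data from the question, and the day index is an assumed stand-in for the dates.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Synthetic stand-in: one value per day, with a gap marked as NaN
days = np.arange(20, dtype=float)
values = np.sin(days / 3.0)
values[8:12] = np.nan

# Fit the spline on the observed points only
known = ~np.isnan(values)
spline = CubicSpline(days[known], values[known])

# Fill only the missing positions; observed values are left untouched
filled = values.copy()
filled[~known] = spline(days[~known])
```

This works best for short gaps in a smooth series. For a gap of several months, a spline has no information about what happened in between, so the result should be treated with caution.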

Also, if x is a timestamp and y is a value, you can try time series models such as AR or ARMA, or neural network methods such as RNN and LSTM.
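To make the AR idea concrete, here is a minimal sketch that fits an AR(p) model by ordinary least squares using only NumPy, rather than a dedicated time series library. The helper names `fit_ar` and `forecast_ar` and the toy series are assumptions for illustration.

```python
import numpy as np

def fit_ar(series, p):
    """Fit y[t] = c + a1*y[t-1] + ... + ap*y[t-p] by ordinary least squares."""
    series = np.asarray(series, dtype=float)
    # Column k holds y[t-k] for every t from p to the end of the series
    lags = [series[p - k : len(series) - k] for k in range(1, p + 1)]
    design = np.column_stack([np.ones(len(series) - p)] + lags)
    coef, *_ = np.linalg.lstsq(design, series[p:], rcond=None)
    return coef  # [c, a1, ..., ap]

def forecast_ar(history, coef, steps):
    """Roll the fitted model forward, feeding predictions back in as lags."""
    p = len(coef) - 1
    window = list(np.asarray(history, dtype=float)[-p:])
    out = []
    for _ in range(steps):
        nxt = coef[0] + sum(coef[k] * window[-k] for k in range(1, p + 1))
        out.append(nxt)
        window.append(nxt)
    return np.array(out)

# Toy series generated by y[t] = 1 + 0.5 * y[t-1], starting at 0
series = [0.0]
for _ in range(29):
    series.append(1.0 + 0.5 * series[-1])

coef = fit_ar(series, p=1)
pred = forecast_ar(series, coef, steps=3)
```

On real data the fit will not be exact, and the lag order `p` has to be chosen, for example by comparing errors on held-out data.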

An LSTM sample built with Keras:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# dataX: array of shape (samples, timesteps, features); dataY: targets; times: number of epochs
model = Sequential()
model.add(LSTM(units=5, activation='tanh', input_shape=dataX[0].shape, return_sequences=False))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mae', metrics=['mse'])
model.fit(dataX, dataY, epochs=times, batch_size=1, verbose=2, shuffle=False)
y_pred = model.predict(dataX)
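The snippet above does not show how `dataX` and `dataY` are built. One common layout, assumed here, is a sliding window: each sample holds the previous `timesteps` values and the target is the value that follows. The `make_windows` helper is hypothetical and uses only NumPy.

```python
import numpy as np

def make_windows(series, timesteps):
    """Turn a 1-D series into (samples, timesteps, 1) inputs and (samples, 1) targets."""
    series = np.asarray(series, dtype=float)
    n = len(series) - timesteps
    # Sample i is series[i : i + timesteps]; its target is series[i + timesteps]
    X = np.stack([series[i : i + timesteps] for i in range(n)])
    y = series[timesteps:]
    return X[..., np.newaxis], y.reshape(-1, 1)

# Toy series 0..9 with a window of 3 previous values
dataX, dataY = make_windows(np.arange(10), timesteps=3)
```

With this layout, `dataX[0].shape` is `(timesteps, 1)`, which is the form an LSTM layer expects for its `input_shape`.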

Upvotes: 0
