Reputation: 47
How do I create a linear regression model for time series data?
I removed the datetime column and fit an ordinary linear regression, but that gave an R-squared of -7. I have half-yearly data from 13H1 to 17H2.
df (first row, shown transposed):
Column                  Value
UID                     100130_Print HW
BaselineHalf            2013-12-31
Metric_Type             Print HW
Segment                 CANADA_PRINT_NAMED
rateadj_amount_usd      2212.060000
CPI_Inflation           3.036892
Exports                 5.99463
Fixed_Invstment         -1.890996
GDP                     3.885646
Govt_Growth             2.970826
Imports                 3.762586
Industrial_Production   4.716683
Merchandise_Exports     -3.32253
Merchandise_Imports     -2.444949
Nominal_Retail_Sales    10.148924
Private_Consumption     5.35529
Real_Retail_Sales       7.001484
WPI_Inflation           2.402204
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Keep only the rows for a single UID
df1 = df[df['UID'] == '100130_Print HW']

# Predictors: the macroeconomic indicators (the datetime column is left out)
x = df1[['CPI_Inflation', 'Exports', 'Fixed_Invstment', 'GDP', 'Govt_Growth',
         'Imports', 'Industrial_Production', 'Merchandise_Exports',
         'Merchandise_Imports', 'Nominal_Retail_Sales', 'Private_Consumption',
         'Real_Retail_Sales', 'WPI_Inflation']]
y = df1['rateadj_amount_usd']

# Random 80/20 split of the rows
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
coefficient_of_determination = r2_score(y_test, predictions)
Upvotes: 1
Views: 144
Reputation: 286
I see a general problem in your approach: you are trying to regress a time series, but you removed the time information and drew a randomized sample from the data (with train_test_split()). However, the data points are stochastically dependent. Surely the value in a given period depends to a very large extent on the previous period. The way you have set it up, the model cannot use this information.
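That is why your model performs so poorly, as the R-squared shows: a value below 0 means the predictions are worse than simply predicting the mean of the test targets. Try again with the time information kept in, for example by splitting the data chronologically instead of randomly and by giving the model access to values from earlier periods.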
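To make this concrete, here is a minimal sketch of one way to keep the time ordering. It is not the only approach and not necessarily the best one for your data. It uses invented, synthetic numbers in place of your real df1, the lag column name amount_lag1 is hypothetical, and the reduced set of predictors is just an assumption to keep the example small. The idea is to sort by BaselineHalf, add the previous period's value as a feature with shift(1), and hold out the most recent periods as the test set instead of a random sample.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for one UID: 10 half-year periods.
# All values below are invented for illustration only.
rng = np.random.default_rng(0)
dates = pd.to_datetime(
    [f"{year}-{month_day}" for year in range(2013, 2018)
     for month_day in ("06-30", "12-31")]
)
df1 = pd.DataFrame({
    "BaselineHalf": dates,
    "GDP": rng.normal(3.0, 1.0, size=10),
    "CPI_Inflation": rng.normal(2.0, 0.5, size=10),
})
# Target with a trend, so consecutive periods are related to each other.
df1["rateadj_amount_usd"] = (
    2000 + 50 * np.arange(10) + 20 * df1["GDP"] + rng.normal(0, 10, size=10)
)

# 1. Make sure the rows are in chronological order.
df1 = df1.sort_values("BaselineHalf").reset_index(drop=True)

# 2. Add the previous period's target as a feature (hypothetical column name).
df1["amount_lag1"] = df1["rateadj_amount_usd"].shift(1)
df1 = df1.dropna().reset_index(drop=True)  # the first period has no lag

# 3. Chronological split: train on the earlier periods, test on the latest.
features = ["GDP", "CPI_Inflation", "amount_lag1"]
n_test = 2
train, test = df1.iloc[:-n_test], df1.iloc[-n_test:]

lm = LinearRegression()
lm.fit(train[features], train["rateadj_amount_usd"])
predictions = lm.predict(test[features])

print("R^2 on the held-out periods:",
      r2_score(test["rateadj_amount_usd"], predictions))
```

With your real data you would skip the synthetic part and start from the sort step. Keep in mind that with only about ten periods per UID, a model with 13 predictors plus an intercept has more parameters than training rows, so it is worth trying a much smaller set of predictors as well.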
Upvotes: 1