Reputation: 33
New to Python and trying to complete a third-order polynomial regression on some data. When I use polynomial regression I don't get the fit I am expecting, and I am trying to understand why the polynomial regression in Python is worse than in Excel. When I fit the same data in Excel I get a coefficient of determination of ≈0.95 and the plot looks like a third-order polynomial. However, using scikit-learn it is ≈0.78 and the fit almost looks linear. Is this happening because I do not have enough data? Also, does having x as datetime64[ns] type on my x-axis affect the regression? The code runs, so I am not sure if this is a coding problem or some other problem.
I am using Anaconda (Python 3.7) and running the code in Spyder.
import operator
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
#import data
data = pd.read_excel(r'D:\Anaconda\Anaconda\XData\data.xlsx', skiprows = 0)
x=np.c_[data['Date']]
y=np.c_[data['level']]
#regression
polynomial_features= PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
#check regression stats
rmse = np.sqrt(mean_squared_error(y,y_poly_pred))
r2 = r2_score(y,y_poly_pred)
print(rmse)
print(r2)
#plot
plt.scatter(x, y, s=10)
# sort the values of x before line plot
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x,y_poly_pred), key=sort_axis)
x, y_poly_pred = zip(*sorted_zip)
plt.plot(x, y_poly_pred, color='m')
plt.show()
Upvotes: 2
Views: 1333
Reputation: 2936
The problem is in using the datetime64[ns] type on the x-axis. There is an issue on GitHub about how datetime64[ns] is handled inside sklearn. The thing is that datetime64[ns] values are nanoseconds since the epoch, so the features end up on the order of 10¹⁸ in this case:
x_poly
Out[91]:
array([[1.00000000e+00, 1.29911040e+18, 1.68768783e+36, 2.19249281e+54],
[1.00000000e+00, 1.33617600e+18, 1.78536630e+36, 2.38556361e+54],
[1.00000000e+00, 1.39129920e+18, 1.93571346e+36, 2.69315659e+54],
[1.00000000e+00, 1.41566400e+18, 2.00410456e+36, 2.83713868e+54],
[1.00000000e+00, 1.43354880e+18, 2.05506216e+36, 2.94603190e+54],
[1.00000000e+00, 1.47061440e+18, 2.16270671e+36, 3.18050764e+54],
[1.00000000e+00, 1.49670720e+18, 2.24013244e+36, 3.35282236e+54],
[1.00000000e+00, 1.51476480e+18, 2.29451240e+36, 3.47564662e+54],
[1.00000000e+00, 1.57610880e+18, 2.48411895e+36, 3.91524174e+54]])
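To see where these magnitudes come from: a datetime64[ns] value is just a count of nanoseconds since the Unix epoch (1970-01-01), so converting it to an integer yields a number around 10¹⁸. A quick sketch, using a sample date for illustration:

```python
import pandas as pd

# a datetime64[ns] value is nanoseconds since 1970-01-01 (the Unix epoch)
s = pd.to_datetime(pd.Series(['2011-03-01']))
ns = s.astype('int64').iloc[0]
print(ns)  # 1298937600000000000, i.e. ~1.3e18
```

Cubing a number of that size gives ~10⁵⁴, which is exactly what shows up in the third column of x_poly above.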
The easiest way to handle it is to use StandardScaler, or to convert the datetime with pd.to_numeric and scale it yourself:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(np.c_[data['Date']])
or simply
x_scaled = np.c_[pd.to_numeric(data['Date'])] / 10e17 # convert and scale (10e17 == 1e18)
That gives appropriately scaled features:
x_poly = polynomial_features.fit_transform(x_scaled)
x_poly
Out[94]:
array([[1. , 1.2991104 , 1.68768783, 2.19249281],
[1. , 1.336176 , 1.7853663 , 2.38556361],
[1. , 1.3912992 , 1.93571346, 2.69315659],
[1. , 1.415664 , 2.00410456, 2.83713868],
[1. , 1.4335488 , 2.05506216, 2.9460319 ],
[1. , 1.4706144 , 2.16270671, 3.18050764],
[1. , 1.4967072 , 2.24013244, 3.35282236],
[1. , 1.5147648 , 2.2945124 , 3.47564662],
[1. , 1.5761088 , 2.48411895, 3.91524174]])
EDIT: keep your original x for the plot. To make predictions you should apply the same transformations to the features you want to predict on. The result will look like this afterwards:
x = np.c_[data['Date']]
x_scaled = np.c_[pd.to_numeric(data['Date'])] / 10e17 # convert and scale
polynomial_features = PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x_scaled)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
# test to predict
s_test = pd.to_datetime(pd.Series(['1/1/2013', '5/5/2019']))
x_test = np.c_[s_test]
x_poly_test = polynomial_features.transform(np.c_[pd.to_numeric(s_test)] / 10e17)
y_test_pred = model.predict(x_poly_test)
plt.scatter(x, y, s=10)
# plot predictions as red dots
plt.scatter(x_test, y_test_pred, s=10, c='red')
plt.plot(x, y_poly_pred, color='m')
plt.show()
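Putting it together, here is a self-contained sketch on synthetic data (the dates, coefficients, and noise level are made up for illustration) showing that once the datetime features are scaled down, the degree-3 fit recovers a cubic signal with R² close to 1:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# synthetic dates and a cubic signal in the scaled coordinate
dates = pd.date_range('2011-01-01', periods=40, freq='30D')
x_scaled = np.c_[dates.astype('int64')] / 1e18   # nanoseconds -> ~1.3
rng = np.random.default_rng(0)
y = 2 - 3 * x_scaled + 4 * x_scaled**2 - x_scaled**3
y += rng.normal(0, 0.001, size=y.shape)          # small noise

x_poly = PolynomialFeatures(degree=3).fit_transform(x_scaled)
model = LinearRegression().fit(x_poly, y)
print(r2_score(y, model.predict(x_poly)))        # close to 1.0
```

With the raw nanosecond values instead of x_scaled, the x³ column would be ~10⁵⁴ and the least-squares solve becomes badly conditioned, which is why the original fit looked almost linear.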
Upvotes: 2